
MSc Artificial Intelligence

Master Thesis

Auto-Encoding Variational

Neural Machine Translation

by

Bryan Eikema

10422803

September 13, 2018

36 EC March – September, 2018

Supervisor:

Dr. W. Ferreira Aziz

Assessors:

Dr. E. Gavves

Prof. Dr. K. Sima’an


Abstract

Translation data is often a byproduct of mixing different sources of data. This can be intentional, such as when mixing data from different domains or including back-translated monolingual data, but it is often also the result of how the bilingual dataset was constructed: a combination of different documents independently translated in different translation directions, by different translators, agencies, etc. Most neural machine translation models do not explicitly account for such variation in their probabilistic model. We attempt to model this by proposing a deep generative model that generates source and target sentences jointly from a shared sentence-level latent representation. The latent representation is designed to capture variations in the data distribution and allows the model to adjust its language and translation model accordingly. We show that such a model leads to superior performance over a strong conditional neural machine translation baseline in three settings: in-domain training, where the training and test data are of the same domain; mixed-domain training, where we train on a mix of domains and test on each domain separately; and in-domain training where we also include synthetic (noisy) back-translated data. We furthermore extend the model to a semi-supervised setting in order to incorporate target monolingual data during training. In doing so, we derive the commonly employed back-translation heuristic in the form of a variational approximation to the posterior over the missing source sentence. This allows the back-translation network to be trained jointly with the rest of the model on a shared objective designed for source-to-target translation, with minimal need for pre-processing. We find that the performance of this approach is not on par with the back-translation heuristic, but it does lead to improvements over a model trained on bilingual data alone.


Acknowledgments

I would like to thank Wilker Aziz for his excellent supervision and for making this thesis possible. His supervision and enthusiasm kept me on track and kept me motivated. Furthermore, I would like to give a special thanks to Philip Schulz for our fruitful discussions that set the starting point for this thesis. I am also very grateful to Joost Bastings for his excellent practical advice on deep learning for NLP. I thank Iacer Calixto, Miguel Rios, and Khalil Sima’an for the feedback they have provided for this work and the work that preceded it. I would like to thank Efstratios Gavves for agreeing to be part of my defense committee. Finally, I thank my family, friends, and Iulia for their endless support and patience.


Contents

1 Introduction 4

1.1 Motivation . . . 4

1.2 Research Questions & Contributions . . . 5

1.3 Thesis Outline . . . 5

1.4 Notation . . . 5

2 Background 7

2.1 Neural Machine Translation . . . 7

2.1.1 Encoder-Decoder NMT . . . 8

2.1.2 Prediction . . . 9

2.1.3 Evaluation Metrics . . . 9

2.2 Variational Inference . . . 11

2.3 Variational Auto-Encoders . . . 12

3 Auto-Encoding Variational Neural Machine Translation 14

3.1 Model . . . 14

3.1.1 Parameter Estimation . . . 15

3.1.2 Statistical Considerations . . . 16

3.1.3 Prediction . . . 16

3.1.4 Less-Amortized Inference . . . 17

3.2 Experiments . . . 18

3.2.1 Prediction Experiments . . . 20

3.2.2 In-Domain Experiments . . . 20

3.2.3 Mixed-Domain Experiments . . . 20

3.2.4 Back-Translation Experiments . . . 21

3.3 Analysis of the Latent Variable . . . 22

3.3.1 Predicting Quality Assessments . . . 22

3.3.2 Identifying Domains . . . 22

3.3.3 Recognizing Synthetic Data . . . 23

3.4 Related Work . . . 24

3.5 Discussion . . . 25

4 Semi-Supervised Learning 26

4.1 Semi-Supervised Joint Modeling . . . 26

4.1.1 A Naive Approach . . . 26

4.1.2 Using Variational Inference . . . 27

4.1.3 Discrete Relaxations . . . 28

4.2 A Continuous Model . . . 29

4.3 A Short Note on REINFORCE . . . 30

4.4 Experiments . . . 31

4.4.1 Experimental Setup . . . 31

4.4.2 A Comparison of Back-Translation Architectures . . . 32

4.4.3 Quality of Word Vectors . . . 32

4.4.4 Semi-Supervised Experiments . . . 33

4.4.5 Continuous Model Experiments . . . 33

4.5 Related Work . . . 33

4.6 Discussion . . . 34


A Definitions & Proofs 42

B Architecture Details 43

B.1 Source Language Model . . . 43

B.2 Translation Model . . . 44

B.3 Sentence Embedding Inference Model . . . 44

B.4 Source Inference Model . . . 45

C Validation Data Results 46


Chapter 1

Introduction

1.1 Motivation

Machine translation concerns the automatic translation of text from one language to another. Whereas machine translation has come a long way since its early rule-based days, it is still far from replacing human translators (Koehn and Knowles, 2017; Isabelle et al., 2017; Müller et al., 2018). In modern machine translation systems, deep learning is used to map variable-length sequences of words in one language to variable-length sequences of words in another language. Like any supervised deep learning task, large amounts of labeled data are required in order to train these systems effectively. For machine translation, these data consist of bilingual sentence pairs that carry the same meaning in the two languages involved in translation. Sources of these data are the European Parliament, whose lengthy debates are translated into all the official European Union languages by professional human translators (Koehn, 2005), movie subtitles that have been translated into multiple languages and whose sentences were automatically aligned (Tiedemann, 2009), spoken TED talks that have been subtitled in many languages by volunteers (Cettolo et al., 2014), news articles that have appeared in multiple languages (Bojar et al., 2017a), and many more.

Oftentimes, the data used to train a machine translation system are a byproduct of mixing different sources of data. Sometimes, datasets of different domains (Sennrich et al., 2017), or even languages (Johnson et al., 2017), are combined to obtain a larger training dataset. If resources are scarce, an effective way to obtain more training data is to have a translation system automatically generate a source side for target-side monolingual data, and to mix the resulting synthetic data with the real bilingual data (Sennrich et al., 2016a). Yet even a single dataset can be the result of combining documents translated by different translators, or even agencies, with various levels of language proficiency and potentially using different guidelines. Factors such as translation direction1 and translation quality are some of many factors that are typically not controlled for by machine translation systems.

Regular neural machine translation (NMT) systems do not explicitly model such latent factors in the data. Instead, they directly model a conditional distribution of target sentences given source sentences as a fully supervised problem. In this work we propose a deep generative model for neural machine translation that generates bilingual sentences from a shared latent space. This is a joint model over the source and target sentences that share a common continuous latent representation at the sentence level. The latent representation has the potential to model the various sources of variation in the data, such as whether the data are from domain A or from domain B, or whether the data are synthetic or real. This allows for a complex marginal likelihood that has the potential to model the differences in the data distributions, and potentially shields the model from over-exposure to noisy translation pairs (e.g. in the case of mixing in synthetic data). The latent representation may also capture smaller variations in the data, e.g. the translation direction and the proficiency of the translator, which could improve translation quality even on data of the same domain. We derive an efficient training algorithm for our model based on amortized variational inference using neural inference networks, making our model an instance of a variational auto-encoder.

A setting of particular interest is the inclusion of target monolingual data in training an NMT system. As mentioned before, an effective method to do this is to train a backwards translation system in a pre-processing step so as to obtain an automatically back-translated source side for the data. We would like to overcome this pre-processing step by using semi-supervised learning. Applying semi-supervised learning to our model using variational inference, we naturally obtain a back-translation system in the form of an approximate posterior over the missing source sentence. This system is trained jointly with the source-to-target translation model on a lower bound of the source-to-target likelihood. In order to deal with difficulties regarding the reparameterization of the discrete random variables (the missing source words), we explore the use of continuous relaxations of these discrete variables, as well as propose a model of continuous source observations that avoids the difficulties of reparameterizing discrete random variables altogether.

1Known as “translationese”, sentences translated from another language can be distinguished from sentences originally produced in that language.

1.2 Research Questions & Contributions

In this work we address the following research questions:

How can we make use of the fact that data is often a byproduct of mixing various sources of data?

In this thesis we model observations as being drawn from the marginal of a deep generative model. The latent space in this model is designed to capture sentence-level variation in the data, and can inform the language and translation model of such aspects of the data. We show superior performance over a strong conditional NMT baseline that does not statistically model such variation, in settings in which we either intentionally mix two distinct datasets or exploit smaller variations within a single dataset.

What kind of information does such a sentence-level latent representation capture?

We perform an analysis of the latent representation learned by our model. We look at how well the latent representation can be used to distinguish different distinct sources of the data, and how good the latent representation is for estimating the quality of observations. We find that the latent representation captures these aspects of the data, but we also show results that suggest that the latent representation is capturing more nuanced aspects of the data as well.

Can we derive the back-translation heuristic as part of a well understood optimization criterion, such as maximum-likelihood estimation?

In this work we use variational inference in a semi-supervised learning setting in order to derive a back-translation system that is trained by maximizing a lower bound on the joint model log-likelihood for the source-to-target translation direction. We show the difficulties in the training of such a model and find that, whereas our model is able to improve upon a model that trains only on bilingual data, we do not match the performance improvements obtained by employing the back-translation heuristic.

1.3 Thesis Outline

In Chapter 2 we start by giving a brief overview of the necessary background knowledge relevant to this work. We discuss neural machine translation from a probabilistic perspective, take a short look at common neural network architectures for neural machine translation, and discuss automatic evaluation methods. Furthermore, we look at variational inference and the specific case of variational auto-encoders. In Chapter 3 we propose a generative joint model for neural machine translation and devise an efficient training algorithm based on amortized variational inference. We perform experiments in in-domain, mixed-domain and synthetic data settings. We furthermore explore several methods for inferring the latent variable at prediction time. We also perform an analysis of the latent variable and look at its ability to distinguish between several data sources and its ability to predict sentence quality. We continue in Chapter 4 by taking a semi-supervised approach to derive a method to jointly train a back-translation system and a forward translation model using variational inference. We show the problems that occur in a model with discrete source sentences and propose an alternative that models continuous source observations instead. Finally, in Chapter 5 we conclude.

1.4 Notation

We use capital Roman letters for random variables, e.g. $Z$, and lowercase ones for assignments of random variables, e.g. $z$. For probability mass functions we use uppercase $P(\cdot)$ and for probability density functions we use lowercase $p(\cdot)$. We denote dependence on deterministic parameters as subscripts, e.g. $P_\theta(\cdot)$, or explicitly, e.g. $P(\cdot \mid \theta)$. Our observations are sequences $x_1^m = \langle x_1, \ldots, x_m \rangle$ and $y_1^n = \langle y_1, \ldots, y_n \rangle$ of random draws of length $m$ and $n$ respectively, where we will generally use $x_1^m$ for source sentences and $y_1^n$ for target sentences. We denote prefixes of sequences as $x_{<i}$, where the prefix is empty if $i = 1$. We will often denote expectations using a shorthand form in which we annotate the expectation with a short form of the relevant probability function, e.g. $\mathbb{E}_{P_X}[\cdot]$, where the dependencies of the probability function should be clear from context. We reserve boldface symbols, e.g. $\mathbf{h}$, for deterministic vectors and matrices, and deterministic vector-valued functions, e.g. $f_\theta(\cdot)$, typically a neural network.


Chapter 2

Background

In this chapter we give an overview of the models and techniques on which this work is built. In Section 2.1 we discuss neural machine translation from a probabilistic perspective, discuss the encoder-decoder architecture, and discuss some common automatic evaluation techniques for machine translation. In Section 2.2 we give an overview of variational inference and show why we need variational inference in general. Finally, in Section 2.3 we summarize variational auto-encoders and the necessary reparameterizations for efficient training using stochastic gradient methods.

2.1 Neural Machine Translation

Machine translation concerns the automatic translation of text from one language into another. Modern translation systems make use of machine learning to train models on large amounts of sentence-aligned bilingual data, also known as parallel data or a parallel corpus. From such a parallel corpus, statistical relations between the two languages are derived, such that the model can use these to search for the most likely translation of a sentence. A tiny example of a parallel corpus is shown in Fig. 2.1.

Neural machine translation (NMT) makes use of neural networks for this translation process. Because machine translation maps word sequences of arbitrary length in the source language to sequences of arbitrary length in the target language, the neural network architecture needs to be able to deal with that. Recurrent neural networks (RNNs) are therefore a common choice. NMT models (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2015; Cho et al., 2014a) are typically conditional language models. The target sentence is observed and modeled given the source sentence, which is observed but not modeled1. Let $X$ denote a random variable over the source vocabulary $V_x$, and $Y$ a random variable over the target vocabulary $V_y$. A source sentence is a random sequence $X_1^m$ and a target sentence a random sequence $Y_1^n$.

1By “modeled” we mean assumed to be drawn from a probability distribution with unknown parameters that we need to estimate.

Figure 2.1: A tiny example of an English-Romanian parallel corpus.

English:  On the 14th of March 1881, Romania became a kingdom.
Romanian: Pe data de 14 martie 1881, România devenea regat.

English:  Prince Carol the first had already gained the title of Royal Highness since 1878.
Romanian: Principele Carol I primise încă din anul 1878 titlul de Alteță Regală.

English:  The coronation of King Carol and Queen Elizabeth was held on the 10th of May in 1881.
Romanian: Încoronarea Regelui Carol și a Reginei Elisabeta s-a făcut la 10 mai 1881.

English:  Carol succeeded in making Romania an independent state.
Romanian: Carol a reușit să facă din România un stat independent.


Figure 2.2: The probabilistic model of conditional neural machine translation. $x_1^m$ represents the source sentence, $y_1^n$ the target sentence, and $|\mathcal{D}|$ the number of sentences in the dataset. Only the target sentence is modeled; the source sentence is observed but not modeled.

Conditional NMT models the following:

$Y_j \mid y_{<j}, x_1^m \sim \mathrm{Cat}(f_\theta(y_{<j}, x_1^m)) \quad (2.1)$

Here, $f_\theta(\cdot)$ is a neural network, e.g. a sequence-to-sequence architecture, that computes the parameters of the categorical distribution, namely a vector of probabilities, conditioning on the prefix and the source sentence. The neural network parameters $\theta$ are point-estimated for maximal conditional log-likelihood of the i.i.d. dataset $\mathcal{D}$ (see Eq. 2.2) using stochastic optimization methods such as mini-batch stochastic gradient ascent (Robbins and Monro, 1951; Bottou and Cun, 2004). The graphical model for NMT is shown in Fig. 2.2.

$\theta^* = \arg\max_\theta \sum_{(x_1^m, y_1^n) \in \mathcal{D}} \log P_\theta(y_1^n \mid x_1^m) \quad (2.2)$

2.1.1 Encoder-Decoder NMT

The parameterization of the function $f_\theta(\cdot)$ can be any kind of neural architecture, but a common choice is a set of encoder-decoder recurrent neural networks with an attention mechanism (Bahdanau et al., 2015; Luong et al., 2015). In such an architecture there is a recurrent encoder neural network that encodes a source sentence into a sequence of fixed-dimensionality vectors. The inputs to the RNN are embedded source words, where each word in the vocabulary $V_x$ has its own trainable embedding as a row in the embedding matrix $W_{\text{emb-}x}$. A more powerful parameterization uses a bidirectional RNN to encode the source sentence, i.e. two RNNs that read the source sentence in both directions, of which the outputs are concatenated to obtain the source sentence representation:

$e_i = \text{one-hot}(x_i) \cdot W_{\text{emb-}x} \quad (2.3)$
$h_1^m = \text{biRNN}(e_1^m) \quad (2.4)$

This source sentence representation $h_1^m$, consisting of $m$ real-valued vectors output by the bidirectional RNN encoder, is input to an RNN decoder with an attention mechanism. At each time step the decoder RNN considers the previous hidden state, the previous output word embedded using a target embedding matrix, and a context vector computed by the attention mechanism in order to compute a new hidden state and a new output word. This process is repeated until an end-of-sentence token is generated or some maximum number of words has been reached. The attention mechanism allows the RNN decoder to focus its attention onto specific words in the source sentence for each predicted target word individually. The attention mechanism computes normalized soft attention weights for each source word. For the $j$th target word, the soft attention weight for the $i$th source word is:

$\alpha_{ji} = \frac{\exp(\text{score}(h_i, s_{j-1}))}{\sum_{k=1}^{m} \exp(\text{score}(h_k, s_{j-1}))} \quad (2.5)$

where $s_{j-1}$ is the RNN decoder hidden state of the previous time step. There are many varieties of the scoring function, but we follow the original formulation of Bahdanau et al. (2015), which trains a single-layer feed-forward neural network to compute a score, jointly trained with the rest of the architecture. The context vector for time step $j$ is computed as:

$c_j = \sum_{i=1}^{m} \alpha_{ji} h_i \quad (2.6)$

We can now compute the parameters for the categorical distribution at time step $j$ using the decoder RNN as a function of the previous word $y_{j-1}$, the current hidden state $s_j$, and the context vector $c_j$.
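To make these computations concrete, the following sketch evaluates the attention weights of Eq. 2.5 and the context vector of Eq. 2.6 in NumPy. It assumes an additive (Bahdanau-style) scorer; the function name, shapes, and parameters are illustrative and not taken from the thesis implementation.

```python
import numpy as np

def attention_context(H, s_prev, v, W_h, W_s):
    """Bahdanau-style attention weights (Eq. 2.5) and context vector (Eq. 2.6).

    H      : (m, d_h)  encoder states h_1..h_m
    s_prev : (d_s,)    previous decoder state s_{j-1}
    v, W_h, W_s : parameters of a single-layer feed-forward scorer (assumed shapes)
    """
    # score(h_i, s_{j-1}) = v^T tanh(W_h h_i + W_s s_{j-1}), one common additive scorer
    scores = np.tanh(H @ W_h.T + s_prev @ W_s.T) @ v          # (m,)
    scores = scores - scores.max()                             # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()              # normalized weights alpha_{ji}
    c_j = alpha @ H                                            # c_j = sum_i alpha_{ji} h_i
    return alpha, c_j

# toy usage with random parameters
m, d_h, d_s, d_a = 5, 8, 8, 16
rng = np.random.default_rng(0)
H = rng.normal(size=(m, d_h))
s_prev = rng.normal(size=(d_s,))
v, W_h, W_s = rng.normal(size=(d_a,)), rng.normal(size=(d_a, d_h)), rng.normal(size=(d_a, d_s))
alpha, c = attention_context(H, s_prev, v, W_h, W_s)
print(alpha.sum(), c.shape)  # ~1.0, (8,)
```

In a full decoder these weights would be recomputed at every target time step $j$ using the current decoder state.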

Alternative Architectures

There are alternative architectures that are less centered around the use of recurrent neural networks. In Gehring et al. (2017) a convolutional neural network is used as the encoder. In Vaswani et al. (2017) a sequence-to-sequence architecture based solely on layers of feed-forward neural networks and a self-attention mechanism is used. We would like to note that this work is agnostic to architectural details. We choose an encoder-decoder RNN architecture for this work as it is a commonly used architecture and one we are familiar with.

2.1.2 Prediction

At prediction time, for a given input $x_1^m$, we would like to determine the following quantity:

$\arg\max_{y_1^n} P_\theta(y_1^n \mid x_1^m) \quad (2.7)$

However, given that this space is infinite, as we do not know the target sentence length $n$, we need to approximate it somehow. Even if we restrict the length to some maximum length $n_{\max}$, this space is still impractical to search through, as its size grows exponentially as $O(|V_y|^{n_{\max}})$. Therefore, usually a greedy algorithm is employed to search the space for some high-probability translation. The most common algorithm is known as beam-search decoding and expands a beam of fixed size $k$ that explores the $k$ highest-probability options at once (Sutskever et al., 2014). At every time step it expands the $k$ highest-probability options with $|V_y|$ options for the next target word, drops all but the $k$ currently most likely options, and repeats until all $k$ options have reached the maximum length or produced an end-of-sentence symbol. Usually a length penalty is also used that penalizes shorter sentences, to avoid the model preferring them because of their naturally higher probability. Note that from a statistical standpoint this is a decision rule, which has little to do with how the model is trained. The decision rule used here is known as maximum a posteriori (MAP) decoding (Smith, 2011); other decision rules, such as minimum Bayes-risk (MBR) decoding (Kumar and Byrne, 2004; Stahlberg et al., 2017), can also be employed.
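The following is a minimal beam-search sketch over a generic next-token scorer, only to illustrate the expand-and-prune loop described above; `next_log_probs`, the toy vocabulary, and the absence of a length penalty are simplifying assumptions rather than the setup used in the experiments.

```python
import numpy as np

EOS = 0  # end-of-sentence token id (assumed)

def beam_search(next_log_probs, k=3, max_len=10):
    """Keep the k highest-scoring prefixes at each step.

    next_log_probs(prefix) -> 1D array of log-probabilities over the vocabulary.
    Returns the highest-scoring finished hypothesis as (token list, score).
    """
    beam = [([], 0.0)]                 # (prefix, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beam:
            logp = next_log_probs(prefix)
            for tok in np.argsort(logp)[-k:]:            # expand with the top-k next tokens
                candidates.append((prefix + [int(tok)], score + float(logp[tok])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for prefix, score in candidates[:k]:             # prune to the k best
            (finished if prefix[-1] == EOS else beam).append((prefix, score))
        if not beam:
            break
    finished.extend(beam)              # unfinished hypotheses at max_len also count
    return max(finished, key=lambda c: c[1])

# toy scorer: a fixed unigram distribution over a 6-word vocabulary
rng = np.random.default_rng(1)
unigram = np.log(rng.dirichlet(np.ones(6)))
print(beam_search(lambda prefix: unigram, k=2, max_len=5))
```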

2.1.3 Evaluation Metrics

In order to evaluate a machine translation system we need to assess the quality of the translations that it produces. Ideally, the produced translations are assessed by human native speakers of the language. However, human evaluation is an obviously costly and time-consuming process. Automatic evaluation metrics are needed in order to tune and compare systems, and they allow for quick iterative development of machine translation models. This is done by comparing the translation candidates that some machine translation system produces with gold-standard reference translations produced by a human translator. There are several metrics that make such a comparison and assign a score to a translation candidate or a set of translation candidates; by far the most popular one is the BLEU metric (Papineni et al., 2002). BLEU was originally shown to correlate well with human evaluations, and thus is typically used as a stand-in for human evaluation when tuning and comparing models. BLEU has its flaws, however, and is not a perfect predictor of translation quality. Therefore, many other metrics have been developed that try to improve on BLEU (Banerjee and Lavie, 2005; Snover et al., 2006; Stanojević and Sima’an, 2014). We briefly describe the BLEU and BEER (Stanojević and Sima’an, 2014) metrics, as we will use those for the main results in this thesis. We do not intend to give a complete and detailed description of each algorithm, but instead aim to bring across the general idea of each metric along with an indication of how it is computed.


Figure 2.3: A latent variable model with latent variable $Z$ that explains observations $X$; $|\mathcal{D}|$ represents the number of data points.

BLEU

BLEU (Papineni et al., 2002), for bilingual evaluation understudy, is an automatic metric for machine translation based on n-gram precision. It counts the number of n-gram matches with a set of reference translations and divides it by the total number of n-grams. Combined for different values of $n$, typically between 1 and 4, it aims to account for both adequacy and fluency of translations. Because such a metric does not take any recall into account, short sentences containing a common n-gram, e.g. “of the”, would have 100% precision. To prevent this kind of “cheating” BLEU incorporates a term that penalizes sentences that are too short. This brevity penalty is computed as:

$\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ \exp\!\left(1 - \frac{r}{c}\right) & \text{if } c \leq r \end{cases} \quad (2.8)$

Here $c$ is the total length of the candidate translations in the entire corpus, and $r$ is the effective reference length: the sum of lengths of the reference sentences that most closely match the candidate translations in length. This is done so that candidate translations that match a reference translation exactly achieve a maximum BLEU score. The BLEU score for a corpus of candidate translations and references is then computed as:

$\mathrm{BLEU} = \mathrm{BP} \cdot \exp\!\left(\frac{1}{N} \sum_{n=1}^{N} \log p_n\right) \quad (2.9)$

where $p_n$ are the n-gram precisions and $N$ is the maximum order of n-grams that we consider. Note that we take the geometric mean of the n-gram precisions, which becomes 0 if any single precision goes to 0. The brevity penalty can also be very punishing for length deviations on short sentences. Therefore, BLEU is only defined at the document or corpus level.
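To make the computation concrete, here is a minimal corpus-level BLEU sketch in plain Python (single reference per sentence, no smoothing); it is only meant to mirror Eqs. 2.8 and 2.9, not to replace standard tools such as multi-bleu or sacreBLEU.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(candidates, references, N=4):
    """Corpus-level BLEU: clipped n-gram precisions, geometric mean, brevity penalty."""
    matches, totals = [0] * N, [0] * N
    c_len = r_len = 0
    for cand, ref in zip(candidates, references):
        c_len += len(cand)
        r_len += len(ref)   # single reference, so it is also the closest in length
        for n in range(1, N + 1):
            cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
            totals[n - 1] += sum(cand_counts.values())
    if min(matches) == 0:   # the geometric mean collapses to 0 (Eq. 2.9)
        return 0.0
    log_precisions = [math.log(m / t) for m, t in zip(matches, totals)]
    bp = 1.0 if c_len > r_len else math.exp(1.0 - r_len / c_len)   # Eq. 2.8
    return bp * math.exp(sum(log_precisions) / N)                  # Eq. 2.9

cand = ["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
ref = ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"]
print(round(corpus_bleu([cand], [ref]), 3))
```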

BEER

The BEER metric (Stanojević and Sima’an, 2014), for better evaluation as ranking, is a sentence-level metric that uses a set of dense features to train a linear binary classifier on human judgments data (Macháček and Bojar, 2013). The trained model is used to score translation candidates against reference translations. BEER does not make use of n-gram word-level features as they tend to be very sparse, especially for higher-order n-grams. Instead, BEER uses a collection of predefined features that are more dense, divided into adequacy-targeted and fluency-targeted features. The adequacy features include precision, recall, and F1-score on matched function words and content words separately, on matched words of any type, and on character-level n-grams. BEER also includes ordering features based on permutation trees to account for fluency. BEER, and other character-level metrics that have followed since, have been shown to correlate very well with human judgments (Machacek and Bojar, 2014; Stanojević et al., 2015; Bojar et al., 2016b, 2017b).


2.2 Variational Inference

Consider the latent variable model shown in Fig. 2.3. Here, $\mathcal{D} = \{x^{(n)}\}_{n=1}^{|\mathcal{D}|}$ is some observed data that we model as draws from a random variable $X$, whereas $Z$ is some unobserved latent factor that we believe to explain the observed data. We would like to make a maximum likelihood estimate of the parameters $\theta$. As we only have observations for $X$, we maximize the marginal log-likelihood $\log p_\theta(x)$. Assuming independent and identically distributed data, the marginal log-likelihood factorizes as $\log p_\theta(x) = \sum_n \log p_\theta(x^{(n)})$. To avoid clutter we will therefore focus on a single data point and drop the superscript. The marginal likelihood for an observed data point $x$ then is:

$p_\theta(x) = \int p_\theta(x \mid z)\, p_\theta(z)\, \mathrm{d}z \quad (2.10)$

For most choices of $p_\theta(x \mid z)$ and $p_\theta(z)$, this is intractable to compute. We would often also like to do posterior inference for the latent variable2. The posterior over the latent variable is:

$p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p_\theta(z)}{p_\theta(x)} \quad (2.11)$

We thus need the intractable marginal likelihood to compute the posterior over the latent variable, making the posterior intractable as well. Variational inference (Jordan et al., 1999; Beal, 2003) deals with this by introducing a trainable function $q_\phi(z)$, parameterized by variational parameters $\phi$, that is a valid probability distribution over the support of $Z$ and is cheap to evaluate and sample from. Using this function, we can derive a lower bound on the marginal log-likelihood. For some data point $x$, this lower bound is derived as:

$\log p_\theta(x) = \log \int p_\theta(x \mid z)\, p_\theta(z)\, \mathrm{d}z \quad (2.12a)$
$= \log \int \frac{q_\phi(z)}{q_\phi(z)}\, p_\theta(x \mid z)\, p_\theta(z)\, \mathrm{d}z \quad (2.12b)$
$= \log \mathbb{E}_{q_Z}\!\left[\frac{p_\theta(x \mid z)\, p_\theta(z)}{q_\phi(z)}\right] \quad (2.12c)$
$\overset{\text{JI}}{\geq} \mathbb{E}_{q_Z}\!\left[\log \frac{p_\theta(x \mid z)\, p_\theta(z)}{q_\phi(z)}\right] \quad (2.12d)$
$= \mathbb{E}_{q_Z}\!\left[\log \frac{p_\theta(z)}{q_\phi(z)}\right] + \mathbb{E}_{q_Z}[\log p_\theta(x \mid z)] \quad (2.12e)$
$= -\mathrm{KL}(q_\phi(z) \,\|\, p_\theta(z)) + \mathbb{E}_{q_Z}[\log p_\theta(x \mid z)] \quad (2.12f)$
$= \mathcal{L}(\theta, \phi \mid x) \quad (2.12g)$

Here we have made use of the fact that the logarithm is a strictly concave function combined with Jensen's inequality (also see Appendix A). This lower bound on the log-likelihood, $\mathcal{L}(\theta, \phi \mid x)$, is called the evidence lower bound (ELBO). It consists of an expected conditional log-likelihood term minus a KL divergence that measures the distance between the introduced function $q_\phi(z)$ and the prior over the latent variable (also see Appendix A). The ELBO is usually tractable to estimate without bias, as we can always approximate the expectation as well as the KL divergence by sampling from $q_\phi(z)$. The goal now is to do variational optimization of the function $q_\phi(z)$ to maximize the ELBO. We can show that doing this is equivalent to minimizing the KL divergence between $q_\phi(z)$ and the posterior of the latent variable:

$\mathrm{KL}(q_\phi(z) \,\|\, p_\theta(z \mid x)) = \mathbb{E}_{q_Z}\!\left[\log \frac{q_\phi(z)}{p_\theta(z \mid x)}\right] \quad (2.13a)$
$= \mathbb{E}_{q_Z}[\log q_\phi(z) - \log p_\theta(z \mid x)] \quad (2.13b)$
$= \mathbb{E}_{q_Z}\!\left[\log q_\phi(z) - \log \frac{p_\theta(x \mid z)\, p_\theta(z)}{p_\theta(x)}\right] \quad (2.13c)$
$= \mathbb{E}_{q_Z}\!\left[\log \frac{q_\phi(z)}{p_\theta(z)}\right] - \mathbb{E}_{q_Z}[\log p_\theta(x \mid z)] + \log p_\theta(x) \quad (2.13d)$
$= -\mathcal{L}(\theta, \phi \mid x) + \log p_\theta(x) \quad (2.13e)$

Note that $\log p_\theta(x)$ is not a function of $q_\phi(\cdot)$, and can be considered constant when optimizing $q_\phi$. We can thus see that maximizing the ELBO is equivalent to minimizing $\mathrm{KL}(q_\phi(z) \,\|\, p_\theta(z \mid x))$. Therefore $q_\phi(z)$ is often referred to as a variational approximation to the true posterior, or as an approximate posterior. A visualization of variational inference is shown in Fig. 2.4. By maximizing the ELBO, the function $q_\phi(z)$ as parameterized by $\phi$ is optimized to be as close to $p_\theta(z \mid x)$ as it can get, as measured by the KL divergence, limited only by the power of the parameterization of $q_\phi(z)$.

Figure 2.4: Maximizing the ELBO is equivalent to minimizing the KL between $q_\phi(z)$ and $p_\theta(z \mid x)$ by optimizing the variational parameters $\phi$. We start from an initial set of parameters $\phi_{\text{init}}$ and minimize this KL to get $q_\phi(\cdot)$ closer to the true posterior. How close we can get is limited by the parameterization of $q_\phi(\cdot)$. The original version of this illustration is from the NIPS 2016 tutorial of David Blei, Rajesh Ranganath, and Shakir Mohamed.

2When doing maximum-likelihood learning in latent variable models the posterior is an essential quantity, as it is required for computing gradients of the marginal likelihood.
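As a small numerical check of the identity in Eq. 2.13e, the sketch below uses a conjugate Gaussian toy model ($p(z) = \mathcal{N}(0, 1)$, $p(x \mid z) = \mathcal{N}(z, 1)$), for which the marginal and the posterior are known in closed form, and verifies that the analytically computed ELBO equals $\log p_\theta(x) - \mathrm{KL}(q_\phi(z) \,\|\, p_\theta(z \mid x))$ for an arbitrary Gaussian $q$. The toy model is our own illustration, not part of the thesis.

```python
import math

def elbo(x, m, s):
    """Analytic ELBO for p(z)=N(0,1), p(x|z)=N(z,1) and q(z)=N(m, s^2)."""
    e_log_lik = -0.5 * math.log(2 * math.pi) - 0.5 * ((x - m) ** 2 + s ** 2)   # E_q[log p(x|z)]
    e_log_prior = -0.5 * math.log(2 * math.pi) - 0.5 * (m ** 2 + s ** 2)       # E_q[log p(z)]
    entropy = 0.5 * math.log(2 * math.pi * math.e * s ** 2)                    # -E_q[log q(z)]
    return e_log_lik + e_log_prior + entropy

def log_evidence(x):
    return -0.5 * math.log(2 * math.pi * 2.0) - x ** 2 / 4.0   # marginal p(x) = N(0, 2)

def kl_q_posterior(x, m, s):
    """KL(q || p(z|x)) with the exact posterior p(z|x) = N(x/2, 1/2)."""
    mu_p, var_p = x / 2.0, 0.5
    return math.log(math.sqrt(var_p) / s) + (s ** 2 + (m - mu_p) ** 2) / (2 * var_p) - 0.5

x, m, s = 1.0, 0.3, 0.8
print(round(elbo(x, m, s), 6))
print(round(log_evidence(x) - kl_q_posterior(x, m, s), 6))   # matches the ELBO (Eq. 2.13e)
```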

2.3 Variational Auto-Encoders

A variational auto-encoder (VAE) is an instance of the graphical model of Fig. 2.3, where the distributions are parameterized by neural networks. The “auto-encoder” part of the name refers to the fact that we are encoding the data in some latent space using $q_\phi(z)$, and are subsequently reconstructing the data using $p_\theta(x \mid z)$. From this perspective, we can view the ELBO in Eq. 2.12f as the expected reconstruction error $\mathbb{E}_{q_Z}[\log p_\theta(x \mid z)]$ minus a regularization term that forces the approximate posterior to be close to the prior.

Consider an instance of this model where we impose a standard Gaussian prior on the latent variable and let the data be distributed categorically conditioned on the latent variable. The parameters of the categorical are computed by a neural network $f_\theta(\cdot)$ parameterized by generative parameters $\theta$. For the variational approximation $q_\phi(\cdot)$ we can also freely choose a distribution, which we will assume to be a diagonal Gaussian. We can also let this distribution condition on the observations, leading to the inference network as shown in Fig. 2.5. Its parameters are computed by two neural networks $\mu_\phi(\cdot)$ and $\sigma_\phi(\cdot)$, parameterized by variational parameters $\phi$.


Figure 2.5: The inference network of the VAE model, parameterized by $\phi$.

$Z \sim \mathcal{N}(0, I) \quad (2.14)$
$X \mid z, \theta \sim \mathrm{Cat}(f_\theta(z)) \quad (2.15)$
$Z \mid x, \phi \sim \mathcal{N}(\mu_\phi(x), \mathrm{diag}(\sigma_\phi(x)^2)) \quad (2.16)$

where $\sigma_\phi(x)^2$ denotes the element-wise square of the standard deviation vector, and $\mathrm{diag}(\cdot)$ a function that transforms a vector into the diagonal of a square matrix. With this model we would like to maximize the ELBO from Eq. 2.12f using stochastic gradient methods, as back-propagation is the de facto standard for training neural networks. Computing the ELBO is not a problem, as the KL divergence between two Gaussian distributions has a closed-form solution (Kingma and Welling, 2014, see Appendix B) and the expectation can be approximated with a Monte Carlo (MC) estimate using $K$ samples $z^{(k)} \sim q_\phi(z \mid x)$:

$\mathbb{E}_{q_Z}[\log P_\theta(x \mid z)] \approx \frac{1}{K} \sum_{k=1}^{K} \log P_\theta(x \mid z^{(k)}) \quad (2.17)$

A problem occurs, however, when trying to compute gradients through the MC estimate of the expectation. As sampling is a discrete operation, we cannot backpropagate through a sample. Kingma and Welling (2014) and Rezende et al. (2014) introduce a reparameterization of the latent variable using an auxiliary random variable that does allow for back-propagation to compute gradients. For particular probability distributions from the location-scale family, such as the Gaussian distribution, this looks as follows:

$z = \mu_\phi(x) + \epsilon \odot \sigma_\phi(x) \quad (2.18)$

Here, $\epsilon \sim \mathcal{N}(0, I)$ and $\odot$ denotes the element-wise product of two vectors. If we express this reparameterization as a deterministic function $g_\phi(\epsilon, x) = \mu_\phi(x) + \epsilon \odot \sigma_\phi(x)$ we can write the ELBO as:

$\mathcal{L}(\theta, \phi \mid x) = \mathbb{E}_{p_\epsilon}[\log P_\theta(x \mid g_\phi(\epsilon, x))] - \mathrm{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z)) \quad (2.19)$

This way we have moved the randomness from the random variable $Z$ to an auxiliary random variable $\epsilon$ that does not depend on $\phi$, allowing us to compute gradients of $\log P_\theta(x \mid g_\phi(\epsilon, x))$ with respect to $\phi$. We can now do back-propagation, e.g. in automatic differentiation environments such as TensorFlow (Abadi et al., 2016), to maximize the ELBO while jointly optimizing the generative and the variational parameters.
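As a concrete illustration of Eqs. 2.14-2.19, the sketch below draws a reparameterized sample and forms a single-sample ELBO estimate using the closed-form KL between a diagonal Gaussian and the standard Gaussian prior. The decoder here is a stand-in Gaussian log-likelihood purely for illustration; the model described above uses a categorical likelihood computed by a neural network.

```python
import numpy as np

rng = np.random.default_rng(0)

def elbo_single_sample(x, mu, sigma, log_px_given_z):
    """One-sample estimate of the ELBO (Eq. 2.19) with a reparameterized z (Eq. 2.18)."""
    eps = rng.standard_normal(mu.shape)           # eps ~ N(0, I)
    z = mu + eps * sigma                          # z = mu_phi(x) + eps (element-wise) sigma_phi(x)
    # closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )
    kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - 2.0 * np.log(sigma))
    return log_px_given_z(x, z) - kl

# toy decoder: an isotropic Gaussian likelihood around a linear map of z (illustrative only)
W = rng.normal(size=(3, 2))
def toy_log_likelihood(x, z):
    mean = W @ z
    return -0.5 * np.sum((x - mean) ** 2) - 0.5 * x.size * np.log(2 * np.pi)

x = np.array([0.5, -1.0, 2.0])
mu, sigma = np.array([0.1, -0.2]), np.array([0.9, 1.1])
print(elbo_single_sample(x, mu, sigma, toy_log_likelihood))
```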


Chapter 3

Auto-Encoding Variational Neural Machine Translation

The parallel data used in the training of an NMT system often is a byproduct of mixing various sources of data. The data could be a mix of different domains, but even within a single domain there can be many sources of variation. The data could be composed of sentences translated in different directions or by different translators, or even by different agencies with different guidelines. Another common heuristic to increase the amount of training data is to automatically translate target-side data to obtain a source-side for the data, creating an additional synthetic dataset. This synthetic data is noisy as machine translation is far from perfect. None of these factors are typically accounted for in the statistical assumptions of a neural machine translation model. We present a generative joint model for neural machine translation that includes a latent representation that jointly explains the source and target sentences. This latent representation has the potential to model variations in the data. We test our model in three settings. The first is an in-domain setting, where the test data is of the same domain as the training data. In this setting our model could capture the small variations that are present within a single dataset, e.g. individual translator traits. The second setting is that of mixed-domain data, in which we intentionally mix the domains of the training data and test its performance on the individual domains separately. The latent representation here is expected to also capture domain in this setting. Finally, the third setting includes monolingual data in the form of automatically translated synthetic data of roughly the same domain, where the latent representation could learn to distinguish between real and synthetic data. We coin our model auto-encoding variational neural machine translation (AEVNMT).

In Section 3.1 we introduce the AEVNMT model and explore several ideas for doing inference of the latent variable. In Section 3.2 we test the performance of the AEVNMT model in the discussed settings against a strong NMT baseline and a fully supervised joint NMT baseline, as well as experiment with several options for doing inference. In Section 3.3 we perform an analysis of the latent variable. We discuss related work in Section 3.4 and we conclude this chapter with a discussion in Section 3.5. The content of this chapter has also been summarized in a manuscript to be submitted to an appropriate peer reviewed conference. A pre-print is available from https://arxiv.org/abs/1807.10564.

3.1 Model

In order to capture variations in the data we introduce a continuous latent variable to the NMT model and model source and target sentences jointly. We first generate a latent sentence representation, from this generate a source sentence, and finally generate a target sentence given the latent representation as well as the source sentence. Given a source-target sentence pair from an i.i.d. dataset, $(x_1^m, y_1^n) \in \mathcal{D}$, the model thus factorizes as:

$p_\theta(x_1^m, y_1^n, z) = p_\theta(z)\, P_\theta(x_1^m \mid z)\, P_\theta(y_1^n \mid x_1^m, z) \quad (3.1)$

Figure 3.1: The generative model of AEVNMT.

Also see the graphical model in Figure 3.1. This implies that the joint distribution over observations is modeled as a marginal of $p_\theta(x_1^m, y_1^n, z)$. Observations $(x_1^m, y_1^n) \in \mathcal{D}$ are thus assumed to be sampled from the marginal:



$P_\theta(x_1^m, y_1^n) = \int p_\theta(z)\, P_\theta(x_1^m \mid z)\, P_\theta(y_1^n \mid x_1^m, z)\, \mathrm{d}z \quad (3.2)$

This is the marginal likelihood of the data, which is the quantity we want to optimize for. We model the source and target data as sequences of categorical draws without Markov assumptions and predict the categorical parameters at each time step with a neural network:

$P_\theta(x_1^m, y_1^n \mid z) = \prod_{i=1}^{m} P_\theta(x_i \mid x_{<i}, z) \prod_{j=1}^{n} P_\theta(y_j \mid x_1^m, y_{<j}, z) = \prod_{i=1}^{m} \mathrm{Cat}(x_i \mid g_\theta(x_{<i}, z)) \prod_{j=1}^{n} \mathrm{Cat}(y_j \mid f_\theta(x_1^m, y_{<j}, z)) \quad (3.3)$

The neural network is itself parameterized by $\theta$, which we will use to bundle all the deterministic generative parameters on which the generative model depends. As the neural networks are predicting parameters of categorical distributions, they have to normalize their outputs, e.g. by using a softmax layer. Note that $f_\theta(\cdot)$ essentially is a neural machine translation model (see Section 2.1), and $g_\theta(\cdot)$ a neural language model (Mikolov et al., 2010), both extended by additionally conditioning on the latent sentence representation.

3.1.1 Parameter Estimation

The marginal likelihood of Eq. 3.2 is intractable to compute due to the integral over the continuous latent variable in combination with the categorical variables. We thus resort to variational inference (see Section 2.2). We introduce a variational approximation $q_\phi(z \mid x_1^m, y_1^n)$ to the intractable posterior $p(z \mid x_1^m, y_1^n)$. We model this variational approximation as a diagonal Gaussian whose mean and variance are computed by neural networks using variational parameters $\phi$:

$q_\phi(z \mid x_1^m, y_1^n) = \mathcal{N}(z \mid \mu_\phi(x_1^m, y_1^n), \mathrm{diag}(\sigma_\phi(x_1^m, y_1^n)^2)) \quad (3.4)$

Here we choose to infer $Z$ from both the source and target sentences for generality, but we will discuss options for inference at test time in Sections 3.1.3 and 3.1.4. Following Section 2.2 we can derive a lower bound on the marginal log-likelihood for this model, called the ELBO:

$\log P_\theta(x_1^m, y_1^n) \geq \mathbb{E}_{q_Z}[\log P_\theta(x_1^m \mid z)] + \mathbb{E}_{q_Z}[\log P_\theta(y_1^n \mid x_1^m, z)] - \mathrm{KL}(q_\phi(z \mid x_1^m, y_1^n) \,\|\, p(z)) = \mathcal{L}(\theta, \phi \mid x_1^m, y_1^n) \quad (3.5)$

The ELBO consists of an expected language model term, an expected translation model term, and a KL divergence between the diagonal Gaussian approximate posterior distribution and the latent variable prior. As we use a Gaussian approximate posterior distribution, imposing a Gaussian prior on the latent variable would be a logical choice. We opt for a standard Gaussian prior on the latent variable and thereby choose to not learn the prior. The KL divergence now has an analytical solution and can thus be computed in closed form. The expectations with respect to the approximate posterior can be reparameterized using an auxiliary variable $\epsilon \sim \mathcal{N}(0, I)$ so that the ELBO can be optimized to a local optimum using stochastic gradient methods (see Section 2.3).
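To show how the pieces of Eq. 3.5 fit together in training, here is a schematic single-sample objective for one sentence pair. The callables `inference_net`, `log_lm`, and `log_tm` are hypothetical stand-ins for the inference network, the latent-conditioned language model, and the latent-conditioned translation model; this is not the TensorFlow implementation used for the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def aevnmt_elbo(x, y, inference_net, log_lm, log_tm):
    """Single-sample ELBO of Eq. 3.5 for one sentence pair (x, y).

    inference_net(x, y) -> (mu, sigma) of the diagonal Gaussian q_phi(z | x, y)
    log_lm(x, z)        -> log P_theta(x | z)      (latent-conditioned language model)
    log_tm(y, x, z)     -> log P_theta(y | x, z)   (latent-conditioned translation model)
    """
    mu, sigma = inference_net(x, y)
    eps = rng.standard_normal(mu.shape)
    z = mu + eps * sigma                                                  # reparameterized sample
    kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - 2.0 * np.log(sigma))   # KL(q || N(0, I))
    return log_lm(x, z) + log_tm(y, x, z) - kl                            # maximize w.r.t. theta and phi

# toy stand-ins, only to show the call structure
d_z = 4
toy_inference = lambda x, y: (np.zeros(d_z), np.ones(d_z))
toy_log_lm = lambda x, z: -2.0 * len(x)      # pretend per-token log-likelihoods
toy_log_tm = lambda y, x, z: -2.5 * len(y)
print(aevnmt_elbo(["ein", "Haus"], ["a", "house"], toy_inference, toy_log_lm, toy_log_tm))
```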

3.1.2 Statistical Considerations

There are several theoretical benefits to modeling NMT the way we do. We assume that the data is sampled from the marginal likelihood of Eq. 3.2. Such a likelihood can be more complex than one that does not include a latent sentence representation1. Concretely, the model of the observed data $P_\theta(x_1^m, y_1^n \mid z)$ depends on the value of $z$ and can learn different data distributions for different values of $z$, and thus different language and translation models, e.g. for different domains.

1Although we will not provide a formal proof here, it can be shown that the marginal likelihood $P_\theta(x_1^m, y_1^n) = \int P_\theta(x_1^m, y_1^n \mid z)\, p_\theta(z)\, \mathrm{d}z$ results in a superset of the distributions possible for $P_\theta(x_1^m, y_1^n \mid z)$ due to the non-linear dependency on $Z$, resulting in a set of probability distributions that contains the original exponential family but is itself not a member of an exponential family.

Furthermore, note that regular NMT systems, by directly modeling the conditional distribution over target sentences, assume that the source data distribution is not useful for learning a translation model. Concretely, regular NMT systems assume complete independence between source language modeling parameters and source-to-target translation modeling parameters. This extends to gradient updates as well: the parameter updates of the source-to-target translation model are insensitive to how likely the source sentence is according to the source language model. A case where this is unintuitive is, for example, when source sentences are noisy, such as in back-translated synthetic data or in crowd-sourced datasets.

Our model breaks this independence assumption in two ways. The first way is that we share a source language embedding matrix between the source language model and the translation model. The second way is that the language and translation models share a common latent sentence representation. Note that whereas the embedding matrix is a deterministic quantity global to the dataset, the latent sentence representation is stochastic and local to each data point.

3.1.3 Prediction

At test time the model needs to predict a target sentence given an input source sentence. The decision rule typically used in machine translation is known as MAP decoding (Smith, 2011), and determines the following quantity:

$\arg\max_{y_1^n} \log P(y_1^n \mid x_1^m) \quad (3.6)$

As discussed in Section 2.1.2 exactly determining this quantity is intractable as the space is infinite, and typically some greedy search algorithm such as beam search (Sutskever et al., 2014) is employed to approximate it. However, for our model we have to make several additional approximations in order to make sure prediction times are not considerably worse than those of regular NMT.

Firstly, we search through the marginal likelihood, where the equality holds because the source sentence is constant (Eq. 3.7a). Second, instead of searching through the marginal likelihood itself, we search through the lower bound (Eq. 3.7b). We then replace the approximate posterior with an auxiliary distribution $r_\lambda(z \mid x_1^m)$ that does not depend on the target sentence (Eq. 3.7c). This prevents a combinatorial explosion due to conditioning on the target sentence. In Eq. 3.7d we remove all terms that do not affect the arg max. Finally, instead of MC sampling the expectation, which would require decoding multiple times, we directly search through the conditional and condition on the expected latent representation instead (Eq. 3.7e). These approximations allow for prediction times close to those of a regular NMT system, but they are purely heuristics and therefore offer no guarantees.

$\arg\max_{y_1^n} \log P_\theta(y_1^n \mid x_1^m) = \arg\max_{y_1^n} \log P_\theta(x_1^m, y_1^n) \quad (3.7a)$
$\approx \arg\max_{y_1^n} \mathbb{E}_{q_Z}[\log P_\theta(x_1^m, y_1^n \mid z)] - \mathrm{KL}(q_\phi(z \mid x_1^m, y_1^n) \,\|\, p(z)) \quad (3.7b)$
$\approx \arg\max_{y_1^n} \mathbb{E}_{r_Z}[\log P_\theta(x_1^m, y_1^n \mid z)] - \mathrm{KL}(r_\lambda(z \mid x_1^m) \,\|\, p(z)) \quad (3.7c)$
$= \arg\max_{y_1^n} \mathbb{E}_{r_Z}[\log P_\theta(y_1^n \mid x_1^m, z)] \quad (3.7d)$
$\approx \arg\max_{y_1^n} \log P_\theta(y_1^n \mid \mathbb{E}_{r_Z}[z], x_1^m) \quad (3.7e)$

As one of these decoding heuristics we introduced an auxiliary distribution $r_\lambda(z \mid x_1^m)$. It is clear that this distribution should be approximated to be close to $q_\phi(z \mid x_1^m, y_1^n)$. We therefore let $r_\lambda$, like the approximate posterior, be a Gaussian whose parameters are predicted by a neural network with parameters $\lambda$. We estimate the parameters of this neural network by minimizing a divergence $D(r_\lambda, q_\phi) \geq 0$, positive for all $\lambda$ and $\phi$, which we subtract from the model's objective function:

$\log P_\theta(x_1^m, y_1^n) \geq \mathcal{L}(\theta, \phi \mid x_1^m, y_1^n) - D(r_\lambda, q_\phi) \quad (3.8)$

Note that the lower bound on the marginal log-likelihood still holds due to the positivity constraint. This objective can be used to jointly optimize $\theta$, $\phi$, and $\lambda$. We explore several definitions of the divergence $D(r_\lambda, q_\phi)$: a KL divergence in either direction, as well as the sum of both directions, known as the Jensen-Shannon divergence (also see Appendix A). Note, however, that we do not have to use an entire distribution for $r_\lambda$. Instead, we can also make it a point estimate, for which a logical choice would be to have it approximate the approximate posterior mean. In that case we can instead subtract an L2 penalty from the ELBO:

$\log P_\theta(x_1^m, y_1^n) \geq \mathcal{L}(\theta, \phi \mid x_1^m, y_1^n) - \| r_\lambda(x_1^m) - \mu_\phi(x_1^m, y_1^n) \|_2^2 \quad (3.9)$

We also explore an even simpler option, where we drop the conditioning on the target sentence from the approximate posterior itself. In that case $q_\phi(z \mid x_1^m, y_1^n) = q_\phi(z \mid x_1^m) = r_\lambda(z \mid x_1^m)$. This requires no additional computational resources, but has the disadvantage that the approximate posterior loses context and cannot use information from the target sentence directly during training.

3.1.4 Less-Amortized Inference

Additionally, we explore a different method for doing prediction that does not require introducing an auxiliary distribution $r_\lambda$. So far, we have been doing amortized inference for the approximate posterior over the latent representation, i.e. we have trained two functions $\mu_\phi(x_1^m, y_1^n)$ and $\sigma_\phi(x_1^m, y_1^n)$ that take a data point $(x_1^m, y_1^n)$ as input and predict parameters for the approximate posterior distribution using a single set of variational parameters $\phi$. At the other extreme, we could have separate parameters $\mu^{(n)}$ and $\sigma^{(n)}$ for each data point individually. We propose an in-between, where we use two sets of inference network parameters. One inference network predicts the approximate posterior parameters based on only the input sentence as $q_{\phi_x}(z \mid x_1^m) = \mathcal{N}(z \mid \mu_{\phi_x}(x_1^m), \mathrm{diag}(\sigma_{\phi_x}(x_1^m)^2))$, and another inference network uses both the source and the target sentence to predict approximate posterior parameters as $q_{\phi_{xy}}(z \mid x_1^m, y_1^n) = \mathcal{N}(z \mid \mu_{\phi_{xy}}(x_1^m, y_1^n), \mathrm{diag}(\sigma_{\phi_{xy}}(x_1^m, y_1^n)^2))$. The inference networks are used for different, non-overlapping sets of data points. For example, in Section 3.2.1 we employ the full approximate posterior network for synthetic data points and the restricted approximate posterior network for real bilingual data points. For predictions we strictly use the restricted inference network for determining the quantity:

$\arg\max_{y_1^n} \log P_\theta(y_1^n \mid x_1^m) \approx \arg\max_{y_1^n} \log P_\theta(y_1^n \mid \mathbb{E}_{q_Z}[z \mid \phi_x], x_1^m) \quad (3.10)$

We coin this method less-amortized inference, as we are still doing amortized inference, but are using two sets of variational parameters for different sets of data points. The advantage of this method is that the restricted network used for predictions can be shielded from data points which we know beforehand to be noisy, such as in the case of including back-translated synthetic data during training.

Figure 3.2: A fully supervised joint model for neural machine translation.

3.2 Experiments

We perform experiments in three settings: one in which the test data follows the training data distribution closely (in-domain), one in which the training data is of mixed domain and we test on the domains individually (mixed-domain), and one in which we include target monolingual data in the form of automatically back-translated synthetic data (back-translation). We feel that a sentence-level representation is well suited for predicting domain and for distinguishing between real and synthetic data. We focus on two translation tasks: WMT's translation of news data (Bojar et al., 2016a) and IWSLT's translation of TED talk transcripts (Cettolo et al., 2014), and concentrate on translating between German (De) and English (En) text. We show validation results in this chapter in BLEU and test results in both BLEU and BEER. We also provide additional validation results for the test results shown in this chapter in Appendix C, and additional test results in Meteor and TER in Appendix D.

Systems

For the in-domain, mixed-domain, and back-translation experiments the baseline system is a standard conditional NMT system as described in Section 2.1. We compare this to our auto-encoding variational NMT (aevnmt)2. Furthermore, in order to verify that any performance improvements of the aevnmt model are indeed due to the latent space, we perform an ablation experiment in which we remove the latent variable. The graphical model of this ablated model is shown in Fig. 3.2. Its log-likelihood differs from that of a regular conditional NMT system only in an additional source language modeling term:

$\log P_\theta(x_1^m, y_1^n) = \log P_\theta(x_1^m) + \log P_\theta(y_1^n \mid x_1^m) \quad (3.11)$

We again share a source embedding matrix between the language and translation models. Note that this is now the only commonality between the source language model and the translation model. We refer to this model as the fully supervised joint model. We provide architectural details of all our models in Appendix B.

Datasets

For bilingual data we use German-English data from the IWSLT 2014 evaluation campaign (Cettolo et al., 2014), consisting of sentence-aligned subtitles from TED and TEDx talks (spoken language), and the German-English WMT17 News Commentary (NC) v12 data (Bojar et al., 2017a). The datasets consist of 153,326 and 255,591 training sentences respectively. For the back-translation setting we use a subsample of 10 million monolingual sentences from the German and English WMT17 News Crawl articles from 2016 (Bojar et al., 2017a). For the WMT task we concatenate newstest2014 and newstest2015 for validation and development (5,172 sentence pairs) and use newstest2016 for testing (2,999 sentence pairs). For the IWSLT task, we follow Ranzato et al. (2015) by using 6,969 training instances for validation and development and reporting test results on the concatenation of dev2010, dev2012 and tst2010-2012 (6,750 sentence pairs). We tokenize all data using the tokenizer of the Moses toolkit (Koehn et al., 2007), and train a truecaser using the same toolkit on all bilingual and monolingual data in order to truecase the data. We perform BPE (Sennrich et al., 2016b) on the source and target vocabularies separately with 32,000 merge operations, using all bilingual and monolingual data to learn the BPEs. After applying BPE we remove sentences longer than 50 tokens.

2We made TensorFlow (Abadi et al., 2016) implementations of AEVNMT and the baselines available at https://github.com/

              NC              IWSLT
              En-De   De-En   En-De   De-En
Dropout rate  40%     50%     50%     50%

Table 3.1: Dropout rates used for the baseline models on all datasets and directions.

                          NC       IWSLT
Dropout rate              30%      30%
Word dropout rate         10%      20%
KL cost annealing steps   80,000   80,000
En-De development KL      5.94     8.01

Table 3.2: Hyperparameters used for the aevnmt model along with the KL divergence between the approximate posterior and prior of the latent variable on the En-De development set.

Baseline Hyperparameters

We use GRU units (Cho et al., 2014b) in all recurrent neural networks, using a single layer with a dimensionality of 256. We train using a mini-batch size of 64 and use the Adam optimizer (Kingma and Ba, 2015) with the step size parameter set to 0.0003. All in-domain models are trained for 140,000 mini-batches, followed by a convergence checking procedure inspired by Denkowski and Neubig (2017) in which we check for BLEU improvement after every 500 batches and stop after not finding improvement for 20 such checks. For the mixed-domain setting we train for double the number of mini-batches (280,000), as the dataset size has doubled. For the back-translation setting we alternate between real and synthetic batches, and therefore also train on 280,000 batches so that the model sees the same amount of real bilingual data as in the in-domain setting. After 280,000 batches the mixed-domain and back-translation models follow the same convergence checking procedure. We tuned the dropout rate of the baseline conditional NMT model for maximum BLEU on a development set, separately for each dataset and translation direction. We use dropout (Srivastava et al., 2014) and have tested dropout rates between 10% and 60% in 10% increments. The fully supervised joint model uses the same dropout rate as the baseline model. The aevnmt model instead uses a fixed dropout rate of 30% and is tuned on a set of hyperparameters to prevent the approximate posterior from collapsing to the prior, as described in the next paragraph. The dropout rates for all datasets and translation directions are summarized in Table 3.1. For decoding we use beam search with a beam width of 10 and a length penalty of 1.0.

Avoiding Collapsing to Prior

A common problem with variational auto-encoders parameterized by strong generators, such as recurrent neural networks, is that the KL term in the ELBO can cause the approximate posterior to collapse to the prior distribution (Bowman et al., 2016; Higgins et al., 2016; Sønderby et al., 2016; Alemi et al., 2018). If that happens, the approximate posterior essentially becomes independent of the data, and the model instead fully relies on its generative components to explain the data. In order to prevent this from happening we roughly follow Bowman et al. (2016) by annealing the KL cost term from 0 to 1 over a fixed number of training steps, as well as applying word dropout (Iyyer et al., 2015; Kumar et al., 2016) to the word embeddings alongside regular dropout on the parameters. We tune both parameters, first separately, on 20k, 40k, 60k and 80k KL cost annealing steps (out of 140k training steps) with a constant step size from 0 to 1, and on 10%, 20%, 30%, and 40% word dropout. Note that we always leave a large number of training steps that see the full ELBO without KL cost annealing, so that the model in the end optimizes the true objective. We picked the best two settings of each and then tuned over all four combinations on BLEU and KL divergence on a development set for each dataset separately. Doing this we found that the KL divergence between the approximate posterior and the prior on a development set increased significantly, from near 0 to around 6 to 8. We tuned on En-De and reused the same hyperparameters for De-En in order to spare resources. The hyperparameters, along with the KL divergence on the En-De development set, are shown in Table 3.2.
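The sketch below shows one plausible form of the two techniques just described: a linear KL cost annealing schedule and word dropout. Word dropout is sketched here as replacing input tokens with an UNK id; zeroing the corresponding embedding vectors is an equivalent reading of the text. All constants and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def kl_annealing_factor(step, annealing_steps=80_000):
    """Linearly increase the weight on the KL term from 0 to 1, then keep it at 1."""
    return min(1.0, step / annealing_steps)

def word_dropout(token_ids, unk_id, rate=0.2):
    """Randomly replace input tokens by UNK so the decoder cannot rely on them alone."""
    mask = rng.random(len(token_ids)) < rate
    return [unk_id if drop else tok for tok, drop in zip(token_ids, mask)]

def annealed_elbo(log_likelihood, kl, step):
    """Training objective with an annealed KL term; equals the true ELBO once the factor is 1."""
    return log_likelihood - kl_annealing_factor(step) * kl

print(kl_annealing_factor(20_000), kl_annealing_factor(120_000))   # 0.25, 1.0
print(word_dropout([5, 8, 13, 2, 7], unk_id=1, rate=0.2))
```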

3.2.1 Prediction Experiments

We experiment with several methods for making predictions, as described in Sections 3.1.3 and 3.1.4. We perform all experiments on NC in the En-De direction. First, we have a look at the variants of the ELBO. Recall that in order to make predictions we require an approximation $r_\lambda$ that conditions on the source alone. We experiment with a simple variant where $r_\lambda = q_\phi(z \mid x_1^m)$, i.e. where we restrict the approximate posterior to condition on the source alone, as well as with a trained distribution or point estimate $r_\lambda$ using a KL divergence, JS divergence, or an L2 loss. The BLEU scores of these methods are shown in Table 3.3. The results suggest that conditioning the approximate posterior on the source alone is sufficient.

Objective                                                                              NC
ELBO$_x$                                                                               14.9
ELBO$_{xy}$ $- \|r_\lambda(x_1^m) - \mu_\phi(x_1^m, y_1^n)\|_2^2$                      14.8
ELBO$_{xy}$ $- \mathrm{JS}(r_\lambda(z \mid x_1^m), q_\phi(z \mid x_1^m, y_1^n))$      14.9
ELBO$_{xy}$ $- \mathrm{KL}(r_\lambda(z \mid x_1^m) \,\|\, q_\phi(z \mid x_1^m, y_1^n))$  14.7
ELBO$_{xy}$ $- \mathrm{KL}(q_\phi(z \mid x_1^m, y_1^n) \,\|\, r_\lambda(z \mid x_1^m))$  14.8

Table 3.3: En-De validation BLEU results of using different ELBO variants in which an approximation $r_\lambda$ is used at prediction time. With ELBO$_x$ we denote the ELBO using a restricted approximate posterior $q(z \mid x_1^m)$, and with ELBO$_{xy}$ one using a full-information approximate posterior $q(z \mid x_1^m, y_1^n)$.

We also investigate the effect of using a less-amortized setup as described in Section 3.1.4. This time we also include German monolingual news data and have a conditional model trained on De-En provide a source side for the data. We alternate between real and synthetic batches and train for 280k training steps. Synthetic batches use a full-information approximate posterior, whereas for real data batches and for prediction a restricted approximate posterior conditioning only on the source side is used. The results are shown in Table 3.4. We again do not observe improvement from including a full-information approximate posterior. We will therefore not use a full-information approximate posterior in further experiments.

Model                     NC + News Crawl articles
aevnmt                    17.7
less-amortized aevnmt     17.5

Table 3.4: En-De validation BLEU results for less-amortized inference.

3.2.2 In-Domain Experiments

In this setup the training data can be considered in-domain with respect to the test data: news data for the NC dataset and spoken TED talks for the IWSLT dataset. The results are shown in Tables 3.5 and 3.6. The aevnmt model outperforms the conditional NMT baseline by 0.3 to 0.7 BLEU points. Part of this improvement comes from jointly modeling the source and target sentences, as the joint model alone already yields up to 0.3 BLEU points of improvement, but BLEU improves further when the latent representation is included. BEER shows roughly the same pattern as BLEU on the IWSLT dataset; on the NC dataset, however, it does not show a clear winner among the three models.

3.2.3 Mixed-Domain Experiments

For the mixed-domain setting we train the model on a concatenation of the NC and IWSLT datasets and use a concatenation of the development sets for early stopping. We subsequently test the trained models, one for En-De and one for De-En, on the individual test sets of each domain. In this setting we thus know that the model is exposed to two very different data distributions. The results are shown in Tables 3.7 and 3.8.

WMT16          En-De            De-En
               BLEU ↑   BEER ↑  BLEU ↑   BEER ↑
conditional    17.7     53.5    20.5     54.3
joint          18.0     53.8    20.5     54.2
aevnmt         18.4     53.7    20.8     54.1

Table 3.5: In-domain experimental results on the NC dataset.

IWSLT14        En-De            De-En
               BLEU ↑   BEER ↑  BLEU ↑   BEER ↑
conditional    23.4     59.1    28.7     60.6
joint          23.6     59.2    28.7     60.6
aevnmt         23.9     59.4    29.3     61.0

Table 3.6: In-domain experimental results on the IWSLT dataset.

Note that the mixed-domain models are thus a single model per language pair. The aevnmt model outperforms the conditional model by up to 1.3 BLEU points, with the largest improvements when testing on the news domain. BEER follows the same pattern this time. The joint model performs roughly on par with the aevnmt model in this setting.

En-De          WMT16            IWSLT14
Model          BLEU ↑   BEER ↑  BLEU ↑   BEER ↑
conditional    17.3     54.2    24.1     59.7
joint          18.1     54.7    24.7     60.0
aevnmt         18.6     55.1    24.2     59.8

Table 3.7: En-De test results for mixed-domain training.

De-En          WMT16            IWSLT14
Model          BLEU ↑   BEER ↑  BLEU ↑   BEER ↑
conditional    22.2     55.7    30.6     61.8
joint          22.6     56.1    30.4     61.9
aevnmt         22.7     56.2    30.5     61.9

Table 3.8: De-En test results for mixed-domain training.

3.2.4 Back-Translation Experiments

The datasets we have trained on thus far are relatively small by NMT standards. We therefore also exploit monolingual data by adding synthetic back-translated data to the training data. We obtain it by back-translating target-side monolingual data from the News Crawl articles dataset using a conditional NMT model trained on NC in the direction opposite to the one we are translating in. We then train the models on alternating batches of real and synthetic data for 280k training steps, which means these models see the same amount of real bilingual data as in the in-domain setting. The results of training each model with additional synthetic data are shown in Table 3.9.
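The sketch below illustrates the data pipeline assumed here: target-side monolingual sentences are translated with a reverse-direction model to form synthetic pairs, and training then alternates between real and synthetic batches. The `translate` interface and the batch iterators are hypothetical stand-ins for the actual system.

```python
import itertools

def back_translate(reverse_model, target_sentences):
    """Build synthetic (source, target) pairs by translating target-side
    monolingual sentences with a target-to-source model."""
    return [(reverse_model.translate(y), y) for y in target_sentences]

def alternating_batches(real_batches, synthetic_batches):
    """Interleave real bilingual batches with synthetic back-translated
    batches, one of each per pair of updates."""
    for real, synth in zip(real_batches, itertools.cycle(synthetic_batches)):
        yield real, False   # is_synthetic = False
        yield synth, True   # is_synthetic = True
```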

We observe that including synthetic data greatly improves translation performance for all models as measured by BLEU and BEER, by up to 6.5 BLEU points. The joint model improves further over the conditional model, and the aevnmt model improves more still.


WMT16                       En-De            De-En
                            BLEU ↑   BEER ↑  BLEU ↑   BEER ↑
conditional                 17.7     53.5    20.5     54.3
  + synthetic data          22.3     57.4    27.0     59.0
joint + synthetic data      22.3     57.5    27.2     59.1
aevnmt + synthetic data     22.6     57.6    27.9     59.3

Table 3.9: Test results for training on NC plus back-translated News Crawl articles as synthetic data.

3.3 Analysis of the Latent Variable

We now analyze the latent sentence representation in an attempt to discover what information the latent variable captures about the data. We use the mean of the approximate posterior to predict sentence quality and to classify data sources, and we compare it to the encodings produced by the translation model's source encoder, for both the conditional NMT model and the aevnmt model.

3.3.1 Predicting Quality Assessments

We use synthetic source sentences obtained by back-translating the concatenation of the De validation and test NC data. The back-translation system is the in-domain conditional NMT model trained on De-En NC data (Section 3.2.2). We obtain quality assessments with the sentence-level evaluation metrics Meteor (Banerjee and Lavie, 2005; Denkowski and Lavie, 2011) and TER (Snover et al., 2006), using the original En validation and test data from the NC dataset as references. For each synthetic English sentence we compute the mean of the latent sentence representation under the approximate posterior and the source encodings of the translation model, for both the conditional and the aevnmt models used in the back-translation experiments (Section 3.2.4). We use these models as they have already encountered synthetic news data during training. To obtain a single vector from the source encodings we experiment with both the average and the sum of the encodings; the sum has the property of somewhat preserving sentence-length information.
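The feature sets compared below can be sketched as follows; `enc_states` stands for the per-token source encoder states of shape [sentence length, hidden size] and `posterior_mean` for the mean of the approximate posterior, both hypothetical arrays standing in for the actual model outputs.

```python
import numpy as np

def sentence_features(enc_states, posterior_mean):
    """Collapse per-token encoder states into fixed-size sentence features.
    Averaging discards sentence length, whereas summing partly preserves it;
    the posterior mean is the latent sentence representation itself."""
    return {
        "avg_enc": enc_states.mean(axis=0),
        "sum_enc": enc_states.sum(axis=0),
        "latent_mean": np.asarray(posterior_mean),
    }
```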

                                        Meteor MSE          TER MSE
conditional avg(h_1^m(x_1^m | θ))       107.70 ± 13.43      853.42 ± 937.28
conditional sum(h_1^m(x_1^m | θ))       116.42 ± 13.09      676.23 ± 609.54
aevnmt avg(h_1^m(x_1^m | θ))            107.77 ± 13.10      874.26 ± 1145.99
aevnmt sum(h_1^m(x_1^m | θ))            116.18 ± 12.26      718.83 ± 758.39
aevnmt μ_φ(x_1^m)                       114.80 ± 12.01      855.88 ± 960.39
random features                         139.37 ± 26.37      978.60 ± 1183.68

Table 3.10: MSE from 10-fold cross-validation of Bayesian ridge regression on sentence-level Meteor and TER scores, using different features extracted from the conditional and aevnmt models of the back-translation experiments. We report the mean ± a 95% confidence interval.

We use these as features in a Bayesian linear regression model. We train a Bayesian ridge regression model using the default parameters of the Scikit-learn library (Pedregosa et al., 2011) and perform 10-fold cross-validation to make optimal use of the small training dataset. We report the mean squared error averaged over folds ± a 95% confidence interval in Table 3.10. All extracted features perform better than random features at predicting Meteor scores. The conditional and aevnmt models are equally good at predicting Meteor scores from source encodings, whereas the aevnmt latent sentence representation does slightly worse. Predicting TER scores is too noisy to draw reliable conclusions from.
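A minimal version of this regression setup, assuming the features and sentence-level Meteor scores have already been exported to (hypothetical) .npy files:

```python
import numpy as np
from sklearn.linear_model import BayesianRidge
from sklearn.model_selection import cross_val_score

# One feature vector per synthetic sentence and its sentence-level Meteor
# (or TER) score; the file names are placeholders.
X = np.load("features.npy")
y = np.load("meteor_scores.npy")

model = BayesianRidge()  # Scikit-learn defaults, as in the experiments
neg_mse = cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")
mse = -neg_mse
print("mean MSE over folds:", mse.mean(), "std:", mse.std())
```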

3.3.2 Identifying Domains

We now extract the same set of features in order to train a binary classifier that predicts which of two domains a sentence comes from. We use a concatenation of the En development and test sets of NC as one domain (news data), and a concatenation of the En development and test sets of IWSLT as the other domain (spoken TED talks). The news domain then consists of 8,171 English sentences and the IWSLT domain of 13,719 English sentences.


Figure 3.3: t-SNE plot of the average encoding of the conditional source encoder (left) and the aevnmt mean latent representation (right) on mixed-domain data, with points labeled as news or spoken TED talks. The features were first reduced to 50 dimensions using PCA before running t-SNE.

This time we use the En-De model from the mixed-domain experiments (Section 3.2.3), as this model has seen both domains during training. We train a logistic regression classifier using the Scikit-learn implementation with default parameters and the LIBLINEAR solver (Fan et al., 2008). We perform 10-fold cross-validation and report the mean accuracy ± a 95% confidence interval. The results are shown in Table 3.11.

                                        mean accuracy
conditional avg(h_1^m(x_1^m | θ))       0.92 ± 0.01
conditional sum(h_1^m(x_1^m | θ))       0.92 ± 0.01
aevnmt avg(h_1^m(x_1^m | θ))            0.92 ± 0.01
aevnmt sum(h_1^m(x_1^m | θ))            0.92 ± 0.00
aevnmt μ_φ(x_1^m)                       0.91 ± 0.01
random features                         0.61 ± 0.01

Table 3.11: Mean accuracy of logistic regression on different sets of features for predicting the domain of the data as either news or spoken TED talks. We report the mean ± a 95% confidence interval from 10-fold cross-validation.
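A minimal version of the classification setup, again assuming exported feature files with placeholder names; here we simply print the mean and standard deviation over folds rather than a confidence interval.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# One feature vector per sentence; labels are 0 for news (NC) and
# 1 for spoken TED talks (IWSLT). File names are placeholders.
X = np.load("features.npy")
y = np.load("domain_labels.npy")

clf = LogisticRegression(solver="liblinear")  # otherwise default parameters
acc = cross_val_score(clf, X, y, cv=10, scoring="accuracy")
print("mean accuracy over folds:", acc.mean(), "std:", acc.std())
```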

All features perform about equally well for predicting the domain of the data, with no particular set of features outshining the others; each can be used to correctly identify the domain for the majority of the data. We also show a t-SNE (Maaten and Hinton, 2008) plot of the conditional average source encodings and the aevnmt mean latent representation in Figure 3.3. We see clear clusters forming for the different domains, with some overlap, as can be expected.
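The dimensionality-reduction step behind Figure 3.3 can be sketched as follows; the feature file name is a placeholder and the t-SNE settings are left at Scikit-learn defaults.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X = np.load("features.npy")                       # one feature vector per sentence

X_pca = PCA(n_components=50).fit_transform(X)     # first reduce to 50 dimensions
X_2d = TSNE(n_components=2).fit_transform(X_pca)  # then embed in 2D for plotting
```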

3.3.3 Recognizing Synthetic Data

We repeat the same procedure as for classifying domains, this time to distinguish between real and synthetic data. We use a subsample of 10,000 sentences from the back-translated En News Crawl articles dataset from the back-translation experiments (see Section 3.2.4) and a subsample of 10,000 sentences from the En NC bilingual training data. We compute the same set of features as before using the En-De models from the back-translation experiments. The mean accuracy of 10-fold cross-validated logistic regression is shown in Table 3.12.

We see that all features again work quite well for distinguishing synthetic from real data, with no clear winner. The t-SNE plot of this data is shown in Figure 3.4; the divide between the real-data and synthetic-data clusters is even clearer than the one between the different domains.


                                        mean accuracy
conditional avg(h_1^m(x_1^m | θ))       0.91 ± 0.01
conditional sum(h_1^m(x_1^m | θ))       0.90 ± 0.01
aevnmt avg(h_1^m(x_1^m | θ))            0.91 ± 0.01
aevnmt sum(h_1^m(x_1^m | θ))            0.91 ± 0.00
aevnmt μ_φ(x_1^m)                       0.90 ± 0.00
random features                         0.50 ± 0.01

Table 3.12: Mean accuracy of classifying synthetic versus real data using logistic regression on different sets of features. We report the mean ± a 95% confidence interval from 10-fold cross-validation.

Figure 3.4: t-SNE plot of the average encoding of the conditional source encoder (left) and the aevnmt mean sentence embedding (right) on a mix of real and synthetic data, with points labeled as real or synthetic. The features were reduced to 50 dimensions using PCA before applying t-SNE.

3.4 Related Work

Zhang et al. (2016) were the first to propose a VAE model for neural machine translation. They include a Gaussian-distributed sentence embedding and model the data as being sampled from the marginal P_θ(y_1^n|x_1^m) = ∫ p_θ(z|x_1^m) P_θ(y_1^n|x_1^m, z) dz, which makes their model a conditional deep generative model (Sohn et al., 2015). They thus do not model the source sentence. For predictions they use the prior over the latent variable, which in their model conditions on the source sentence. They found improved performance over an attention-based NMT system (Bahdanau et al., 2015) on an English-Chinese and an English-German translation task where the data was of a single domain. The stochastic decoder model of Schulz et al. (2018) extends the model of Zhang et al. (2016) by using a Markov chain of latent variables, one per word in the target sentence, allowing the model to capture greater variability and to model more complex conditionals.

Concurrently to this work, Shah and Barber (2018) propose a model whose probabilistic formulation is identical to ours. Their motivation is to mix multiple language pairs when training a single model, as this has been shown to be an effective and parameter-efficient method for multilingual NMT (Johnson et al., 2017). The main difference with our model, besides motivation and some architectural details, is their method for making predictions. To cope with the target sentence missing at test time, which is required for computing the approximate posterior, they initially sample a latent sentence embedding from the prior, produce a draft translation from it, and feed that draft into the approximate posterior to obtain a new latent sentence embedding, repeating this process for multiple iterations. Note that this method requires multiple calls to the inference network as well as to the translation model decoder, which is expensive, whereas our model only requires a single pass through the entire network. A downside of their approach is that the approximate posterior bases its predictions on noisy draft translations, whereas it is only trained on gold-standard data.
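A schematic of the iterative prediction procedure attributed to Shah and Barber (2018), with a hypothetical model interface and an illustrative number of refinement steps:

```python
def iterative_prediction(model, x, iterations=3):
    """Start from a prior sample, decode a draft translation, then repeatedly
    re-estimate the latent code from the draft and re-decode."""
    z = model.prior().sample()
    y_draft = model.decode(x, z)
    for _ in range(iterations):
        z = model.posterior(x, y_draft).mean
        y_draft = model.decode(x, z)
    return y_draft
```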
