
MSc Artificial Intelligence

Master Thesis

Co-Training Generative Neural Machine Translation Models

by

Ruben Gerritse

10760326

December 23, 2019

36 EC February 2019 - January 2020

Supervisor:

Dr W Ferreira Aziz

Assessor:

Dr C Monz


Abstract

A huge problem for Neural Machine Translation (NMT) systems is that they only achieve good performance when massive amounts of bilingual corpora are available, as their performance deteriorates in low-resource conditions. Moreover, for the majority of language pairs very little to no parallel data is available, and the creation of parallel corpora is very costly as it requires specialised expertise. We propose a generative model of translation capable of learning from both bilingual and monolingual data. This takes the form of a joint model that generates both streams of the parallel data. Two generative models are trained with flipped source and target roles. In the case of parallel data, variational inference is used to complete the latent code using the posterior in a probabilistic manner. In the case of semi-supervised learning, each generative model is used to complete the missing data for the other model first. We show that this model can outperform a conditional back-translation baseline. Furthermore, we show that this model is more robust to a reduction of the available bilingual data than fully supervised models. Finally, we show that the latent variable is capable of capturing both syntactic and semantic information.

Acknowledgments

I would like to thank Wilker Aziz for the supervision that made this thesis possible. His ideas and creativity have taught me many things during the course of this thesis. Furthermore, I am grateful to Bryan Eikema for his practical advice on deep NLP, which helped me to start this project. I would also like to thank Christof Monz for agreeing to be part of my defence committee. Finally, I would like to thank my family, friends and colleagues for their support and patience.

Contents

1 Introduction
  1.1 Motivation
  1.2 Research Questions & Contributions
  1.3 Thesis outline
  1.4 Notation

2 Background
  2.1 Conditional Neural Machine Translation
    2.1.1 Model
    2.1.2 Architecture
    2.1.3 Parameter estimation
    2.1.4 Prediction
  2.2 Back-translation
  2.3 Auto-Encoding Variational Neural Machine Translation
    2.3.1 Model
    2.3.2 Parameter estimation
    2.3.3 Prediction

3 Co-trained Auto-Encoding Variational Neural Machine Translation
  3.1 Model
    3.1.1 Parameter Estimation
    3.1.2 CoNMT
    3.1.3 CoAEVNMT
    3.1.4 Decision rules
  3.2 Experiments
    3.2.1 Ablation Experiments
    3.2.2 Test Experiments
  3.3 Analysis
    3.3.1 Perplexity
    3.3.2 Lexical distribution
    3.3.3 Predicting synthetic vs gold-standard
    3.3.4 Posterior collapse
  3.4 Related Work
  3.5 Discussion

4 Conclusion

A Architecture Details
  A.1 Translation Model
  A.2 Language Model
  A.3 Sentence Embedding Inference Model

B Additional Experiments
  B.1 Training curriculum
  B.2 Fixed Sentence Embedding Representation
  B.3 Input feeding the latent variable
  B.4 Latent variable to pre-output layer
  B.5 Context dropout

Chapter 1

Introduction

1.1 Motivation

Machine translation (MT) is the field which researches the process of translating text from one language to another in an automated fashion. Modern machine translation systems are based on a deep learning approach where neural networks are used to learn a mapping from variable-length sequences of words in one language to variable-length sequences of words in another language (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2014). This method is known as Neural Machine Translation (NMT). NMT models are essentially conditional language models. Source sentences are observed, but not modelled, and observed target sentences are generated word by word by a probabilistic classifier in a recurrent manner. The models consist of millions of parameters which are point-estimated to attain a local optimum of the conditional log-likelihood of bilingual observations. Despite being so over-parameterised and relying on locally optimised maximum likelihood estimates, NMT has outperformed older MT approaches for most languages in the most traditional benchmarks (Ondrej et al., 2017). Arguably, one of the main contributing factors to the impressive performance of NMT models is the availability of large amounts of high-quality bilingual (sentence-aligned) corpora.

While recent studies have shown the potential of NMT systems by achieving impressive performance on several language pairs (Wu et al., 2016; Hassan et al., 2018), other studies have revealed several open challenges w.r.t. those systems (Koehn and Knowles, 2017; Isabelle et al., 2017). One huge problem is that these systems only achieve good performance when massive amounts of bilingual corpora are available, as their performance deteriorates in low-resource conditions (Koehn and Knowles, 2017; Lample et al., 2018). Moreover, for the majority of language pairs very little to no parallel data is available, and the creation of parallel corpora is very costly as it requires specialised expertise.

In contrast, monolingual corpora are abundantly available for each language. Although monolingual data does not offer the complete supervision required to estimate the parameters of an NMT architecture, it can be exploited to assist a translation model (TM) in learning hidden linguistic structure. In this work we propose a generative model of translation capable of learning from both bilingual and monolingual data. This takes the form of a joint model that generates both streams of the parallel data. Two generative models are trained with flipped source and target roles. In the case of parallel data, variational inference (VI) is used to complete the latent code using the posterior in a probabilistic manner. In the case of semi-supervised learning, each generative model is used to complete the missing data for the other model first. The parameters of the models are estimated jointly on the union of all data using a strategy based on the notion of co-training (Blum and Mitchell, 1998).

1.2 Research Questions & Contributions

In this work we address the following research questions:

How can we improve back-translations in order to improve the performance of NMT models?

In this thesis we approach semi-supervised learning by training two generative models, one in each direction, and have them complete monolingual data for one another. To optimise the models, we use two ELBO functions, where we only optimise the generative component. We show that this model can outperform the standard conditional back-translation baseline in multiple settings.

We train two proposed semi-supervised models, one semi-supervised baseline and two fully supervised baselines on varying amounts of supervised data. We find that although all models suffer from lower performance with less data, the semi-supervised models are overall more robust to the decline of available bilingual data, as their decrease in performance is not as large as that of the fully supervised models.

How much does the use of a latent space influence the performance of a back-translation model?

We introduce two variations of our model: first, the Co-Trained Auto-Encoding Variational Neural Machine Translation model (CoAEVNMT), where we let the two generative models be AEVNMT models, thus employing a latent space; second, the Co-Trained Neural Machine Translation model (CoNMT), where we let the two generative models be conditional NMT models, which do not employ this latent space. We find that the CoAEVNMT outperforms the CoNMT in all settings, showing the relevance of using a latent space in a back-translation model.

1.3 Thesis outline

First, we give a brief overview of the required background knowledge relevant to this work in Chapter 2. We discuss neural machine translation from a probabilistic standpoint, describe the commonly used Conditional Neural Machine Translation model, discuss the back-translation method to leverage monolingual data, and take a look at the Auto-Encoding Variational Neural Machine Translation model. In Chapter 3 we propose an architecture for neural machine translation to improve back-translation and devise a training algorithm based on co-training to circumvent known problems of gradient estimation via score function estimators. We perform several ablation studies to optimise the performance of the proposed model and compare it to several baselines. We also perform an analysis of the back-translations and of the latent space of one of the variants of our model. Finally, we conclude in Chapter 4.

1.4 Notation

Capital Roman letters denote random variables, e.g. Z, and lowercase ones denote assignments of random variables, e.g. z. Uppercase P(·) is used for probability mass functions, whereas lowercase p(·) is used for probability density functions. Dependence on deterministic parameters is denoted either as a subscript, e.g. P_θ(·), or explicitly, e.g. P(·|θ). Observations are sequences x_1^m = (x_1, ..., x_m) and y_1^n = (y_1, ..., y_n) of random draws of lengths m and n, where x_1^m is generally used for source sentences and y_1^n for target sentences. Prefixes of sequences are denoted x_{<i}, where i = 1 denotes an empty prefix. Expectations are often denoted using a shorthand form in which the expectation is annotated with a short form of the relevant probability function, e.g. E_{P_X}[·], where the dependencies of the probability function should be clear from context. Boldface symbols, e.g. h, are used for deterministic vectors and matrices, and also for deterministic vector-valued functions, e.g. f_θ(·).

Chapter 2

Background

Machine translation concerns the task of training models capable of translating text from one language into another in an automated fashion. These kinds of models are usually trained on large amounts of sentence-aligned bilingual data, also known as a parallel corpus or parallel data set. By observing this kind of data, the model learns statistical relations between the two languages, which allow it to search for the most likely translation of a given sentence. An example of such sentence-aligned bilingual data is shown in Table 2.1.

In this chapter, the machine translation models and techniques on which this work is based are discussed. In Section 2.1, conditional NMT models are discussed, which are the most common NMT models. In Section 2.2, back-translation is discussed, which is a method that uses monolingual data to improve NMT models. In Section 2.3, the Auto-Encoding Variational Neural Machine Translation (AEVNMT) model is discussed, which is an NMT model that captures variations in data with the use of variational inference.

English                                      German
The black dog runs through the water.        Der schwarze hund rennt durch das wasser.
Two kids are laughing in the grass.          Zwei kinder im gras lachen.
A boy strolls by a pond in a park.           Ein junge schlendert in einem park an einem teich vorbei.
A man in a white shirt is playing tennis.    Ein mann in einem weißen hemd spielt tennis.

Table 2.1: Example parallel data for English-German (En-De).

2.1 Conditional Neural Machine Translation

Neural Machine Translation (NMT) models use neural networks to learn the aforementioned statistical relations for the translation task. These models succeed the previously popular Statistical Machine Translation (SMT) models (Brown et al., 1993; Koehn et al., 2003; Chiang, 2005), as they have shown multiple advantages over them. First of all, NMT systems have a much simpler architecture than SMT systems. The SMT pipeline consists of many small sub-components which have to be tuned separately. In contrast, NMT systems train a single, large neural network which takes a sentence as input and outputs a corresponding translation. Furthermore, NMT models have been shown to generalise better to unseen data. SMT models suffer from sparsity issues, as translation probabilities are based on counts of phrases composed of one or more words. These estimates are sparse in the case of rare or unseen phrases, whose number grows exponentially with phrase length. Therefore, generalisation to other domains is often limited. NMT has been shown to overcome these problems due to its use of continuous representations for words (Bengio et al., 2003; Mikolov et al., 2010). Moreover, NMT models offer great modelling flexibility. They make no Markov assumptions about the dependencies of the words in the target sentence, which makes the models sensitive to word order and position. Finally, NMT models have shown remarkable empirical success. They have outperformed SMT models for most (if not all) language pairs on the WMT17 shared tasks (Ondrej et al., 2017).

2.1.1 Model

Let X be a random variable over the source vocabulary V_x, and Y a random variable over the target vocabulary V_y. From a probabilistic standpoint, translation is seen as the problem of finding a target sentence y_1^n which maximises the conditional probability of y_1^n given a source sentence x_1^m:

    \arg\max_{y_1^n} P_\theta(y_1^n \mid x_1^m)    (2.1)

NMT models typically learn this conditional distribution directly (Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014; Bahdanau et al., 2014). The target sentence is observed and modelled given the source sentence, which is observed but not modelled. We will refer to these models as Conditional NMT models. Conditional NMT models define the likelihood of y given x as follows:

    P_\theta(y_1^n \mid x_1^m) = \prod_{j=1}^{n} \mathrm{Cat}(y_j \mid f_\theta(y_{<j}, x_1^m))    (2.2)

Here, f_θ is a neural network parameterised by θ which computes the parameters of the categorical distribution conditioned on the source sentence x_1^m and the target prefix y_{<j}. A graphical representation of this model is displayed in Figure 2.1.

Figure 2.1: Probabilistic graph of Conditional NMT. Circled nodes are random variables, shaded nodes are observed, uncircled nodes are assumed deterministic, the plate indicates i.i.d. trials, and a directed edge indicates conditional dependency. |B| represents the number of sentences in the bilingual data set.
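To make the factorisation in Equation 2.2 concrete, the following is a minimal sketch (not the thesis implementation; the logits tensor and the network producing it are assumptions) of how the conditional log-likelihood of a target sentence is computed from per-step categorical parameters:

    import torch
    import torch.nn.functional as F

    def conditional_log_likelihood(logits: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # logits: [n, |Vy|], the unnormalised scores f_theta(y_<j, x_1^m) for j = 1..n
        # y:      [n], the indices of the observed target words y_1..y_n
        # Eq. 2.2: log P_theta(y_1^n | x_1^m) = sum_j log Cat(y_j | softmax(logits_j))
        log_probs = F.log_softmax(logits, dim=-1)           # normalise per time step
        return log_probs.gather(-1, y.unsqueeze(-1)).sum()  # log-prob of each gold word, summed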

2.1.2 Architecture

Cho et al. (2014b) and Sutskever et al. (2014) proposed the popular Recurrent Neural Network (RNN) Encoder-Decoder framework for the function f_θ. Such a framework consists of two RNNs: the encoder and the decoder.

The encoder reads the source sentence and encodes it into a sequence of fixed-dimensional vectors. The input of the encoder takes the form of a sequence of embedded words, e_1^m = (e_1, ..., e_m), which are fixed real-valued vectors. Each word in the vocabulary V_x has its own embedding as a row in the trainable embedding matrix W^{emb-x}, such that the corresponding embedding is obtained by multiplying the correct one-hot vector with W^{emb-x}:

    e_i = \mathrm{onehot}(x_i) \cdot W^{emb\text{-}x}    (2.3)

By inputting the embeddings e_1^m to the RNN encoder, the source sentence representation h_1^m = (h_1, ..., h_m), consisting of m real-valued vectors, is obtained:

    h_1^m = \mathrm{encoder}_{RNN}(e_1^m)    (2.4)

The hidden states h_1^m are then used as input to the decoder. At each time step, the decoder RNN takes as input the previous hidden state and the previous output word embedding (starting with the start-of-sentence token), and computes a new hidden state and output word. This is repeated until an end-of-sentence token is generated or a pre-defined maximum sentence length is reached.

It has been shown that the performance of this encoder-decoder decreases as the length of the input sentence increases (Cho et al., 2014a). In order to address this issue, the attention mechanism, an extension to the decoder architecture, has been introduced (Bahdanau et al., 2014). In contrast to the original encoder-decoder architecture, the attention mechanism does not attempt to encode a whole input sentence into a single fixed-length vector. Instead, it (soft-)searches for relevant words in the source sentence when generating a word in the translation. Given a target word, each word in the source sentence is given a weight, which is computed as follows:

    \alpha_{ji} = \frac{\exp(a(h_i, s_{j-1}))}{\sum_{k=1}^{m} \exp(a(h_k, s_{j-1}))}    (2.5)

Here, α_{ji} is the attention weight for the jth target word and the ith source word, s_{j-1} is the previous state of the RNN decoder, and a is a learned similarity function which scores the affinity between the target prefix and the ith source word. Bahdanau et al. (2014) parameterised the alignment model a as a feed-forward neural network. Using the attention weights, a context vector c_j is computed for time-step j as follows:

    c_j = \sum_{i=1}^{m} \alpha_{ji} h_i    (2.6)

Afterwards, the context vector c_j, along with the previous hidden state s_{j-1} and the previous output word embedding f_{j-1}, is inputted to the decoder RNN to compute the parameters of the categorical distribution.

2.1.3 Parameter estimation

During training, the goal is to find the optimal parameters θ for the neural network f_θ which maximise the conditional log-likelihood of the i.i.d. observations in a data set D:

    \theta^* = \arg\max_\theta \sum_{(x_1^m, y_1^n) \in \mathcal{D}} \log P_\theta(y_1^n \mid x_1^m)    (2.7)

These parameters are point-estimated using stochastic gradient-based optimisation (Robbins and Monro, 1951; Bottou, 2010; Kingma and Ba, 2014) to attain a local maximum of the log-likelihood. As long as f_θ is composed of differentiable building blocks, reverse-mode differentiation can be used to obtain ∇_θ L(θ) automatically.
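A minimal sketch of this estimation procedure (assuming a hypothetical model object exposing a conditional_log_likelihood method; this is not the exact training loop used in this thesis) looks as follows:

    import torch

    def train(model, dataloader, lr=3e-4):
        # model.conditional_log_likelihood(x, y) is assumed to return log P_theta(y | x)
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        for x, y in dataloader:                             # i.i.d. sentence pairs from D
            loss = -model.conditional_log_likelihood(x, y)  # negative of the objective in Eq. 2.7
            optimizer.zero_grad()
            loss.backward()                                 # reverse-mode differentiation
            optimizer.step()                                # stochastic gradient-based update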

2.1.4 Prediction

At prediction time, given a source sentence x_1^m, we would like to produce the translation y_1^n with the highest probability according to the trained model:

    y^* = \arg\max_{y_1^n} P_\theta(y_1^n \mid x_1^m)    (2.8)

As the space of possible translations is unbounded, it is necessary to restrict the length of the target sentence n to some maximum length n_max. Furthermore, as this space grows with O(|V_y|^{n_max}), it is common to use a greedy algorithm to search through it efficiently. A popular such algorithm is beam-search decoding (Sutskever et al., 2014), which considers only k hypotheses at once. At each time step, the hypotheses are extended with every word in V_y; however, all except the k hypotheses with the highest probabilities are discarded. This process is repeated until k hypotheses have reached the maximum length n_max or have produced the end-of-sentence symbol. Usually, a length penalty is used to penalise shorter sentences, as they often have higher probability. This decision rule is an approximation to MAP decoding (Smith, 2011).

2.2 Back-translation

Sennrich et al. (2015a) proposed a method to leverage monolingual data in order to improve the performance of NMT models. By using target monolingual data, the goal is to improve the fluency of the models. In this procedure, a model with flipped source and target is trained and used to translate target monolingual data into the source language. By pairing target monolingual data with synthetically generated source sentences via back-translation, they achieved substantial improvements compared to traditional NMT models.

The idea of back-translation is thus to train a backward NMT system that translates target monolingual data into the source language in a pre-processing step. During training, the synthetic sentence pairs and the original sentence pairs are mixed. As the synthetically generated data is often much larger than the original data, the resulting data set is skewed. Therefore, a common approach is to train the NMT model on batches that alternate between the real and synthetic data, as sketched below.
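A minimal sketch of this procedure (the backward_model.translate interface and the batch helpers are assumptions for illustration, not the actual implementation):

    def backtranslate(backward_model, target_mono):
        # Pre-processing step: translate target monolingual sentences into the source
        # language with a target-to-source ("backward") NMT model.
        return [(backward_model.translate(y), y) for y in target_mono]

    def alternating_batches(real_batches, synthetic_batches):
        # Alternate real and synthetic batches so that the (much larger) synthetic
        # corpus does not dominate the training signal.
        for real, synthetic in zip(real_batches, synthetic_batches):
            yield real
            yield synthetic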

2.3 Auto-Encoding Variational Neural Machine Translation

Bilingual data used to train MT systems is often a byproduct of mixing different sources of data. For example, the data could be created by mixing data from different domains in order to create a larger data set. Even within a single domain, there can be many sources of variation, as the data could be composed of sentences translated in different directions, translated by different translators, or even translated by different agencies with different guidelines. In the case of scarce data, a common method to increase the size of the data is to automatically translate target-side data in order to obtain a source side for the data. This results in noisy synthetic data which is far from perfect.

Most traditional NMT systems do not account for these variations in data, as they directly model a conditional distribution of target sentences given source sentences as a fully supervised problem. Eikema and Aziz (2019) proposed the Auto-Encoding Variational NMT (AEVNMT) which accounts for these variations. The AEVNMT is a deep generative model that generates source and target sentences from a shared sentence-level latent representation. They argue this latent representation helps account for some of the variations in the data distribution.

2.3.1 Model

In order to account for these variations in the data, a continuous latent variable is introduced into the conditional NMT model, and both the source and target sentences are modelled. The joint likelihood P_θ(x_1^m, y_1^n) of sentence pairs is modelled as follows:

    P_\theta(x_1^m, y_1^n) = \mathbb{E}_{p(z)}[P_\theta(x_1^m \mid z) P_\theta(y_1^n \mid x_1^m, z)]    (2.9)

First, a latent sentence representation z is generated from a standard Gaussian prior. Using this representation, a source sentence x_1^m is generated. Finally, the target sentence y_1^n is generated from the latent sentence representation and the source sentence. See Figure 2.2 for a graphical representation of this generative model. Here, the source and target data are modelled as sequences of categorical draws without Markov assumptions, where at each time step the parameters of the categorical distributions are predicted using neural networks:

    P_\theta(x_1^m, y_1^n \mid z) = \prod_{i=1}^{m} P_\theta(x_i \mid x_{<i}, z) \prod_{j=1}^{n} P_\theta(y_j \mid y_{<j}, x_1^m, z)    (2.10)
                                  = \prod_{i=1}^{m} \mathrm{Cat}(x_i \mid g_\theta(x_{<i}, z)) \prod_{j=1}^{n} \mathrm{Cat}(y_j \mid f_\theta(y_{<j}, x_1^m, z))

The source sentence is generated by drawing from categorical distributions parameterised by a recurrent neural network g_θ (the language model), and the target sentence is generated by drawing from categorical distributions parameterised by a sequence-to-sequence architecture f_θ (the translation model). As the parameters predicted by the neural networks are inputted to categorical distributions, they are first normalised using a softmax layer. The neural networks g_θ and f_θ are parameterised by θ, which denotes all parameters on which the generative model depends.

Figure 2.2: The generative model of AEVNMT.

2.3.2 Parameter estimation

Due to the expectation over the continuous latent variable z, the joint likelihood of the data in Equation 2.9 is intractable. To deal with this, variational inference is used and a variational approximation q_λ(z | x_1^m, y_1^n) to the intractable posterior p_θ(z | x_1^m, y_1^n) is introduced (Jordan et al., 1999; Blei et al., 2017). This variational approximation is modelled as a diagonal Gaussian whose mean and variance are predicted by neural networks with parameters λ:

    q_\lambda(z \mid x_1^m, y_1^n) = \mathcal{N}(z \mid \mu_\lambda(x_1^m, y_1^n), \mathrm{diag}(\sigma_\lambda(x_1^m, y_1^n)^2))    (2.11)

See Figure 2.3 for a graphical representation of this inference model. With the use of this approximation, the parameters of the generative model θ and the inference model λ can be estimated by maximising a lower bound on the marginal log-likelihood, known as the ELBO (Jordan et al., 1999):

    \log P_\theta(x_1^m, y_1^n) \geq \mathbb{E}_{q_Z}[\log P_\theta(x_1^m \mid z)] + \mathbb{E}_{q_Z}[\log P_\theta(y_1^n \mid x_1^m, z)] - \mathrm{KL}(q_\lambda(z \mid x_1^m, y_1^n) \,\|\, p(z))    (2.12)
                                 = \mathcal{L}(\theta, \lambda \mid x_1^m, y_1^n)

The ELBO consists of an expected language model term, an expected translation model term, and a KL divergence between the Gaussian approximate posterior distribution and the Gaussian prior. As the approximate posterior and the prior are both Gaussian, the KL term can be computed in closed form (Kingma and Welling, 2013). Furthermore, the expectations with respect to the approximate posterior are reparameterised using an auxiliary variable ε ∼ N(0, 1):

    z = \underbrace{\mu_\lambda(x_1^m, y_1^n)}_{u} + \epsilon \cdot \underbrace{\sigma_\lambda(x_1^m, y_1^n)}_{s} \sim \mathcal{N}(z \mid u, s^2)    (2.13)

    \epsilon = \frac{z - \mu_\lambda(x_1^m, y_1^n)}{\sigma_\lambda(x_1^m, y_1^n)} \sim \mathcal{N}(0, 1)    (2.14)

Using this reparameterisation, the ELBO can be optimised to a local optimum using stochastic gradient methods (Kingma and Welling, 2013; Rezende et al., 2014).

Figure 2.3: The inference model for the AEVNMT model.
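The following is a minimal single-sample sketch of the reparameterised ELBO of Equations 2.12-2.14 (the model's component networks are hypothetical placeholders, not the actual AEVNMT implementation):

    import torch

    def elbo(model, x, y):
        # q_lambda(z | x, y) = N(mu, diag(sigma^2)), predicted by the inference network
        mu, sigma = model.inference_net(x, y)
        eps = torch.randn_like(mu)                          # auxiliary noise, Eq. 2.14
        z = mu + eps * sigma                                # reparameterised sample, Eq. 2.13
        log_px = model.language_model_log_prob(x, z)        # single-sample estimate of E_q[log P(x|z)]
        log_py = model.translation_model_log_prob(y, x, z)  # estimate of E_q[log P(y|x,z)]
        # closed-form KL between N(mu, sigma^2) and the standard Gaussian prior
        kl = 0.5 * (mu.pow(2) + sigma.pow(2) - 2 * torch.log(sigma) - 1).sum(-1)
        return log_px + log_py - kl                         # Eq. 2.12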

2.3.3 Prediction

As the AEVNMT model is a latent variable model, MAP decoding requires searching for the y_1^n which maximises the marginal (Equation 2.8). Eikema and Aziz (2019) proposed several approximations to make decoding as fast as for Conditional NMT. For their final approach, they replaced the approximate posterior q_λ(z | x_1^m, y_1^n) with q_λ(z | x_1^m), thus dropping its dependency on y_1^n. Furthermore, they condition on the expected latent representation and search greedily for a translation using beam search through:

    \arg\max_{y_1^n} \log P_\theta(y_1^n \mid x_1^m, \mathbb{E}_{q_Z}[z])    (2.15)

Chapter 3

Co-trained Auto-Encoding Variational Neural Machine Translation

While NMT systems have achieved impressive performance, there are still several open challenges with respect to those systems. As they rely on the availability of large amounts of bilingual corpora, their performance deteriorates in low-resource settings. Sennrich and Zhang (2019) showed that NMT systems trained in low-resource settings are very sensitive to hyperparameters. Moreover, for the majority of language pairs very little to no parallel data is available, and the creation of parallel corpora is very costly as it requires specialised expertise. It has been shown that back-translation is an effective method to leverage monolingual data in order to improve NMT (Sennrich et al., 2015a). It essentially completes monolingual data with the use of a backward-trained NMT system that translates target monolingual data into the source language in a pre-processing step. However, we can expect these back-translations to be poor in quality in low-resource settings, as they too rely on bilingual data. Therefore, we suspect that if the performance of this back-translation component is improved, this would result in better translation models overall. The recently proposed AEVNMT (Eikema and Aziz, 2018) formulates translation as a generative model, which accommodates the semi-supervised setting rather naturally: i) it naturally accommodates learning from monolingual data; ii) it has been shown to learn better NMT models in the supervised setting than conditional NMT systems, thus we can expect it to produce higher quality back-translations; iii) it generates sentences from a latent space, thus we hypothesise it will produce more variability (especially when employed as a back-translation component).

We approach the semi-supervised setting by training two models, one in each direction, and have them complete monolingual data for one another. We describe two variants of this setup: one where we let the two models be Conditional NMT models, and one where we let them be AEVNMT models. We perform three ablation experiments on these models. In the first experiment we compare several decoding methods to generate back-translations from monolingual data. In the second experiment we look at the effect of pre-training the models on bilingual data only before using the monolingual data. In the third experiment we look at the effect of converging the models on a joint metric. After these experiments, we compare the two variants against several baselines with different amounts of bilingual data.

In Section 3.1 we introduce the joint NMT model and describe two variants of this model. Then, we perform several ablation experiments on the models and test the performance of the models against three baseline models in Section 3.2. Afterwards, we perform an analysis of the latent variable in Section 3.3 and discuss related work in Section 3.4. Finally, we conclude this chapter with a discussion in Section 3.5.

3.1 Model

We propose to train two generative models of translation which are capable of learning in the semi-supervised setting from data whose inputs are missing. The two models are trained with flipped roles of source and target, and we combine them through inference. One model is trained to serve predictive tasks based on P_{θ1}(y_1^n | x_1^m), such as translating an observed source sentence x_1^m into a target sentence. The other model is trained to serve predictive tasks based on P_{θ2}(x_1^m | y_1^n), such as translating an observed target sentence y_1^n into a source sentence. In the semi-supervised setting, where the models should learn from data whose inputs are missing, the generative treatment allows us to complete the data probabilistically.

Following Eikema and Aziz (2019), we model the joint likelihood P_{θ1}(x_1^m, y_1^n) = P_{θ1}(x_1^m) P_{θ1}(y_1^n | x_1^m) of sentence pairs using a generative model, as shown in Figure 3.1a-left. Both the source sentence x_1^m and the target sentence y_1^n are observed. First the source sentence x_1^m is generated, and then the target sentence y_1^n is generated from the source sentence. Similarly, we model the joint likelihood P_{θ2}(x_1^m, y_1^n) = P_{θ2}(y_1^n) P_{θ2}(x_1^m | y_1^n) of sentence pairs using a complementary generative model, as shown in Figure 3.1b-left.

In the supervised setting, parameters θ1 and θ2 can be estimated by simply optimising the conditional log-likelihood. For target monolingual data, it is additionally required to complete the missing source sentences. This is achieved with the use of an independent inference model Q_{λ1}(x_1^m | y_1^n), as shown in Figure 3.1a-right. First the source sentence x_1^m is generated given the observed target sentence y_1^n. Afterwards, once again the target sentence y_1^n can be generated from the inferred source sentence x_1^m. Using the idea of variational inference (Jordan et al., 1999), parameters θ1 and λ1 can be estimated by maximising a lower bound on the log-likelihood of the target observation y_1^n:

    \log p_{\theta_1}(y_1^n) \geq \mathbb{E}_{Q_{\lambda_1}}[\log P_{\theta_1}(x_1^m, y_1^n) - \log Q_{\lambda_1}(x_1^m \mid y_1^n)] = \mathbb{E}_{Q_{\lambda_1}}[\log P_{\theta_1}(x_1^m, y_1^n)] + H(x_1^m \mid y_1^n, \lambda_1)    (3.1)

In the case of source monolingual data, the missing target sentences are completed using an independent inference model Q_{λ2}(y_1^n | x_1^m) in a similar manner, as shown in Figure 3.1b-right. The lower bound in Equation 3.1 has a straightforward analogue for the source monolingual data.

(a) Joint model with source-to-target conditional. (b) Joint model with target-to-source conditional.

Figure 3.1: Graphical model for the joint NMT: generative models (left), semi-supervised inference models (right).

In the spirit of variational inference, we are free to choose Q_{λ1} and Q_{λ2} as we like. Instead of designing separate inference components to complete the data, we let the predictive distribution of the complementary generative model fulfil this role. That is, in the case of monolingual data, we let each generative model act as the inference model for the other:

    Q_{\lambda_1}(x_1^m \mid y_1^n) = P_{\theta_2}(x_1^m \mid y_1^n)    (3.2)
    Q_{\lambda_2}(y_1^n \mid x_1^m) = P_{\theta_1}(y_1^n \mid x_1^m)    (3.3)

3.1.1 Parameter Estimation

As shown in Equation 3.1, the ELBO is a function of the generative parameters θ_i and the variational parameters λ_i. Ideally we would optimise the ELBO with respect to both parameter sets, which requires taking the gradient of the ELBO w.r.t. both θ_i and λ_i. When taking the gradient of the ELBO w.r.t. θ_i, we can move the gradient inside the expectation E_{Q_{λi}}, as the expectation is independent of θ_i, sample an x_1^m from Q_{λi}(x_1^m | y_1^n), and take the gradient of log P_{θi}(x_1^m, y_1^n) − log Q_{λi}(x_1^m | y_1^n) w.r.t. θ_i. However, this is not the case for λ_i, as the expectation depends on this parameter set. This would require us to resort to gradient estimation via a score function estimator (also known as the REINFORCE estimator). This estimator is known to suffer from high variance, which makes learning very difficult.

(a) CoNMT model with source-to-target direction. (b) CoNMT model with target-to-source direction.

Figure 3.2: Graphical model for the CoNMT: generative models (left), back-translation models (right).

In order to circumvent this, we alternate updates to the parameter sets θ_i and λ_i. As stated in Section 3.1, we let the predictive distribution of the complementary generative model fulfil the role of inference model for the other. Inspired by Blum and Mitchell (1998), we train the two generative components separately for their own tasks. Therefore, instead of using a single ELBO to optimise both parameter sets, we use two ELBOs (one in each direction) to optimise both networks, where we only optimise the generative component P_{θi}(x_1^m | y_1^n) of each ELBO. The variational component Q_{λi}(x_1^m | y_1^n) is then only updated when it plays the role of the generative component of the other ELBO. As stated earlier, VI allows any proposal distribution to be used as an inference model. Therefore, we are allowed to let each generative model act as the inference model for the other. This idea of training each component only for its own task and then using each component to enlarge the data set for the other is what Blum and Mitchell (1998) called co-training.

3.1.2 CoNMT

A simple variant of this model is to use two conditional NMT models for the generative models P_{θ1} and P_{θ2}. In this case, we factorise the joint distribution P(x_1^m, y_1^n) as follows:

    P(x_1^m, y_1^n \mid \theta_1, \lambda_1) = P_{\lambda_1}(x_1^m) \, P_{\theta_1}(y_1^n \mid x_1^m)    (3.4)

Here, we let the independent inference model Q_{λ1}(x_1^m | y_1^n) act as a prior distribution P_{λ1}(x_1^m). We call the model where we let the generative models P_{θ1} and P_{θ2} be conditional NMT models the Co-Trained Neural Machine Translation model (CoNMT). Figure 3.2 is a graphical depiction of the CoNMT model. The conditional likelihood is modelled with the generative model P_{θ1}(y_1^n | x_1^m), as shown in Figure 3.2a-left. For target monolingual data, completion of the missing source sentences is achieved by the independent inference model Q_{λ1}(x_1^m | y_1^n) (see Figure 3.2a-right). The conditional likelihood is also modelled with a complementary generative model P_{θ2}(x_1^m | y_1^n), as shown in Figure 3.2b-left. For source monolingual data, the missing target sentences are completed with the use of an independent inference model Q_{λ2}(y_1^n | x_1^m) (see Figure 3.2b-right). The formal algorithm is listed in Algorithm 1.

3.1.3 CoAEVNMT

An example of a joint model is the AEVNMT. As described in Section 2.3, this joint model employs a latent space Z, which is not a necessity for a joint model, and learns a joint distribution P(x_1^m, y_1^n) over the sentence pairs ⟨x_1^m, y_1^n⟩. We call the model where we let the generative models P_{θ1} and P_{θ2} be AEVNMT models the Co-Trained Auto-Encoding Variational Neural Machine Translation model (CoAEVNMT). Figure 3.3 is a graphical depiction of the CoAEVNMT model. The joint likelihood is modelled with the generative model P_{θ1}(x_1^m, y_1^n) = E_{P_{θ1}(z)}[P_{θ1}(x_1^m | z) P_{θ1}(y_1^n | x_1^m, z)] (see Figure 3.3a-left). In the supervised setting, we use VI to approximate the posterior P_{θ1}(z | x_1^m, y_1^n) with a tractable model Q_{α1}(z | x_1^m) (see Figure 3.3a-middle) to complete the latent code. For target monolingual data, completing the missing source sentences is achieved with an independent inference model Q_{λ1}(x_1^m | y_1^n) (see Figure 3.3a-right). Similarly, the joint likelihood is also modelled with the complementary generative model P_{θ2}(x_1^m, y_1^n) = E_{P_{θ2}(z)}[P_{θ2}(y_1^n | z) P_{θ2}(x_1^m | y_1^n, z)] (see Figure 3.3b-left). Again, we design the inference model Q_{α2}(z | y_1^n) (see Figure 3.3b-middle) to complete the latent code, and use an independent inference model Q_{λ2}(y_1^n | x_1^m) to complete the missing target sentences (see Figure 3.3b-right).

Algorithm 1 Co-Training Algorithm for CoNMT

procedure JOINT-TRAINING
    while not converged do
        repeat
            for batch b in curriculum do
                if b is bilingual then
                    Train P_{θ1}(y_1^n | x_1^m) using L(θ1) = Σ_{(x_1^m, y_1^n) ∈ b} log P_{θ1}(y_1^n | x_1^m)
                    Train P_{θ2}(x_1^m | y_1^n) using L(θ2) = Σ_{(x_1^m, y_1^n) ∈ b} log P_{θ2}(x_1^m | y_1^n)
                if b is source monolingual then
                    Use Q_{λ2}(y_1^n | x_1^m) to generate back-translations y_1^n
                    Train P_{θ2}(x_1^m | y_1^n) using L(θ2) = Σ_{(x_1^m, y_1^n) ∈ b} log P_{θ2}(x_1^m | y_1^n)
                if b is target monolingual then
                    Use Q_{λ1}(x_1^m | y_1^n) to generate back-translations x_1^m
                    Train P_{θ1}(y_1^n | x_1^m) using L(θ1) = Σ_{(x_1^m, y_1^n) ∈ b} log P_{θ1}(y_1^n | x_1^m)
        until all bilingual batches have been observed once
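The sketch below (Python-style, with hypothetical model and batch objects; it is not the thesis code) illustrates one pass of the co-training procedure of Algorithm 1:

    def cotraining_epoch(model_st, model_ts, curriculum):
        # model_st estimates P_theta1(y|x) (source-to-target), model_ts estimates
        # P_theta2(x|y) (target-to-source); each back-translates for the other.
        for batch in curriculum:  # alternates bilingual, source-mono and target-mono batches
            if batch.kind == "bilingual":
                model_st.train_step(src=batch.x, tgt=batch.y)
                model_ts.train_step(src=batch.y, tgt=batch.x)
            elif batch.kind == "source_mono":
                y_hat = model_st.translate(batch.x)          # Q_lambda2(y|x) = P_theta1(y|x)
                model_ts.train_step(src=y_hat, tgt=batch.x)  # only theta2 is updated here
            elif batch.kind == "target_mono":
                x_hat = model_ts.translate(batch.y)          # Q_lambda1(x|y) = P_theta2(x|y)
                model_st.train_step(src=x_hat, tgt=batch.y)  # only theta1 is updated here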


In the case of an AEVNMT, setting Q_{λi}(x_1^m | y_1^n) = P_{θi}(x_1^m | y_1^n) requires evaluating the marginal P_{θi}(x_1^m, y_1^n) (see Equation 3.1), which is intractable. However, due to co-training and coordinate-style updates this is not necessary: we can sample the missing data x_1^m from the AEVNMT component and do not need gradients of the intractable entropy.

In contrast to the CoNMT, where the generative models are optimised using conditional likelihoods, the generative models of the CoAEVNMT are optimised using ELBO functions. Therefore, to obtain the corresponding algorithm for the CoAEVNMT, we replace the loss functions L(θ1) and L(θ2) in Algorithm 1 as follows:

    \mathcal{L}(\theta_1) = \mathbb{E}_{q_Z}[\log P_{\theta_1}(x_1^m \mid z)] + \mathbb{E}_{q_Z}[\log P_{\theta_1}(y_1^n \mid x_1^m, z)] - \mathrm{KL}(q_{\alpha_1}(z \mid x_1^m) \,\|\, p(z))    (3.5)
    \mathcal{L}(\theta_2) = \mathbb{E}_{q_Z}[\log P_{\theta_2}(y_1^n \mid z)] + \mathbb{E}_{q_Z}[\log P_{\theta_2}(x_1^m \mid y_1^n, z)] - \mathrm{KL}(q_{\alpha_2}(z \mid y_1^n) \,\|\, p(z))    (3.6)

In terms of computational complexity, the CoAEVNMT is similar to having two NMT models. The difference is a manageable increase in memory due to two additional encoders for the latent-variable inference models and two additional RNNs for the language models.

3.1.4 Decision rules

If we think of training the two separate components from the co-training viewpoint, we are allowed to use several options for generating sentences to complete the data: ancestral sampling, greedy decoding, and beam search. Consider the case where we have to complete monolingual target data. Given a target sentence y_1^n, we would like to obtain a source sentence x_1^m with high probability under the inference model Q_{λ1}(x_1^m | y_1^n). Using greedy decoding, we extend the source sentence at each step with the word that has the highest probability according to our inference model; this is equivalent to beam decoding as explained in Section 2.1.4 with k = 1. Using ancestral sampling, instead of picking the word with the highest probability, we sample a word from the vocabulary V_x according to the categorical distribution. A minimal sketch of these options is given below.
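The sketch assumes a hypothetical step function that returns the categorical distribution over V_x for the next source word given the prefix and the observed target sentence (beam search with k > 1 would follow Section 2.1.4):

    import torch

    def complete_source(step_probs, y, bos, eos, method="greedy", max_len=50):
        # step_probs(prefix, y) -> probabilities over V_x for the next source word
        prefix = [bos]
        while len(prefix) < max_len and prefix[-1] != eos:
            probs = step_probs(prefix, y)
            if method == "ancestral":
                nxt = torch.multinomial(probs, 1).item()   # sample a word from Cat(.)
            else:                                          # greedy = beam search with k = 1
                nxt = probs.argmax().item()
            prefix.append(nxt)
        return prefix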

3.2 Experiments

To optimise and test our models, we perform several experiments. In Section 3.2.1 we perform several ablation experiments to optimise the performance of our models. In Section 3.2.2 we compare our models against several baselines in different settings. We focus on two translation tasks: Multi30k's (Elliott et al., 2016) translation of image descriptions between English (En) and German (De), and SETIMES2's (Tiedemann, 2012) translation of news data between English (En) and Turkish (Tr). We show validation and test results in this chapter using the BLEU metric (Papineni et al., 2002). As deep learning models can be highly sensitive to initial conditions, and best runs may not fairly represent a model's average performance, we refrain from using performance significance tests (Gelman, 2016; Gelman and Carlin, 2017; McShane et al., 2019). Instead we dedicate resources to training the models multiple times using independent random seeds. In our results, we report the average and standard deviation across 4 different runs.

(a) CoAEVNMT model with source-to-target direction. (b) CoAEVNMT model with target-to-source direction.

Figure 3.3: Graphical model for the CoAEVNMT: generative models (left), inference models for supervised learning (middle), inference models for semi-supervised learning (right).


Systems

For our ablation studies, we use the CoNMT and the CoAEVNMT described in Sections 3.1.2 and 3.1.3. In our final comparison, we compare our systems to three baseline systems. The first is the Conditional NMT system described in Section 2.1, which trains on bilingual data only. The second is the back-translation system described in Section 2.2, which trains on both bilingual and monolingual data. The third is the AEVNMT system described in Section 2.3, which trains on bilingual data only but makes use of a latent space. Architectural details of the mentioned models are provided in Appendix A.

Data set

To optimise the models' hyperparameters and test the models we use the Multi30k data set (Elliott et al., 2016), which is an extension of the Flickr30K data set (Young et al., 2014) containing German translations of English image descriptions. The Multi30k data set consists of 29,000 training sentences, 1,014 validation sentences and two test sets, called Flickr 2016 and Flickr 2017, each containing 1,000 sentences. As monolingual data we use the additional 145,000 German image descriptions released as part of the Multi30k data set for the task of image description generation. These descriptions were collected independently of the existing English descriptions. The data sets have been lowercased and tokenised in advance using the Moses toolkit (Koehn et al., 2007). Afterwards, BPE (Sennrich et al., 2015b) has been applied to the source and target vocabularies jointly, using 10,000 merge operations and all the bilingual and monolingual data to learn the BPEs. Finally, sentences longer than 50 tokens have been removed.


Hyperparameters

For all recurrent neural networks, we use LSTM units (Hochreiter and Schmidhuber, 1997) with a single layer and a dimensionality of 256. Training is performed on mini-batches of 40 with the use of the Adam optimiser (Kingma and Ba, 2014). The learning rate is initially set to 0.003. With the use of a learning rate scheduler, the learning rate is halved after 3 epochs if the validation BLEU has not improved by then. A minimal learning rate of 1e-05 has been set in order to avoid the learning rate becoming too small. In order to check for convergence, a procedure inspired by Denkowski and Neubig (2017) is used, where the BLEU score is checked after each epoch and training is halted when the BLEU score has not improved for 10 of such checks. We use a dropout rate (Srivastava et al., 2014) of 0.5 across all models, which has been tuned using the Conditional NMT baseline on the validation set. At prediction time, beam search is used with a beam width of 10 and a length penalty of 1.0.

The latent size of the AEVNMT model has been tuned to 64. Variational auto-encoders, such as the AEVNMT, are known to often suffer from collapse of the approximate posterior to the prior (Bowman et al., 2015; Sønderby et al., 2016; Higgins et al., 2017; Alemi et al., 2017). In that case, the approximate posterior essentially ignores the data, meaning that the model behaves as a regular conditional NMT model. In order to avoid posterior collapse we use free bits (Kingma et al., 2016) and word dropout (Iyyer et al., 2015); tuning these parameters resulted in the use of 10 nats of information and a word dropout rate of 0.1. To allow for effective use of monolingual data, several additional hyperparameters have been added to the CoNMT and CoAEVNMT models. We train on the data using an x y xy curriculum, meaning that we alternate training on source monolingual, target monolingual and bilingual batches until convergence. Furthermore, the models are pre-trained on the bilingual data only for 40 epochs, and after this warm-up phase the optimisers and schedulers are reset. The parameters of the inference models Q_{α1} and Q_{α2} of the CoAEVNMT model are optimised with a separate learning rate, which has been set to 0.001. Finally, in the case of monolingual data the language model parameters of the CoAEVNMT are frozen and only the parameters of the inference and translation models are updated. In other words, for a source monolingual batch we fix the parameters of P_{θ1}(x_1^m), and for a target monolingual batch the parameters of P_{θ2}(y_1^n).

3.2.1 Ablation Experiments

We perform three ablation experiments on the CoNMT and CoAEVNMT models. First, we experiment with several decoding methods to generate back-translations. Then, we investigate the effect of pre-training the models on bilingual data only. Afterwards, we measure the performance of using a joint metric to converge the models. In addition, we provide validation results of several other ablation experiments in Appendix B which are not shown in this chapter.

Decoding Experiments

As the quality of the back-translations influences the performance of the models, we experiment with several decoding methods for generating back-translations to complete the monolingual data, as described in Section 3.1. The decoding methods used for experimentation are ancestral sampling, greedy decoding and beam search. We train the CoAEVNMT using the different decoding methods and measure their performance on the Multi30k data set using the BLEU metric. The BLEU scores are shown in Table 3.1. The results show that beam search outperforms greedy sampling and ancestral sampling in both directions. However, the results do not show an optimal beam size, as there is no beam size which consistently outperforms the others. As beam search shows the best results, we decide to use beam search with beam size 3 for the other experiments. Unlike ancestral sampling, using beam search does mean we are taking biased gradient updates (Edunov et al., 2018); however, the results show enough evidence that back-translation works better with beam search.

                                  En-De           De-En
Ancestral sampling                38.76 ± 0.59    42.54 ± 0.38
Greedy sampling                   40.62 ± 0.11    44.76 ± 0.36
Beam search with beam size 3      41.23 ± 0.36    45.05 ± 0.35
Beam search with beam size 5      40.90 ± 0.39    45.09 ± 0.14
Beam search with beam size 10     40.90 ± 0.46    45.17 ± 0.43

Table 3.1: Validation BLEU results for the decoding experiments.

Pre-training Experiments

As we can expect poor back-translations at the start of the training phase, they may impact the training progress negatively. Therefore, we experiment with pre-training the CoAEVNMT in a bilingual setting only before adding monolingual data. The effect of adding pre-training is measured on the Multi30k data set using the BLEU metric. The BLEU scores are shown in Table 3.2. The results show that pre-training the model does indeed improve its performance, as it results in a gain of 1.31 BLEU points in the En-De direction and 0.43 in the De-En direction. Therefore, in further experiments we will pre-train our models.

                        En-De           De-En
With pre-training       40.62 ± 0.11    44.76 ± 0.36
Without pre-training    39.31 ± 0.18    44.33 ± 0.53

Table 3.2: Validation BLEU results for the pre-training experiments.

Convergence Experiments

In contrast to the back-translation baselines, our parameters are selected by optimising two stopping criteria at the same time, namely the BLEU in the En-De direction and the BLEU in the De-En direction. Since we select parameters to optimise both directions at the same time, the performance in each individual direction may decrease. In this experiment, we compare the use of a joint convergence metric, which is the product of BLEU_en-de and BLEU_de-en, against the use of an individual convergence metric per direction. The BLEU scores are shown in Table 3.3. The results show that model selection based on a single direction does indeed improve the performance of the model for that direction, however only by a small margin: 0.19 in the En-De direction and 0.09 in the De-En direction. In further experiments, we will be using early stopping based on the joint convergence metric.

                          En-De           De-En
Joint convergence         40.77 ± 0.14    45.51 ± 0.42
Individual convergence    40.96 ± 0.08    45.60 ± 0.12

Table 3.3: Validation BLEU results for the convergence experiments.

3.2.2 Test Experiments

As stated earlier, the performance of the CoNMT and CoAEVNMT models is compared to three baseline models. We first compare the performance of the models by training on all the data of the Multi30k data set. Then, we compare the performance of the models when we change the amount of available bilingual data.

Multi30k

Here we train the models on all the data of the Multi30k data set and test their performance on the two Multi30k test sets. The BLEU scores are shown in Table 3.4. First, we observe that the CoAEVNMT outperforms all baselines on both test sets and in both directions. In contrast, the CoNMT is outperformed by the Conditional + Back-translation baseline by 0.18 BLEU points in the De-En direction on the Flickr 2016 test set, although it outperforms the baseline by 0.15 BLEU points in the other direction. On the Flickr 2017 test set the CoNMT and the back-translation model show similar performance, with differences of only 0.03-0.06 BLEU points. Furthermore, we observe that all models which make use of back-translation consistently outperform the models which only train on bilingual data. Finally, we observe that the models making use of latent variables consistently outperform their non-latent counterparts, i.e. the CoAEVNMT consistently outperforms the CoNMT while the AEVNMT consistently outperforms the Conditional model.

Bilingual data

In this experiment, we once again compare the models on the Multi30k test sets; however, we vary the amount of available bilingual data. The models are trained using all of the bilingual data, half of the bilingual data and a quarter of the bilingual data. The BLEU scores for all settings are shown in Figures 3.4a-3.4d.

                                    Flickr 2016                     Flickr 2017
                                    En-De           De-En           En-De           De-En
CONDITIONAL                         37.91 ± 0.19    38.55 ± 0.26    30.71 ± 0.63    34.02 ± 0.42
AEVNMT                              38.66 ± 0.35    39.07 ± 0.37    31.38 ± 0.39    35.27 ± 0.36
CONDITIONAL + BACK-TRANSLATION      38.80 ± 0.33    40.45 ± 0.27    32.03 ± 0.57    36.41 ± 0.47
CONMT                               38.95 ± 0.26    40.27 ± 0.25    31.97 ± 0.54    36.44 ± 0.25
COAEVNMT                            39.15 ± 0.06    40.57 ± 0.10    32.57 ± 0.48    37.17 ± 0.26

Table 3.4: Test BLEU results on the Multi30k data set.

When comparing the performance of the models using all bilingual data with the 50% bilingual data setting, the results show that models which rely only on bilingual data suffer a relatively larger decrease in performance when using 50% of the bilingual data than the models which train on both bilingual and monolingual data. For example, Figure 3.4a shows the BLEU scores with varying amounts of bilingual data in the En-De direction for the Flickr 2016 test set. The supervised models lose roughly 4 BLEU points when the data is halved, while the semi-supervised models lose around 3 BLEU points. Similar trends can be seen in Figures 3.4b-3.4d. In the case of using 25% of the bilingual data, the AEVNMT model seems able to cope better with this decrease of data than the Conditional model, as its decrease in BLEU points is smaller. In contrast, the decrease in BLEU score for the CoNMT model in this setup is larger than that of the other semi-supervised models and comes closer to that of the AEVNMT model. For example, Figure 3.4b shows that both the CoNMT and AEVNMT models lose about 3.2 BLEU points, while the Conditional model loses 4.42, the Conditional model with back-translation loses 2.06 and the CoAEVNMT loses 1.76. Furthermore, the Conditional model with back-translation even outperforms the CoNMT in all settings with 25% of the bilingual data.

Looking at the overall performance of the models, the results show that all semi-supervised models still outperform the supervised models in all settings. Furthermore, the CoAEVNMT model still outperforms all other models in all settings.

3.3 Analysis

In this section we perform several analyses of the semi-supervised models. First, we train an RNN language model (RNNLM) on real and synthetic data and calculate perplexities. Then we examine the lexical distributions of the synthetic data and compare them to the gold-standard data. We then train a classifier on the latent variable to distinguish between real and synthetic data. Afterwards, we perform several analyses to investigate the possibility of posterior collapse. All analyses are performed on the Multi30k data set.

3.3.1 Perplexity

Here we measure how well the gold-standard data is able to predict the synthetic data and vice versa. We do so by training an RNN language model (RNNLM) on either the gold-standard data or the synthetic data from the semi-supervised models, and calculating the perplexity on the other data set. The synthetic data is generated using beam search. As a baseline, we train the RNNLM on the gold-standard data and calculate the perplexity on the gold-standard data as well. We use the data from the Multi30k data set and train the model on the concatenation of one side of the bilingual and monolingual data. The perplexity is calculated on the validation data. The resulting perplexities are shown in Table 3.5.
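A minimal sketch of the perplexity computation (assuming a trained language model object with a hypothetical log_prob method returning the natural-log probability of a sentence):

    import math

    def perplexity(lm, corpus):
        # corpus is a list of tokenised sentences; lower perplexity means the
        # language model predicts the data more easily.
        total_log_prob, total_tokens = 0.0, 0
        for sentence in corpus:
            total_log_prob += lm.log_prob(sentence)
            total_tokens += len(sentence)
        return math.exp(-total_log_prob / total_tokens)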

The results show that the perplexity when training on the gold-standard data and assessing the synthetic data is lower in both directions than the baseline, which shows that an RNNLM trained on the gold-standard data predicts the synthetic data more easily than it predicts the gold-standard data itself. This suggests that the synthetic samples are covered by the original data distribution and possibly form only a subset of it. The RNNLM has the least difficulty predicting the synthetic data from the CoAEVNMT model. Furthermore, the results show that the perplexity when training on synthetic data and assessing the gold-standard data is higher in both directions than the baseline, which shows that RNNLMs trained on synthetic data have more difficulty predicting the gold-standard data than an RNNLM trained on the gold-standard data. Again, this suggests that the synthetic data does not cover all of the original data distribution. In addition, we see that the RNNLM trained on synthetic data from the Conditional + Back-translation model shows the lowest perplexity compared to the other two semi-supervised models.

(a) BLEU results on the Flickr 2016 test set in the En-De direction:
              Cond     AEVNMT    Cond+Back    CoNMT    CoAEVNMT
100% data     37.91    38.66     38.80        38.95    39.15
50% data      33.74    34.38     35.64        35.65    36.19
25% data      26.90    28.93     30.94        30.65    31.45

(b) BLEU results on the Flickr 2016 test set in the De-En direction:
              Cond     AEVNMT    Cond+Back    CoNMT    CoAEVNMT
100% data     38.55    39.07     40.45        40.27    40.57
50% data      36.08    36.98     38.94        38.96    39.13
25% data      31.66    33.76     36.88        35.80    37.37

(c) BLEU results on the Flickr 2017 test set in the En-De direction:
              Cond     AEVNMT    Cond+Back    CoNMT    CoAEVNMT
100% data     30.71    31.38     32.03        31.97    32.57
50% data      27.59    28.50     29.87        30.26    30.95
25% data      23.97    25.11     27.64        26.91    28.21

(d) BLEU results on the Flickr 2017 test set in the De-En direction:
              Cond     AEVNMT    Cond+Back    CoNMT    CoAEVNMT
100% data     34.02    35.27     36.41        36.44    37.17
50% data      31.09    32.30     35.01        34.66    35.76
25% data      26.63    28.80     32.34        31.28    33.41

Figure 3.4: Test BLEU results on the Flickr 2016 and Flickr 2017 test sets trained using 100%, 50% and 25% of the bilingual data.

Train-Eval          En       De
Gold-Gold           22.31    30.39
Gold-Cond+Back      14.80    20.48
Gold-CoNMT          14.65    20.72
Gold-CoAEVNMT       14.67    19.63
Cond+Back-Gold      30.28    37.32
CoNMT-Gold          33.95    38.84
CoAEVNMT-Gold       33.66    40.70

Table 3.5: Perplexity of a language model trained on the Multi30k gold-standard and synthetic data (rows denote the training data and the data on which perplexity is computed).


3.3.2

Lexical distribution

Here we investigate the lexical distributions of the synthetic data from the semi-supervised models compared to the gold-standard data. The lexical distribution are calculated on the validation set. Figure 3.5 shows the lexical distributions of the gold-standard and the synthetic data for the English and German language. We can see for both English and German that the lexical distributions of the synthetic data for all models are similar to that of the gold-standard. We can see however for the synthetic data of all semi-supervised models that the probabilities of words which are relatively high compared to other words are slightly increased.

Table 3.6 shows the forward and backward KL between the lexical distributions of the semi-supervised models and the gold-standard. As many words had zero probability under the synthetic data, we used Laplace smoothing (Schütze et al., 2008) to assign a small probability to these words, allowing us to calculate the backward KL. The results show that the lexical distribution of the Conditional+Back-translation model is the closest to that of the gold-standard for English, while the lexical distribution of the CoAEVNMT model is the closest to that of the gold-standard for German. However, the differences between the KL divergences of those models are relatively small. The KL divergences for the CoNMT model are the highest in all cases; for English they are relatively high compared to the other models.

                                       En                       De
                                Forward   Backward      Forward   Backward
Conditional+Back-translation     0.0477     0.0413       0.0745     0.0698
CoNMT                            0.0533     0.0456       0.0766     0.0717
CoAEVNMT                         0.0486     0.0415       0.0725     0.0686

Table 3.6: KL divergence between the lexical distributions of the gold-standard and the semi-supervised models.
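The divergence computation can be sketched as follows, assuming tokenised sentences and a shared vocabulary; the function names, the choice of alpha = 1, and the assignment of the "forward" and "backward" directions in the usage comment are illustrative assumptions.

```python
import math
from collections import Counter

def lexical_distribution(sentences, vocab, alpha=1.0):
    """Unigram distribution over a fixed shared vocabulary with Laplace
    (add-alpha) smoothing, so no word receives zero probability."""
    counts = Counter(tok for sent in sentences for tok in sent)
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    """KL(p || q) for two distributions defined over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# p_gold = lexical_distribution(gold_sentences, vocab)
# p_syn = lexical_distribution(synthetic_sentences, vocab)
# forward_kl = kl_divergence(p_gold, p_syn)   # one possible convention
# backward_kl = kl_divergence(p_syn, p_gold)
```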

3.3.3

Predicting synthetic vs gold-standard

Here we investigate whether there is a difference in the latent code inferred from gold-standard and synthetic data. The forward (source-to-target) model learns on gold-standard x → y as well as on synthetic x′ → y, where x′ is provided by the backward model (target-to-source). If the forward model's inference network can detect that x′ is synthetic, then it can inform the decoder that a translation task (x′, y) is somewhat different from a translation task (x, y). We test this by checking whether the mean of the posterior q(z|x) is predictive of whether x is gold or synthetic.

Let $B = \{(x_i, r_i)\}_{i=1}^{N}$ be a set of $N$ sentence pairs, where $x_i$ is the $i$-th source sentence and $r_i$ its reference translation. Let $B' = \{(x_i, \arg\max_y p(y \mid x_i)) : (x_i, r_i) \in B\}$. We use the target-to-source system to infer Gaussian posteriors for the target side of both $B$ and $B'$. The former provides positive examples (gold-standard), the latter negative examples (synthetic). We train a Support Vector Machine (Suykens and Vandewalle, 1999) to discriminate gold from synthetic based on the posterior mean.
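A minimal sketch of this probe, using scikit-learn's `SVC` with an RBF kernel and a random train/test split as stand-ins (the kernel and split are assumptions, as the exact configuration is not specified here); `mu_gold` and `mu_synthetic` are the posterior means inferred for the gold and synthetic target sides.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

def gold_vs_synthetic_probe(mu_gold, mu_synthetic, seed=0):
    """Train an SVM on posterior means to separate gold from synthetic
    sentences; scores near 0.5 suggest the two are hard to distinguish."""
    X = np.concatenate([mu_gold, mu_synthetic])
    y = np.concatenate([np.ones(len(mu_gold)), np.zeros(len(mu_synthetic))])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=y)
    clf = SVC(kernel="rbf").fit(X_tr, y_tr)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_te, clf.predict(X_te), average="binary")
    return precision, recall, f1
```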

The precision, recall and F1-scores of the classification task are shown in Table 3.7. We can see that the classifier has difficulty distinguishing between gold-standard and synthetic data, as all scores are close to 0.5. This suggests that the inference model gives similar latent representations to gold-standard and synthetic sentences.


(a) Lexical distribution of the gold-standard. (b) Lexical distribution of the CoAEVNMT model.

(c) Lexical distribution of the CoNMT model. (d) Lexical distribution of the Conditional+Back-translation model.

Figure 3.5: Lexical distributions of the gold-standard and the synthetic data from the semi-supervised models, calculated on the validation set.


Possible explanations may be that the translations are too good to distinguish, or that the latent variable is used to represent something else.

               En      De
Precision     0.56    0.56
Recall        0.46    0.40
F1-score      0.51    0.47

Table 3.7: Precision, recall and F1-scores on the gold-standard vs synthetic classification task.

3.3.4

Posterior collapse

As stated in Section 3.2, variational auto-encoders are known to often suffer from collapse of the approximate posterior to the prior (Bowman et al., 2015; Sønderby et al., 2016; Higgins et al., 2017; Alemi et al., 2017), meaning that the approximate posterior essentially ignores the data. Here we perform several tests to investigate whether the posterior of the CoAEVNMT model has collapsed.

Validation KL

First, we check whether the validation KL is larger than 0. If the validation KL were 0, this would imply that the posterior has collapsed to the prior. We show the validation KL in Table 3.8. For both directions the validation KL is larger than 0, which implies that the posterior has not collapsed.

                 En-De      De-En
Validation KL   10.8667    10.9270

Table 3.8: Validation KL between the approximate posterior and the prior for the CoAEVNMT model in both directions.
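For a diagonal Gaussian posterior and a standard Normal prior, this KL term has a closed form; a minimal sketch, assuming tensors `mu` and `logvar` holding the posterior means and log-variances (the names and batching convention are assumptions):

```python
import torch

def kl_to_standard_normal(mu, logvar):
    """Analytic KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over the
    latent dimensions and averaged over the batch."""
    kl_per_dim = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0)
    return kl_per_dim.sum(dim=-1).mean()
```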

Histogram of posterior samples

For a model $p(x)\,q(z \mid x, \lambda)$ to be consistent with a model $p(z)\,p(x \mid z, \theta)$, it is necessary that the aggregate posterior $\mathbb{E}_{X}[q(z \mid x, \lambda)]$ resembles the prior $p(z)$ we have chosen, a standard Normal distribution. To show this, we infer the approximate posterior for each data point in the validation set, sample 100 points from it, and create a histogram for each dimension of the latent space. The histograms of posterior samples for both directions are shown in Figure 3.6. We can see that, as required, the histograms of posterior samples for each dimension of the latent space resemble a standard Gaussian.

(a) Histogram of posterior samples for the En-De direction.

(b) Histogram of posterior samples for the De-En direction.

Figure 3.6: Histogram of posterior samples for the CoAEVNMT model per latent dimension.
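This check can be sketched as follows, assuming an `inference_model` that returns the posterior mean and log-variance for a batch of sentences; the interface and the plotting choices are assumptions.

```python
import matplotlib.pyplot as plt
import torch

def plot_posterior_sample_histograms(inference_model, batches, samples_per_sentence=100):
    """Sample from each sentence's approximate posterior and plot one
    histogram per latent dimension; in aggregate they should resemble N(0, 1)."""
    all_samples = []
    with torch.no_grad():
        for batch in batches:
            mu, logvar = inference_model(batch)          # each: [batch, latent_dim]
            std = (0.5 * logvar).exp()
            eps = torch.randn(samples_per_sentence, *mu.shape)
            samples = mu + eps * std                     # [samples, batch, latent_dim]
            all_samples.append(samples.reshape(-1, mu.shape[-1]))
    samples = torch.cat(all_samples).numpy()
    for d in range(samples.shape[-1]):
        plt.figure()
        plt.hist(samples[:, d], bins=50, density=True)
        plt.title(f"latent dimension {d}")
    plt.show()
```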

Sampling from the latent space

Here we sample source sentences from the latent space to further investigate posterior collapse. We do so by sampling latent sentence representations from the Gaussian prior and greedily decoding using the language model


to generate source sentences:

$z \sim \mathcal{N}(0, 1)$  (3.7)

$x_1^m = \arg\max_{x} \log P_\theta(x_1^m \mid z)$  (3.8)

If every sample were the same, this would imply that the latent sentence representation z does not contain usable information and that the posterior has collapsed. Examples of source sentences sampled from the prior are shown in Table 3.9. We see that for both directions the sentences differ greatly, which implies that the posterior has not collapsed.

En-De

Two little girls in white skirts runs in a flower bed. A boy in a yellow shirt is surfing.

A boy wearing a blue hat is climbing on the streets. A lady in a red coat takes care of a bicycle.

A man in a tank top is sleeping on a bench while a crowd of young women look on.

De-En

Ein hockeyspieler in einem rot-weißen trikot hält etwas für einen metalltopf.

Ein schwarzer hund spielt gitarre auf rotem sand, ein mitglied des blauen teams spielt auf dem boden. Ein alter mann mit rotem hemd und nacktem oberkörper sitzt an einem tisch.

Ein mann in jeder kleidung gibt seine arbeit nachts ab. Sieben kinder essen auf dem sofa.

Table 3.9: Source sentences sampled from the latent space using greedy decoding.
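A minimal sketch of this sampling procedure, assuming a `language_model` that maps a latent code and a prefix of token ids to next-token logits, and a `vocab` object exposing BOS/EOS indices and an `itos` list; these interfaces, as well as the 50-token limit, are assumptions.

```python
import torch

def greedy_sample_from_prior(language_model, vocab, latent_dim, max_len=50):
    """Draw z from the standard Normal prior and greedily decode a source
    sentence with the language model, token by token."""
    z = torch.randn(1, latent_dim)                                # z ~ N(0, I)
    tokens = [vocab.bos_idx]
    with torch.no_grad():
        for _ in range(max_len):
            logits = language_model(torch.tensor([tokens]), z)    # [1, len, vocab]
            next_token = logits[0, -1].argmax().item()
            if next_token == vocab.eos_idx:
                break
            tokens.append(next_token)
    return " ".join(vocab.itos[t] for t in tokens[1:])
```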

Sampling from the approximate posterior

In addition to sampling from the latent space, we can sample target sentences from the approximate posterior. We do so by taking a source sentence, inferring the Gaussian parameters using the inference model $Q_\alpha(z \mid x_1^m)$, sampling multiple latent sentence representations z, and greedily decoding using the translation model:

$z \mid x \sim Q_\alpha(z \mid x_1^m)$  (3.9)

$y_1^n = \arg\max_{y} \log P_\theta(y_1^n \mid x_1^m, z)$  (3.10)

Examples of target sentences sampled from the approximate posterior are shown in Table 3.10. Although the same source sentence is used, the samples show variation, implying that the latent sentence representation z contains usable information and that the posterior has not collapsed. The samples also show that the latent code is able to capture grammatical information, as the topic of the samples remains the same while the grammar varies. We do see duplicate samples from the model trained in the De-En direction, which may be the result of two sampled latent embeddings being too close to one another.

En-De

Ein polizist, auf dem rand eines fahrzeug bewaggert ist, ein fahrzeug zu erwischen. Ein polizist hantiert an einem fahrzeug am straße rand.

Ein polizist fährt auf einem seite am fahrzeug eines fahrzeugs vorbei. Ein polizist fährt auf einem straßenrand auf einem fahrzeug ein fahrzeug. Ein polizist rast am straße rand auf der straße ein fahrzeug.

De-En

A police officer is holding a vehicle on the side of the road. A policeman is holding a vehicle on the side of the road. A policeman holding a vehicle along side of the road. A policeman is holding a vehicle on the side of the road. A police officer is holding a vehicle along side the roadside.

Table 3.10: Target sentences sampled from the approximate posterior using greedy decoding.
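A sketch of the corresponding procedure for the translation model, assuming an `inference_model` that returns the Gaussian parameters for a source sentence and a `translation_model` exposing a hypothetical `greedy_decode(source, z)` method; all of these interfaces are assumptions.

```python
import torch

def sample_translations(inference_model, translation_model, source, num_samples=5):
    """Draw several z from the approximate posterior Q(z | x) and greedily
    decode a translation for each, to inspect the variation induced by z."""
    translations = []
    with torch.no_grad():
        mu, logvar = inference_model(source)             # each: [1, latent_dim]
        std = (0.5 * logvar).exp()
        for _ in range(num_samples):
            z = mu + torch.randn_like(std) * std         # reparameterised sample
            translations.append(translation_model.greedy_decode(source, z))
    return translations
```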

(26)

Interpolation

Finally, we apply interpolation between two points in the latent space. With interpolation we can examine what neighbouring points in the latent space have in common, and thus investigate how the model organises information in the latent space. We do so by taking two sentences from the data set, generating a set of points on the line between them, and using the language model to generate sentences from those points with greedy decoding. In Table 3.11 we see that the latent code captures both syntactic and semantic information. Points close to one another show a lot of similarity in the grammatical structure of the sentences. In addition, semantic information is also preserved between neighbouring points, as they show similarities in topic.

A girl playing softball hits the ball almost directly downwards A girl in a red shirt is playing softball.

A dog is running in the grass with a ball. A brown dog is running in the grass. A brown dog running in the grass. A brown dog running next to grass.

A man is inside a truck looking out with his left arm in front of a door. A man is standing in front of a car with a large door.

A man is standing in front of a large building. A man is standing in front of a large group of people. A group of people are standing in a field.

A group of girls are cheering.

A man in a dark jacket stands next to a man dressed in brown reaching down into a bag. A man in a black jacket and a black hat stands in front of a large building.

A man in a black jacket and a woman in a black jacket are standing in a doorway. A man in a black jacket and a woman in a black shirt are walking down a street. A woman in a black shirt is walking on a beach with a woman in a black jacket. A woman in a black shirt is walking on a beach.

A woman in a white shirt is walking on the beach. Woman in a bikini top is walking on the beach

Table 3.11: Generated sentences using interpolation of the latent space. Sentences shown in bold are sentences from the data set used to interpolate between.
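The interpolation itself can be sketched as follows, assuming the posterior means of the two endpoint sentences are available as tensors `z_start` and `z_end`, and reusing a hypothetical greedy decoder for the language model in the usage comment.

```python
import torch

def interpolate_latents(z_start, z_end, num_steps=8):
    """Evenly spaced points on the line segment between two latent codes,
    endpoints included."""
    alphas = torch.linspace(0.0, 1.0, num_steps).unsqueeze(-1)   # [num_steps, 1]
    return (1.0 - alphas) * z_start + alphas * z_end             # [num_steps, latent_dim]

# for z in interpolate_latents(z_start, z_end):
#     print(greedy_decode_from_latent(language_model, z.unsqueeze(0)))
```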

3.4

Related Work

Zhang et al. (2018) use a similar approach, training two conditional NMT models, one for each direction, with a joint EM optimization method. Initially, the two models are pre-trained using the bilingual data set. Afterwards, they start the Expectation-Maximization process, iterating between the E-step and the M-step. During the E-step, synthetic parallel corpora are created by completing the monolingual data sets using the model of the opposite direction. During the M-step, the models are trained on both the bilingual corpus and the monolingual corpus. In contrast to our approach, where we generate a single translation to create synthetic bilingual pairs, they use the n-best synthetic translations and weight them using the translation probabilities from the NMT model. This weighting is used to reduce the negative impact of noisy translations.

Cotterell and Kreutzer (2018) interpret back-translation as a single iteration of the wake-sleep algorithm (Hinton et al., 1995) for a joint generative model of bilingual sentences. Similar to our proposed CoNMT model, they learn two conditional NMT systems in the source-to-target and target-to-source directions. However, one model functions solely as an inference model and the other solely as a generative model. Furthermore, the models are not trained simultaneously. First, the inference model is trained on generated samples obtained by forward sampling during the sleep phase. Afterwards, they complete monolingual data using the inference model, which is used to train the generative model together with the bilingual data during the wake phase. The models are trained by alternating between those two phases until convergence.

Niu et al. (2018) proposed to train a single model for bi-directional neural machine translation, capable of back-translation and of incorporating any source or target monolingual data. Training data is constructed by swapping the source and target sentences of a parallel corpus and appending the swapped version to the original. An artificial token is added to the beginning of each source sentence to mark the desired target language. After training on the augmented parallel data set, the bi-directional model generates synthetic parallel data


from both source and target monolingual data, which can be used to improve the model.

In addition to back-translation, several other methods have been used to leverage monolingual data in order to improve machine translation. Gulcehre et al. (2015) integrated a language model pre-trained on target monolingual data into a pre-trained NMT system; a hidden layer is then fine-tuned to compute the output probability of the next word based on the concatenation of the hidden states of the RNNLM and the NMT system. Zhang and Zong (2016) proposed a multi-task learning framework to improve the encoder network by performing machine translation on bilingual data and sentence reordering on source-side monolingual data while sharing the same encoder network. He et al. (2016) trained NMT systems through a reinforcement learning process in which two agents communicate through two translation systems, one forward and one backward, and iteratively update the systems based on language-model likelihoods and reconstruction errors.

3.5

Discussion

We performed several ablation studies to optimise the performance of our models. Firstly, we experimented with different decoding techniques for generating back-translations to complete the monolingual data. Beam search consistently outperformed ancestral sampling and greedy decoding; however, performance was similar across the different beam sizes. Secondly, we showed the effect of pre-training the models in a bilingual-only setting. Results showed that performance increased with pre-training, indicating that poor back-translations at the start of the training phase can impact performance negatively. Finally, we investigated the effect of using a joint convergence metric. We saw that using a joint convergence metric slightly reduced the performance of our models; however, we deemed this acceptable as it cuts the training time in half.

We compared our models against two supervised baselines and one semi-supervised baseline in several settings. Firstly, we compared all models on the Multi30k data set. We saw that the CoAEVNMT model outperformed all baseline models, while the CoNMT model showed performance similar to the semi-supervised baseline. Secondly, we investigated the effect of reducing the available bilingual data. We saw that the semi-supervised models were overall more robust to this reduction. We did see, however, that the CoNMT model suffered more than the other semi-supervised models when reducing the data from half to a quarter.

We performed several analyses to investigate the back-translation quality of the different semi-supervised models and the latent variable of the CoAEVNMT model. First, we trained an RNNLM on either gold-standard or synthetic data from the semi-supervised models and calculated the perplexity on the other data set. The models seem to generate reasonably good data; however, some of the statistics of the original data distribution may be under-represented. Solving this may require careful optimisation tricks or a change in model design, e.g. making changes to the prior and posterior of the AEVNMT components. Furthermore, we investigated the lexical distributions of the different semi-supervised models, which showed that all models produce distributions similar to the real data. In addition, we trained a classification model on the latent sentence embeddings to distinguish between real and synthetic data. The model was incapable of fulfilling that task, which implies that real and synthetic data obtain similar representations in the latent space. A possible reason could be the simplicity of the data set, as grammar and vocabulary are limited. Moreover, we investigated the possibility of posterior collapse; several analyses showed this was not the case. Finally, we applied interpolation to investigate how the model organises information in the latent space. We saw that nearby points showed similarity in both syntax and semantics.
