
MSc Artificial Intelligence

Master Thesis

Domain Adaptation of Neural Machine Translation using Monolingual Data

by

Jorn Ranzijn

11138610

December 21, 2019

36 EC March - December, 2019

Supervisor:

Dr. C. Monz

Assessor:

Prof. dr. K. Sima’an


Acknowledgements

I would like to thank my supervisor Christof Monz, who gave me the freedom to pursue my own interests and always found the time to provide helpful advice when I needed it. His experience and guidance helped steer me towards an interesting research subject, for which I am very grateful. I would also like to thank my supervisors at the police, Mike and Arthur, whose enthusiasm helped me to maintain a positive outlook. They provided a lot of resources and feedback to make sure I could bring this thesis to a successful conclusion. Finally, I would like to thank my parents, brother, sister-in-law and friends, who gave invaluable advice, mental support and encouragement throughout my entire study and thesis.


Abstract

Domain adaptation with monolingual in-domain data in both languages is challenging since regular fine-tuning is not possible. However, there is still demand for domain-specific machine translation models and this means that the available monolingual data has to be used effectively. This thesis aims to improve the performance of a model on in-domain words by utilizing an in-domain lexicon created using unsupervised bilingual lexicon induction.

Specifically, a contextual language model is trained on in-domain data that helps indicate where words from the lexicon are probable substitutions in out-of-domain parallel data. This method therefore creates pseudo in-domain data which can be used for regular fine-tuning of a neural machine translation model. Furthermore, the method is compared against two other methods that use monolingual data, DALI and backtranslation.

The results indicate that the method is effective as a domain adaptation method, given a high-quality lexicon, although it underperforms compared to backtranslation. Closer inspection of the model predictions shows that all the methods provide similar benefits as they tend to improve the translation quality by reducing the error on in-domain words. The augmented data turns out to be slightly complementary to the benefits of backtranslation. However, this additional benefit seems mainly due to the data selection properties of the contextual language model and not because of the substitution of in-domain words. The results also indicate that the complementary benefits of DALI and backtranslation do not extend to a high-resource scenario and that DALI proves to be ineffective for a distant language pair.


Contents

1 Introduction
  1.1 Motivation
  1.2 Research questions
  1.3 Contributions
  1.4 Thesis outline
2 Background
  2.1 Neural encoder-decoder models
  2.2 Encoder-decoder with attention
  2.3 Transformer architecture
  2.4 Performance metric
  2.5 Cross-Domain Similarity Local Scaling
  2.6 Contextual BiLSTM Language Model
3 Related Work
4 Method
  4.1 Bilingual lexicon induction
  4.2 Data augmentation
  4.3 Comparison with DALI
5 Experimental Setup
  5.1 Data
  5.2 Bilingual lexicon induction
  5.3 Data augmentation
  5.4 Machine translation
6 Results and Discussion
  6.1 Bilingual lexicon induction
  6.2 Data augmentation
  6.3 Machine translation
    6.3.1 Unsupervised lexicon
    6.3.2 Supervised lexicon
  6.4 DALI performance
  6.5 Accuracy on in-domain words

1 Introduction

1.1 Motivation

In machine translation, the goal is to automatically translate text from one language into another using a computing device (Weaver, 1949/1955). The benefits of such a system are enormous as it allows people from around the globe to communicate without having to understand anything about each other’s language. It is therefore not surprising that machine translation has a rich history, starting out with rule-based approaches in the 1950s (Hutchins, 2004), statistical approaches in the 1980s (Brown et al., 1988), followed by neural network approaches in the 2010s (Sutskever et al., 2014; Vaswani et al., 2017).

The Georgetown demonstration in 1954 is one of the earliest public demonstrations of a working machine translation system (Hutchins, 2004). With a vocabulary of 250 words and six syntactical rules, it was used to translate a small amount of Russian sentences to English and was widely reported in the media at the time. Afterwards, the predictions from those in charge of the experiment about the future of machine translation were optimistic: "Although he emphasised it is not yet possible 'to insert a Russian book at one end and come out with an English book at the other', the professor forecast that 'five, perhaps three, years hence, interlingual meaning conversion by electronic process in important functional areas of several languages may well be an accomplished fact.'" The shift towards more complex methods, i.e. statistical and neural methods, has been accompanied by large improvements in translation quality, both in terms of automated metrics and human judgement. However, the ability to "insert a Russian book at one end and come out with an English book at the other" is still out of reach, despite claims of human parity (Hassan et al., 2018; Läubli et al., 2018).

Recent research by Koehn and Knowles (2017) has highlighted several important challenges that are hindering neural machine translation performance. One of these challenges is the reliance on large parallel corpora for training, as neural machine translation — and deep learning in general — is notoriously data-hungry. This is problematic because high-quality parallel corpora are costly to create, since they require human effort, which makes them a limiting factor for many practical applications. Still, there has been an increase in publicly available parallel corpora for many language pairs and this has helped develop capable machine translation systems (Tiedemann, 2012).

Unfortunately, these publicly available datasets prove to be a suboptimal solution when there is demand for high-quality domain-specific machine translations (Chu and Wang, 2018). A domain mismatch or domain shift occurs when the data at inference time comes from a different domain than the data at training time and, when present, leads to diminished translation quality. Table 1 contains an example that shows some of the problems that can occur when a domain mismatch is present. Here, two Dutch sentences are translated that have the same meaning but a different style, specifically slang and standard Dutch. It becomes obvious that the translation model does not come up with a desirable translation for the Dutch slang as it is not trained on this domain-specific data. It has trouble with words that it has not seen during training time and words that have a different meaning within the context of Dutch slang.

The domain mismatch problem is also one of the challenges described by Koehn and Knowles (2017), who define a domain-specific corpus as a corpus from a specific source that may differ from other domains in topic, genre, style, level of formality, etc. In this work, we adopt the same definition, but it is noteworthy that it can be useful to have more fine-grained distinctions based on topic and genre when one wants to adapt a learned statistical model (Van der Wees et al., 2015).

Table 1: A translation example that highlights the problems occurring with a domain mismatch. The two Dutch sentences have the same meaning but a different style. The first is in Dutch slang and the second is in standard Dutch. Sentences were translated with Google Translate.

Dutch sentences                           English translations
"Zin om wieries te klappen vanavond?"     "Want to slap whose hen tonight?"
"Zin om wiet te roken vanavond?"          "Want to smoke weed tonight?"

Overcoming the problems caused by a domain mismatch, specifically by trying to adapt a neural machine translation system such that it performs better on the data observed during inference time, has been the primary focus of domain adaptation. This has resulted in a large body of research, and a comprehensive overview of existing literature has been given by Chu and Wang (2018). A common way of adapting a neural machine translation model to a specific domain, for which there is a small amount of parallel data available, is by pretraining a model on out-of-domain data until convergence, followed by a smaller number of training steps on the in-domain data. This technique, called fine-tuning, is therefore reliant on parallel in-domain data while there are practical use cases where this data is not available (Luong and Manning, 2015).

One issue when trying to adapt a machine translation system towards a new domain is the increased presence of out-of-vocabulary (OOV) and rare words. Earlier research within the context of statistical machine translation has shown that these words have a detrimental effect on the produced translations (Daumé III and Jagarlamudi, 2011; Madhyastha and España-Bonet, 2017; Marton et al., 2009). However, the effect of OOV words and rare words on the quality of a neural machine translation system remains less clear.

Recently, methods have been introduced that alleviate problems caused by OOV and rare words by dividing whole words into subword units (Sennrich et al., 2016b; Wu et al., 2016). These methods have helped to reduce the vocabulary size and allow models to translate and produce words that have not been observed during training time (Sennrich et al., 2016b). Still, it appears that the presence of OOV and rare words remains a source of erroneous translations, and it is plausible that providing explicit translations for these words can help to reduce these errors (Sennrich et al., 2016b).

A solution might come from recent developments in the field of cross-lingual word embeddings. Mikolov et al. (2013b) noted that the geometric arrangements of word embedding spaces are similar between languages, see Figure 1. One reason for this is that

Figure 1: Example of word embedding spaces in English and Spanish. Word embeddings were reduced to two dimensions using PCA. Adapted from "Exploiting Similarities among Languages for Machine Translation" by Mikolov et al. (2013b).

all languages are grounded in the same world and that words are used to refer to similar concepts. Another reason is that languages can have a rich interdependent history. For instance, Italian and Spanish are both descendants of Latin. This conserved similarity makes it possible to find a sensible mapping of the word embedding space from one language to the other, thereby creating a cross-lingual word embedding space where words with similar meaning in both languages are grouped together (Mikolov et al., 2013b). A dictionary can be extracted from this shared embedding space by selecting the nearest neighbours of words in the other language, a process referred to as bilingual lexicon induction (BLI). Recent methods are now able to find a mapping of a word embedding space in a completely unsupervised way (Conneau et al., 2017; Artetxe et al., 2018). This makes it possible to extract a dictionary without the need for parallel data and might provide explicit translations of in-domain words from monolingual data.

1.2 Research questions

This thesis focuses on domain adaptation in neural machine translation with the challenging assumption that there is only parallel out-of-domain data and monolingual in-domain data available. Specifically, we are interested in the potential benefit of a lexicon, created by unsupervised BLI, for domain adaptation purposes. We hypothesize that this lexicon can be used to improve translation quality on in-domain words by fine-tuning on augmented parallel data containing word translations from the lexicon.

The quality of the lexicon is likely to influence the potential benefit of the augmented parallel data. Ideally, the lexicon produced by unsupervised BLI is representative of the in-domain words. For instance, the word hash can be used in the context of computer science or drugs and the correct translation therefore depends on the domain of interest. However, due to the smaller size of in-domain corpora in general, it is unclear whether unsupervised BLI can be used to create a high-quality lexicon as earlier works indicated that BLI in a low-resource setting can be problematic (Heyman et al., 2017). This leads to the first research question:

RQ1: To what extent can a lexicon be extracted from in-domain monolingual data using unsupervised bilingual lexicon induction?

The next step is to create parallel data that contains the in-domain words from the lexicon. Preferably, this data is similar to the in-domain data as that is likely to lead to the most probable context for the in-domain words and largest performance gains. Therefore, a contextual language model is trained on in-domain data and used to determine which words from the lexicon are a good substitution at specific positions in out-of-domain parallel sentences. A model is then fine-tuned on this augmented data in order to improve the performance on the in-domain data. This leads to the second research question:

RQ2: To what extent does fine-tuning on the augmented data improve the performance of a neural machine translation model on in-domain data?

Furthermore, we want to confirm that fine-tuning on the data with in-domain words is indeed beneficial for the translation quality of these specific words. This leads to the third research question:

RQ3: To what extent is the translation performance on the in-domain words influenced by adapting the model using the augmented data?

The augmented data provides explicit translations for in-domain words. Backtranslation, further elaborated in Section 3, is another method that makes use of monolingual data but it does not provide this explicit supervision (Sennrich et al., 2016a). This suggests that there might be complementary benefits between these two methods and leads to the final research question:

RQ4: To what extent are the benefits of the augmented data complementary to backtranslation?

1.3 Contributions

Lacking parallel in-domain data means that regular fine-tuning is not applicable and this makes it difficult to adapt a neural machine translation model. Developed at the same time and independently from the work outlined in this thesis is the method of Hu et al. (2019), called Domain Adaptation by Lexicon Induction (DALI). Our papers are the first to show that a lexicon extracted from monolingual in-domain data can be used to improve the translation quality on in-domain words.

The method outlined in this thesis is compared against two other methods, DALI and backtranslation, and the results indicate that they all provide similar benefits by improving the translation quality of in-domain words. For DALI specifically, we show that the complementary benefit of DALI and backtranslation does not seem to extend to a high-resource scenario, and that DALI is ineffective for a more distant language pair. The results indicate that the language model used for data augmentation is useful as a data selection method. Furthermore, the observed complementary benefit of the augmentation data and backtranslation seems mainly due to this data selection property and not due to the substituted in-domain words.

1.4 Thesis outline

Section 2 contains background information about the different model architectures used for the data augmentation and machine translation experiments. Furthermore, it discusses the performance metric used to assess the model prediction quality and the distance metric used for nearest neighbour selection during BLI. Section 3 discusses the related work, particularly other domain adaptation methods that use monolingual data and methods that focus on improving the translation of rare and OOV words. Section 4 gives a general overview of the method that produces the data with in-domain words and how it relates to DALI, while Section 5 focuses on the experimental setup. Section 6 shows and discusses the results. Finally, we conclude in Section 7 by reflecting on the research questions while also outlining possible future research directions.

2 Background

2.1 Neural encoder-decoder models

Machine translation is a sequence-to-sequence modelling task and this makes it naturally suited for encoder-decoder model architectures. In these models, a neural network first "encodes" a source sentence into vector representations and subsequently "decodes" these representations into a sentence in the target language. Formally, this can be formulated as a conditional language modelling task:

$$p(y_t \mid X, Y_{<t}) \tag{1}$$

where the goal is to predict a word at time step $t$ based on the encoder representation $X$ and the decoder representation $Y_{<t}$ of the previous time steps.

2.2 Encoder-decoder with attention

The earliest neural encoder-decoder models were based on recurrent neural network (RNN) units, specifically long short-term memory (LSTM) units (Elman, 1990; Sutskever et al., 2014; Cho et al., 2014). The final representation produced by the RNN-based encoder is fixed and this means that the amount of information that it can contain is limited. This can be problematic, especially for longer sentences, as at every time step the hidden representation is updated, possibly losing information from previous time steps. In machine translation, where the goal is to translate the entire input sentence, this can be detrimental to translation quality (Bahdanau et al., 2014).

Sutskever et al. (2014) noticed that reversing the input sentence is an approach to make this problem less severe for languages that follow a similar grammatical structure. That is because the decoder has no information from earlier time steps at the start and bases its predictions solely on the encoder information and first input. Therefore, having a final representation that contains high-quality information about the start of the input sentence simplifies the task while later time steps can make use of the additional target side predictions. Another possibility is to encode a sentence in both the forward and reverse direction with the help of a bidirectional RNN (Schuster and Paliwal, 1997; Bahdanau et al., 2014).

The issue with both of these approaches is that they still provide a representation of the input sentence that remains the same at every time step in the decoder. Bahdanau et al. (2014) introduced an adaptive approach, called attention, where a representation is created by "soft searching" the input sentence for the most relevant information. This circumvents the problem of having a single fixed representation and allows the model to select the relevant information from the input at every time step. This method has greatly improved the performance of neural encoder-decoder models, especially on longer sentences (Bahdanau et al., 2014).

The adaptive representation at time $i$, also called the context vector $c_i$, depends on the encoder hidden states $(h_1, h_2, \ldots, h_T)$, the decoder hidden state $s_i$, and the alignment scores $e_{ij}$ between the encoder hidden states and the decoder state. Here $e_{ij}$ is defined as:

$$e_{ij} = a(h_j, s_i) \tag{2}$$

where $a$ is a compatibility function that maps the hidden states to a single number, which can be interpreted as the contribution of a word in the input sentence, relative to the other words in the input, for predicting the next output word. Bahdanau et al. (2014) use a feedforward neural network to calculate this alignment score, which is also called additive attention:

$$e_{ij} = V \tanh(W[h_j; s_i]) \tag{3}$$

Here, $V$ and $W$ are weight matrices that are learned alongside the other encoder-decoder parameters. These alignment scores, one for each input word, are converted to a probability distribution using a softmax function:

$$p_{ij} = \frac{\exp(e_{ij})}{\sum_{z=1}^{T} \exp(e_{iz})} \tag{4}$$

Next, the context vector is calculated as a weighted sum of the encoder hidden states, where the alignment probabilities decide how much attention should be given to each specific hidden state of the input:

$$c_i = \sum_{j=1}^{T} p_{ij} h_j \tag{5}$$
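To make Equations 2-5 concrete, the following is a minimal PyTorch sketch of additive attention for a single decoding step. The class name and dimension arguments are illustrative assumptions, not part of the implementation used in this thesis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveAttention(nn.Module):
    """Minimal sketch of Bahdanau-style additive attention (Equations 2-5)."""

    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W = nn.Linear(enc_dim + dec_dim, attn_dim, bias=False)  # W in Eq. 3
        self.v = nn.Linear(attn_dim, 1, bias=False)                  # V in Eq. 3

    def forward(self, enc_states, dec_state):
        # enc_states: (T, enc_dim) encoder hidden states h_1..h_T
        # dec_state:  (dec_dim,)  current decoder hidden state s_i
        T = enc_states.size(0)
        dec_expanded = dec_state.unsqueeze(0).expand(T, -1)           # repeat s_i for every h_j
        scores = self.v(torch.tanh(self.W(
            torch.cat([enc_states, dec_expanded], dim=-1))))          # e_ij (Eq. 3)
        probs = F.softmax(scores.squeeze(-1), dim=0)                  # p_ij (Eq. 4)
        context = (probs.unsqueeze(-1) * enc_states).sum(dim=0)       # c_i (Eq. 5)
        return context, probs

# usage: context vector for one decoding step over ten encoder states
attn = AdditiveAttention(enc_dim=512, dec_dim=512, attn_dim=256)
c, p = attn(torch.randn(10, 512), torch.randn(512))
```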

2.3 Transformer architecture

The Transformer model is a neural network encoder-decoder architecture that was specifically developed for neural machine translation and has generally surpassed the performance of the RNN-based encoder-decoder with attention (Lakew et al., 2018; Vaswani et al., 2017). This architecture no longer uses any recurrent connections and consists of only feedforward neural networks and attention layers. It expands on the insight of Bahdanau et al. (2014) that having a fixed size representation of a word is a bottleneck and that attention can be used to determine what information is relevant.

In the RNN-based encoder-decoder, the context vector was calculated as a weighted average of the hidden states of the encoder. This helped to reduce the problem of having a fixed representation that is unable to store all the relevant information for every decoding step. However, this problem is not limited to the context vector representation but also occurs in the word representations of the encoder and decoder hidden states themselves (Vaswani et al., 2017). Languages contain long-distance dependencies and RNNs force the hidden representations along a path while words along that path have varying amounts of relevance to each other. A model can create better word representations when it is able to look at the representations of the other words in the input and select the relevant information. This can again be done with the help of attention.

Attention

Attention takes a number of vectors as input, specifically key-value pairs and queries, and maps these vectors to an output vector. This output vector is a weighted average of the values, where the weighting is based on the compatibility function between the keys and queries. There are multiple ways to calculate this compatibility function and the two most common methods are additive attention, see Equation 3, and dot-product attention (Vaswani et al., 2017; Bahdanau et al., 2014; Luong et al., 2015). The Transformer makes use of scaled dot-product attention where the compatibility function is a dot product between the keys and query, divided by a scaling factor $\sqrt{d_k}$, followed by a softmax function:

$$\text{Attention}(K, Q, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V \tag{6}$$

Here, $d_k$ is the dimension of the query and key vectors and the dot product is scaled by this factor in order to reduce its magnitude. This scaling is necessary since unscaled dot-product attention leads to diminished performance for large values of $d_k$. A likely explanation is that a large magnitude pushes the softmax function into regions where it has small gradients, leading to inefficient learning (Vaswani et al., 2017).

Attention makes it possible to create an output vector with relevant information by applying a weighted summation over the value vectors. However, there is no inherent reason why this should be limited to a single output vector. This led to the concept of multi-head attention where multiple weighted averages are created from the same value vectors. Each separate attention layer, referred to as an attention head, can potentially focus on different aspects of the input representation and inspection of the attention heads in the Transformer model showed that these heads tend to learn different tasks.

$$\text{MultiHead}(K, Q, V) = \text{Concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)W^O \tag{7}$$

$$\text{where } \text{head}_i = \text{Attention}(KW_i^K, QW_i^Q, VW_i^V) \tag{8}$$

where the matrices $W$ are model parameters with dimensions $W_i^K \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^Q \in \mathbb{R}^{d_{model} \times d_k}$, $W_i^V \in \mathbb{R}^{d_{model} \times d_v}$ and $W^O \in \mathbb{R}^{hd_v \times d_{model}}$.
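The following is a small, self-contained sketch of Equations 6-8 written with plain PyTorch tensor operations. The function names and toy dimensions are illustrative assumptions; a production implementation would additionally handle batching, masking and dropout.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Equation 6: softmax(QK^T / sqrt(d_k)) V."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

def multi_head_attention(q, k, v, w_q, w_k, w_v, w_o, num_heads):
    """Equations 7-8: project, split into heads, attend, concatenate, project."""
    seq_q, d_model = q.shape
    d_head = d_model // num_heads

    def split(x, w):
        # project and reshape to (heads, seq, d_head)
        return (x @ w).view(-1, num_heads, d_head).transpose(0, 1)

    heads = scaled_dot_product_attention(split(q, w_q), split(k, w_k), split(v, w_v))
    concat = heads.transpose(0, 1).reshape(seq_q, d_model)  # Concat(head_1..head_h)
    return concat @ w_o                                     # multiply with W^O

# toy self-attention over 10 positions with d_model = 512 and 8 heads, as in the Transformer
d_model, h = 512, 8
x = torch.randn(10, d_model)
w_q, w_k, w_v, w_o = (torch.randn(d_model, d_model) for _ in range(4))
out = multi_head_attention(x, x, x, w_q, w_k, w_v, w_o, h)
```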

Transformer encoder and decoder

Recurrent models have the advantage that they naturally handle the ordering of variable length sequences as they process the input in discrete steps. This advantage is lost with the Transformer, since it uses feedforward neural networks, and positional embeddings were introduced that account for the order and location of the input words. These positional embeddings are learned parameters whose values are added to the word embedding vectors and used as input for the encoder.

The encoder consists of six identical building blocks where each block contains a self-attention layer followed by a feedforward neural network. The self-attention layer uses multi-head attention with eight attention heads whose outputs are concatenated and fed through an additional linear layer, see Equation 7. Also, both layers use residual connections, which means that the output of each layer is summed with its respective inputs. These final outputs are normalized using layer normalization (Ba et al., 2016).

The decoder also consists of six identical building blocks and, in addition to the feedforward neural network and self-attention layer, these blocks contain an encoder-decoder attention layer that attends to the output of the final encoder block. This encoder-decoder attention layer serves a similar function as the attention mechanism described by Bahdanau et al. (2014) and is situated between the self-attention and feedforward neural network layer. The three layers make use of residual connections with layer normalization and all the attention layers use multi-head attention with eight heads. One important difference is that the self-attention in the decoder is not allowed to use the vector representation of words that have not been predicted up until that decoding step. This is achieved by masking the word representations of future time steps.

2.4 Performance metric

BLEU, bilingual evaluation understudy, is the most common automated metric to assess the quality of a machine translation model’s prediction (Papineni et al., 2002). It is a precision-based metric that measures the overlap in n-grams between a prediction and one or more ground truth sentences. A problem with BLEU is that it does not assign a high score to predictions that are valid translations but dissimilar to the ground truth. This makes BLEU relatively uninformative at a sentence level but taken over a complete corpus it tends to correlate well with human evaluation (Papineni et al., 2002; Dreyer and Marcu, 2012).
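For illustration, corpus-level BLEU can be computed with an off-the-shelf implementation such as sacrebleu; this is an assumption for the sake of the example and not necessarily the tooling used in this thesis, and the sentences are placeholders.

```python
# pip install sacrebleu
import sacrebleu

hypotheses = ["the cat sat on the mat", "he plays the guitar"]
references = [["the cat sat on the mat", "he is playing the guitar"]]  # one reference stream

# corpus-level BLEU; sentence-level BLEU is far less informative, as noted above
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```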

2.5 Cross-Domain Similarity Local Scaling

In order to extract a dictionary from cross-lingual word embeddings there needs to be a distance metric that determines the nearest neighbour of every word. A commonly used distance metric is the cosine similarity, but using this metric directly leads to suboptimal results due to the hubness phenomenon in high-dimensional spaces (Radovanović et al., 2010). Specifically, in these spaces there are some words that are the nearest neighbour of many other words, called hubs, and words that are not the nearest neighbour of any other word, called anti-hubs. Therefore, Conneau et al. (2017) introduced a modified metric called cross-domain similarity local scaling (CSLS). This metric measures the average cosine similarity, cos, for each word in the source language with its K nearest neighbours in the target language and vice versa:

$$r_T(Wx_s) = \frac{1}{K} \sum_{y_t \in \mathcal{N}_T(Wx_s)} \cos(Wx_s, y_t) \tag{9}$$

$$r_S(y_t) = \frac{1}{K} \sum_{Wx_s \in \mathcal{N}_S(y_t)} \cos(y_t, Wx_s) \tag{10}$$

where $Wx_s$ is a mapped embedding in the source language, $y_t$ an embedding in the target language and $\mathcal{N}$ the set of K nearest neighbours. The new similarity between two words is a modified cosine similarity where the score is penalized if the specific words tend to be hubs:

$$\text{CSLS}(Wx_s, y_t) = 2\cos(Wx_s, y_t) - r_T(Wx_s) - r_S(y_t) \tag{11}$$
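Equations 9-11 can be expressed compactly with matrix operations. The sketch below assumes L2-normalised embeddings (so a dot product equals the cosine similarity) and uses random placeholder data; the mutual-nearest-neighbour selection at the end mirrors the lexicon extraction step described later in Section 4.1.

```python
import numpy as np

def csls_scores(mapped_src, tgt, k=10):
    """CSLS similarities (Equations 9-11) between mapped source embeddings Wx_s
    and target embeddings y_t, assuming both sets are L2-normalised."""
    sims = mapped_src @ tgt.T                              # cos(Wx_s, y_t) for all pairs
    # r_T: mean similarity of each source word to its k nearest target neighbours (Eq. 9)
    r_t = np.sort(sims, axis=1)[:, -k:].mean(axis=1)
    # r_S: mean similarity of each target word to its k nearest source neighbours (Eq. 10)
    r_s = np.sort(sims, axis=0)[-k:, :].mean(axis=0)
    return 2 * sims - r_t[:, None] - r_s[None, :]          # Eq. 11: penalise hubs

# mutual nearest neighbours under CSLS form the extracted lexicon
src = np.random.randn(1000, 300); src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = np.random.randn(1200, 300); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
csls = csls_scores(src, tgt, k=10)
s2t = csls.argmax(axis=1)                                  # best target for each source word
t2s = csls.argmax(axis=0)                                  # best source for each target word
lexicon = [(i, j) for i, j in enumerate(s2t) if t2s[j] == i]
```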

2.6 Contextual BiLSTM Language Model

One challenge, as will be outlined in Section 4.2, is to find probable context for in-domain words by substituting these words into out-of-domain parallel sentences. Specifically, the aim is to find word positions in the out-of-domain sentence where in-domain words are a good replacement, where a replacement is considered good if the resulting sentence remains semantically and syntactically correct. The contextual biLSTM (cBiLSTM) language model predicts the probability of a word at a specific position in a sentence based on its full left and right context and seems particularly suited for this task (Mousa and Schuller, 2017). Formally, the model tries to solve a conditional language modelling task:

$$p(w_m \mid w_1^{m-1}, w_{m+1}^{M}) \tag{12}$$

This approach has a clear advantage over standard unidirectional language models as these models predict the probability of a word based on only the left or right context:

$$p(w_m \mid w_1^{m-1}) \tag{13}$$
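A minimal PyTorch sketch of such a cBiLSTM is shown below. It uses the layer sizes mentioned later in Section 5.3, but the class and the way the left and right states are combined are an illustrative reconstruction under those assumptions, not the exact model used in this thesis.

```python
import torch
import torch.nn as nn

class CBiLSTM(nn.Module):
    """Sketch of a contextual BiLSTM LM: scores w_m from its full left context
    w_1..w_{m-1} (forward LSTM) and full right context w_{m+1}..w_M (backward LSTM)."""

    def __init__(self, vocab_size, emb_dim=256, hidden=512, layers=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.fwd = nn.LSTM(emb_dim, hidden, num_layers=layers, batch_first=True)
        self.bwd = nn.LSTM(emb_dim, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, M) word indices of full sentences
        e = self.emb(tokens)
        left, _ = self.fwd(e)                         # left[:, j] summarises w_1..w_{j+1}
        right, _ = self.bwd(torch.flip(e, dims=[1]))  # run over the reversed sentence
        right = torch.flip(right, dims=[1])           # right[:, j] summarises w_{j+1}..w_M
        # to score position m, combine the state just left of m and just right of m
        left_ctx = left[:, :-2, :]
        right_ctx = right[:, 2:, :]
        return self.out(torch.cat([left_ctx, right_ctx], dim=-1))  # (batch, M-2, vocab)

model = CBiLSTM(vocab_size=50000)
probs = torch.softmax(model(torch.randint(0, 50000, (2, 12))), dim=-1)  # p(w_m | context)
```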

Another solution would be BERT, which is a Transformer encoder trained with a masked language modelling objective (Devlin et al., 2018). This model has resulted in state-of-the-art performance on a variety of language understanding tasks (Wang et al., 2018) and is likely to surpass the performance of the cBiLSTM. However, it does require large amounts of training data and is known to be computationally expensive to train, and is left as future work because of this.

3 Related Work

There have been multiple studies that make use of monolingual data in machine translation which are also applicable to domain adaptation. Currey et al. (2017) showed that copying the target side data to the source side and mixing this with true parallel data can improve model performance in low-resource scenarios. They also showed that the performance gains are primarily due to words that should remain the same in the source and target language, such as named entities. Interestingly, this method does not seem to extend to high-resource scenarios and can even hurt performance (Khayrallah and Koehn, 2018; Currey et al., 2017).

Another method that makes use of monolingual data is backtranslation where a model is first trained in the reverse translation direction, i.e. target-to-source, on parallel data (Sennrich et al., 2016a). Next, this trained model is used to translate the target side monolingual data, thereby creating predictions that can be used as synthetic parallel data. Including this synthetic parallel data in addition to true parallel data has been shown to improve model performance (Sennrich et al., 2016a). For the purpose of domain adaptation, monolingual in-domain data can be translated using a model that was trained in the reverse direction on out-of-domain parallel data, resulting in synthetic in-domain parallel data. Fine-tuning on this data proved to be an effective method for domain adaptation (Sennrich et al., 2016a).

Recently, promising results have been achieved with unsupervised neural machine translation, removing the need for any parallel corpora (Lample et al., 2017; Artetxe et al., 2017; Lample and Conneau, 2019; Song et al., 2019). Lample and Conneau (2019) identify three key principles that they believe are critical to the success of unsupervised machine translation. The first principle is finding a suitable initialization of the translation model parameters. The earliest works train unsupervised cross-lingual word embeddings and initialize the word embedding layers' parameters with these pretrained representations (Artetxe et al., 2017; Lample et al., 2017). Later approaches extend and improve on this approach by initializing the entire encoder and decoder with pretrained parameters (Lample and Conneau, 2019; Song et al., 2019).

The second principle is having a language modelling objective. This can be achieved by noisy autoencoding, which is fairly similar to the method of Currey et al. (2017) where the target side data is copied to the source side. The key difference is that tokens in the input sentence are swapped and masked with a fixed probability in order to prevent the model from learning how to copy.

The third principle is iterative backtranslation. Training on synthetic parallel data is critical in order to learn a mapping between the source and target language. Having a translation model in both the forward and reverse direction makes it possible to train in an iterative fashion as the synthetic parallel data improves the model, leading to better synthetic data in the reverse direction.

Still, there are multiple challenges remaining that make unsupervised machine translation less attractive compared to fine-tuning based approaches. One downside is that unsupervised machine translation relies on large amounts of monolingual data. This makes it less suitable for domain adaptation as domain-specific corpora tend to be relatively small (Koehn and Knowles, 2017; Tiedemann, 2012). Another downside is that it is computationally expensive to pretrain a model that can be used to initialize the encoder and decoder parameters. Finally, there is still a gap between the performance of supervised and unsupervised approaches in favor of supervised approaches.

All the mentioned methods are able to utilize monolingual in-domain data for the purpose of domain adaptation. However, there are also methods that explicitly focus on improving the translations of rare or OOV words.

Fadaee et al. (2017) augment parallel sentences by replacing words in plausible positions with infrequent words, thereby creating additional context for these rare words. Unidirectional language models are used to determine how plausible an infrequent word is at a given position in the target side sentence. Next, the most probable translation of the infrequent word is substituted in the source side sentence using lexical translation probabilities. This approach tries to ensure that the new parallel sentences remain valid translations that are both syntactically and semantically correct (Fadaee et al., 2017). Fine-tuning on the data generated by this approach turned out to be an effective method for improving rare word translations in a low-resource setting.

Arthur et al. (2016) incorporate probabilistic lexicons in a neural machine translation system using different weighting strategies. Biasing the model towards specific translations for rare words improved translation quality and led to reduced training times.

Kothur et al. (2018) obtain a curated list of novel word translations and treat this as parallel data. Fine-tuning on this data for a small number of iterations is useful for learning explicit word translations while maintaining similar translation performance at a document level.

The previously mentioned methods are able to improve translations of OOV and rare words and can be used in a domain adaptation setting. However, they all assume that parallel data is present from which a dictionary can be extracted and this is problematic in the case of domain adaptation with only monolingual data.

Developed at the same time and independently from the work outlined in this thesis is the method of Hu et al. (2019), called Domain Adaptation by Lexicon Induction (DALI). They extract a lexicon from in-domain monolingual data in two languages using unsupervised cross-lingual word embeddings. Next, they use this dictionary to create word-for-word translations of the target side in-domain data. The target word is simply copied when a word does not appear in the extracted lexicon. Experiments on datasets with varying sizes and from different domains show that DALI is an effective domain adaptation method. Also, their results show that backtranslation and DALI appear to have orthogonal benefits in a low-resource scenario for a closely related language pair.

4 Method

This section gives a general overview of the method that generates parallel sentences with in-domain words, which can be roughly divided into two parts. The first part covers lexicon induction and describes how a lexicon of the relevant in-domain words is created. The second part describes the data augmentation procedure where suitable augmentation positions in the out-of-domain parallel sentences are found and in-domain words from the lexicon are substituted. Figure 2 gives a graphical overview of the complete method; numbers in boldface throughout this section refer to steps in the flowchart.

4.1 Bilingual lexicon induction

This step follows the common lexicon induction procedure where at first word embeddings are trained on monolingual data for both the target and source language (1). Next, a joint embedding space is created with the help of a learned transformation that maps the source language word embeddings to the target language space (2)(Artetxe et al., 2018). A lexicon is extracted from the joint embedding space by selecting words that are mutual nearest neighbours using the CSLS metric (3), see Section 2.5.

The goal of the overall method is to generate data that contains words which are characteristic of the in-domain data but not of the out-of-domain data. This is similar to the problem of data selection, where data that is representative of the in-domain data has to be selected from a larger set of general data. A well-known approach to that problem is to select data based on the cross-entropy difference (Moore and Lewis, 2010), and a modification of this idea is used to calculate the importance of a word.

$$\text{score} = H_I(w_i) - H_O(w_i)$$
$$H_I(w_i) = -\log(p_I(w_i))$$
$$H_O(w_i) = -\log(p_O(w_i))$$

Here, $w_i$ is an in-domain word, and $\log(p_I(w_i))$ and $\log(p_O(w_i))$ are the log probabilities for the in-domain and out-of-domain data, respectively. Add-one smoothing is used in order to obtain valid log probabilities for in-domain words that do not appear in the out-of-domain data. Finally, a threshold is set on the score to determine the final lexicon of relevant in-domain words (4).
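A small sketch of this scoring and thresholding step is given below, under the assumption that both corpora are available as flat token lists; the function and variable names are illustrative, not taken from the thesis implementation.

```python
from collections import Counter
import math

def select_in_domain_words(in_domain_tokens, out_domain_tokens, min_in_count=5):
    """Sketch of the modified cross-entropy difference score from Section 4.1.
    The threshold corresponds to a word that occurs min_in_count times in the
    in-domain data and never in the out-of-domain data; words scoring at or
    below it form the lexicon of relevant in-domain words."""
    c_in, c_out = Counter(in_domain_tokens), Counter(out_domain_tokens)
    n_in, n_out = sum(c_in.values()), sum(c_out.values())
    v_in, v_out = len(c_in), len(c_out)

    def score(count_in, count_out):
        # add-one smoothing gives finite log probabilities for unseen words
        h_in = -math.log((count_in + 1) / (n_in + v_in))      # H_I(w)
        h_out = -math.log((count_out + 1) / (n_out + v_out))  # H_O(w)
        return h_in - h_out

    threshold = score(min_in_count, 0)
    return [w for w in c_in if score(c_in[w], c_out[w]) <= threshold]
```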

4.2 Data augmentation

The next goal is to find the positions in the out-of-domain parallel sentences where in-domain words are a good substitution. The original words in these parallel sentences do not always have a one-on-one alignment. This means that substituting an in-domain word in any of these positions is likely to result in a sentence that is no longer syntactically correct. Therefore, word alignments are used to determine the positions in a sentence that have a one-on-one word alignment (5). Next, a cBiLSTM, see Section 2.6, is trained on the target side data (6). This model returns a probability distribution over the vocabulary words and indicates how likely an in-domain word is at any given position in the sentence. The in-domain lexicon, word alignments and cBiLSTM are then used to augment out-of-domain parallel sentences. An in-domain word is substituted in the target side out-of-domain sentence if it is present in the top K most likely words returned by the cBiLSTM and if the position has a one-on-one word alignment. Finally, the corresponding translation is substituted in the source sentence where the position is also indicated by the word alignment. This results in out-of-domain parallel data containing in-domain word translations which can be used for fine-tuning (7).
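The sketch below illustrates this augmentation step for a single sentence pair. The alignment format, the direction of the lexicon (target word mapped to its source translation) and the `cbilstm_topk` helper are assumptions made for illustration; they are not components prescribed by the thesis.

```python
def augment_sentence(src_tokens, tgt_tokens, alignment, lexicon, cbilstm_topk, k=100):
    """Substitute one in-domain word into an out-of-domain sentence pair (Section 4.2).
    `alignment` is a list of (src_idx, tgt_idx) pairs from a word aligner;
    `cbilstm_topk(tokens, pos, k)` is assumed to return the k most probable
    target-side words for a position under the cBiLSTM;
    `lexicon` is assumed to map a target-side in-domain word to its source translation."""
    # keep only positions with a one-on-one word alignment
    src_counts, tgt_counts = {}, {}
    for s, t in alignment:
        src_counts[s] = src_counts.get(s, 0) + 1
        tgt_counts[t] = tgt_counts.get(t, 0) + 1
    one_on_one = [(s, t) for s, t in alignment if src_counts[s] == 1 and tgt_counts[t] == 1]

    for s_idx, t_idx in one_on_one:
        candidates = cbilstm_topk(tgt_tokens, t_idx, k)
        for tgt_word, src_word in lexicon.items():
            if tgt_word in candidates:
                # substitute the in-domain word on the target side and its translation on the source side
                new_tgt = tgt_tokens[:t_idx] + [tgt_word] + tgt_tokens[t_idx + 1:]
                new_src = src_tokens[:s_idx] + [src_word] + src_tokens[s_idx + 1:]
                return new_src, new_tgt                     # a single substitution per sentence
    return None                                             # no plausible substitution found
```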

4.3 Comparison with DALI

Compared with DALI, the method outlined in this work focuses on a subset of words that are representative of the in-domain data and we hypothesize that the remaining in-domain words are less problematic to translate due to their presence in the out-of-domain parallel corpora. Another difference is that DALI creates word-for-word translations of the in-domain data while this method substitutes in-domain words in parallel out-of-domain sentences. The word-for-word translation approach will likely produce syntactically incorrect sentences when words in the source sentence do not have a one-on-one word alignment with the words in the target sentence, and this problem does not necessarily occur with the augmentation strategy. A benefit of DALI is that the produced in-domain parallel data will match the distribution of the true in-domain data more closely compared to the augmented data, where in-domain words are substituted as long as they are a good fit according to the cBiLSTM. This means that a skewed distribution of in-domain words could be present in the augmented data compared to the true distribution.

Figure 2: Flowchart that gives an overview of the method; an elaborate description is provided in Section 4.

5 Experimental Setup

5.1 Data

An overview of the different corpora used in this work is given in Table 2; these datasets are all publicly available (Tiedemann, 2012; Ziemski et al., 2016; Antonova and Misyurev, 2011). The machine translation experiments are conducted on either the Russian-English or German-English language pairs. The subtitle corpus is used as in-domain data, for both Russian and German, while the other corpora are merged and function as out-of-domain data. In order to create monolingual in-domain data, the subtitle data is first shuffled and divided into two parts. After this, the monolingual corpora are finalized by selecting the target and source language data from the two different halves. Also, only 3 million subtitle sentences are selected at random in total for every language, since this is more representative of regular in-domain corpora sizes (Koehn and Knowles, 2017; Tiedemann, 2012).

Table 2: Corpora statistics of the English data. W/S is the average number of English words per sentence.

Corpus                           Sentences    Words         W/S
Subtitles 2016 (Ru-En)           15,444,091   122,596,750   7.9
News articles WMT 2018 (Ru-En)   235,159      5,952,620     25.3
Yandex (Ru-En)                   1,000,000    19,821,209    19.8
UN corpus (Ru-En)                23,239,280   562,432,851   24.0
Subtitles 2016 (De-En)           13,883,398   106,848,917   7.7
Wikipedia (De-En)                2,414,839    47,939,158    19.8
EUbookshop (De-En)               8,312,724    190,561,369   22.9

5.2 Bilingual lexicon induction

For Russian-English, word embeddings are created from either the in-domain data or the joint corpus of in-domain and out-of-domain data. This should help answer RQ1 by showing the difference in BLI performance given word embeddings trained in a low-resource and a high-resource scenario. For German-English, word embeddings are trained on only the joint corpus due to the outcome of the Russian-English experiments. Both FastText and Word2Vec (Bojanowski et al., 2017; Mikolov et al., 2013a) are used for the embeddings trained on the in-domain data while the joint set is trained using just Word2Vec. Both methods are available in the Gensim package (Řehůřek and Sojka, 2010) and default parameters are used during training, except for the window size, epochs and number of negative samples. These parameters are all set to ten as this improves over the default settings and is in line with the work of Hu et al. (2019). All trained embeddings have a dimension size of 300 unless stated otherwise. The sentences are tokenized using Spacy (Honnibal and Montani, 2017) with a corresponding language model for English and German. The Russian sentences are tokenized using the English language model as a corresponding language model is not available. This might be suboptimal but qualitative inspection showed tokenization of sufficient quality.

The word embeddings are then mapped in an unsupervised manner to a joined word embedding space using VecMap (Artetxe et al., 2018). Next, the similarity between the mapped word embeddings in the two languages is calculated with the CSLS metric and words that are each other’s nearest neighbours are selected. For the CSLS metric, the hyperparameter K is set to ten which is identical to earlier work (Conneau et al., 2017; Artetxe et al., 2018).

The publicly available dictionaries from the MUSE library are used as ground truth word translations in order to assess the quality of the extracted lexicon (Conneau et al., 2017). This is suboptimal since this test set is not truly representative of the in-domain words but it is still a good indication of the quality of the mapping. The coverage, percentage of words in the ground truth dictionary that appear in the lexicon, and accuracy, percentage of words that have a correct translation, are reported. Finally, a subset of the lexicons is created with the help of the modified cross-entropy difference score, see Section 4.1. The threshold is set to the value of words that occur at least five times in the in-domain data and zero times in the out-of-domain data. This is based on a trial-and-error approach and because words with fewer occurrences are excluded during word embedding training.

A lexicon is also extracted from the in-domain data using a supervised approach in order to evaluate the performance of domain adaptation methods when a high-quality lexicon is used. First, word alignments are created of the subtitle sentences using Fastalign and a phrase extraction algorithm returns the most likely word translations (Dyer et al., 2013; Koehn, 2010). The most probable one-on-one word translation is kept for every word as this resembles the lexicon produced in an unsupervised manner.
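For illustration, the Word2Vec training step with the hyperparameters listed above could look as follows in Gensim (4.x argument names). The placeholder corpus and output path are assumptions for the example, not taken from the thesis.

```python
# pip install gensim
from gensim.models import Word2Vec

# tokenised sentences of the joined corpus (one list of tokens per sentence); placeholder data
sentences = [["domain", "adaptation", "with", "monolingual", "data"]] * 100

# window size, epochs and negative samples set to ten, 300-dimensional embeddings,
# words occurring fewer than five times excluded (Gensim's default min_count)
model = Word2Vec(
    sentences,
    vector_size=300,
    window=10,
    epochs=10,
    negative=10,
    min_count=5,
)
model.wv.save_word2vec_format("embeddings.en.vec")  # text format usable as input for VecMap
```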

5.3 Data augmentation

The out-of-domain data is tokenized using Spacy, and word alignments are created using Fastalign. Next, a cBiLSTM is trained on the English in-domain data where words that occur fewer than three times are set to a special token in order to reduce the vocabulary size. The cBiLSTM consists of three hidden layers with 512 hidden units, an embedding layer with 256 units and is trained until convergence using Adam with a learning rate of $10^{-2}$ and learning rate decay with a factor of $10^{-2}$ (Kingma and Ba, 2014). Model performance is tested on a separate validation set after every epoch and the best model is selected. The Yandex and Wikipedia corpora functioned as out-of-domain parallel data for augmentation for Russian and German, respectively. Out-of-domain sentences that contain more than 100 words are discarded and unknown words are converted to the special token. Also, only a single position is augmented in every sentence because this likely results in fewer syntactic and semantic errors compared to multiple augmentations. A word is substituted in the sentence when it occurs in the top 100 most probable words, as indicated by the cBiLSTM. Finally, the top 500 sentences with the highest probability are selected for every word. Lower values for these two hyperparameters mean that the words are more likely to be a good fit but also lower the size of the final dataset.

5.4 Machine translation

All sentences are tokenized using byte pair encoding and a fixed-size vocabulary of 40,000 tokens is created by jointly processing the source and target language. For Russian, the data is first transliterated from Cyrillic to the Latin script before creating the byte pair tokens as described by Sennrich et al. (2016b). Sentences that are shorter than three tokens or longer than 150 tokens after the byte pair encoding transformation are discarded. No further preprocessing steps are applied to the data.

The original Transformer architecture, see Section 2.3, and training procedure are used for all machine translation experiments, as implemented in the Pytorch version of OpenNMT (Paszke et al., 2017; Klein et al., 2017). Models are trained with either ten Nvidia 1080 Ti or three Nvidia RTX 2080 Ti GPUs due to hardware availability. The baseline model is trained on just the out-of-domain data and model performance is tested after every 2000 training steps on a separate validation set. Training stops when there has been no improvement for ten consecutive validation runs.

The domain adaptation method in this work is also compared against DALI and backtranslation (Hu et al., 2019; Sennrich et al., 2016a). Furthermore, a model is trained on the original sentences that are selected by the cBiLSTM. The difference in performance between models trained on the augmented and original sentences can help to answer RQ2 by clarifying the contribution of substituting in-domain words. DALI is fine-tuned on a joined set of in-domain and out-of-domain data, which matches the approach described by Hu et al. (2019), while other methods solely use the in-domain data. Both DALI and the augmentation data are combined with backtranslation data in order to discover the possible complementary benefits, which helps answer RQ4. The fine-tuning procedure uses no warm-up steps and a reduced learning rate of $10^{-3}$. Performance is tested every 100 training steps on a separate validation set and training stops when there is no improvement after five consecutive validation runs. Differences between BLEU scores are checked for significance using Randomized Significance Testing (Riezler and Maxwell, 2005) with 2000 random shuffles.
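A sketch of such a paired approximate randomization test on BLEU is given below; the use of sacrebleu for scoring and the function names are assumptions for illustration, as the thesis does not specify the implementation.

```python
import random
import sacrebleu

def approximate_randomization(sys_a, sys_b, refs, n_shuffles=2000, seed=0):
    """Paired approximate randomization test (Riezler and Maxwell, 2005).
    For each shuffle the two systems' outputs are swapped per sentence with
    probability 0.5 and the absolute BLEU difference is compared to the observed one."""
    rng = random.Random(seed)

    def bleu(hyps):
        return sacrebleu.corpus_bleu(hyps, [refs]).score

    observed = abs(bleu(sys_a) - bleu(sys_b))
    exceeded = 0
    for _ in range(n_shuffles):
        shuf_a, shuf_b = [], []
        for a, b in zip(sys_a, sys_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        if abs(bleu(shuf_a) - bleu(shuf_b)) >= observed:
            exceeded += 1
    return (exceeded + 1) / (n_shuffles + 1)  # p-value estimate

# p = approximate_randomization(system_a_outputs, system_b_outputs, reference_sentences)
```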

6 Results and Discussion

6.1 Bilingual lexicon induction

Table 3 shows the accuracy on the ground truth English-Russian dictionary when word embeddings are trained on the in-domain data. What becomes apparent is that the accuracies are low and that this is independent of the embedding size or training method used.

Table 3: Accuracies on the ground truth dictionary for English-Russian using different embedding dimensions and training methods.

Dimension   50     100    300    512    1024
Word2Vec    2.61   1.57   5.00   3.28   2.70
FastText    0.09   0.62   1.70   -      -

The accuracy and coverage on the ground truth dictionary for the lexicon obtained from the joined set of in-domain and out-of-domain data and for the supervised lexicon are shown in Table 4. Compared with the results in Table 3, it is evident that the increased amount of data is necessary for a high-quality mapping and this seems due to the improved quality of the word embeddings. Also, comparing the unsupervised with the supervised approach confirms the higher quality of the lexicon obtained with the supervised method. Reflecting on RQ1, these results confirm the observation of Heyman et al. (2017) that training on a smaller corpus is problematic for BLI. Also, Heyman et al. (2017) show that relying on the embeddings of low-frequency words for the mapping of the two embedding spaces contributes to the diminished performance.

Table 4: Coverage and accuracy on the ground truth dictionary for the unsupervised lexicon trained on the joined set of in-domain and out-of-domain data and supervised lexicon. Coverage refers to the percentage of words in the ground truth dictionary that were present in the extracted lexicon.

                       Coverage   Accuracy
En-Ru (unsupervised)   62.96      35.17
En-De (unsupervised)   67.62      76.60
En-De (supervised)     97.56      86.31

Looking at the results of the two different language pairs, it is noteworthy that there is a similar coverage but a large difference in the obtained accuracies. This indicates that obtaining a high-quality lexicon from cross-lingual word embeddings is intrinsically harder for Russian-English than for German-English and is in line with results obtained in earlier work (Glavaš et al., 2019). Specifically, VecMap assumes that an orthogonal mapping exists between the word embedding spaces of the two languages. It turns out that this assumption gets increasingly violated for more etymologically distant language pairs and this helps explain the reduced performance of Russian-English (Riley and Gildea, 2018). Only keeping the elements in the lexicons that are each other’s nearest neighbours ensures that translations are more likely to be correct, but also means that more words are discarded. For Russian, the number of discarded words is higher and Table 5 shows the final vocabulary sizes of the obtained lexicons. It also shows the percentage of words in the English in-domain data for which the lexicons have a corresponding translation. Unsurprisingly, having a reduced lexicon size means that there are more words for which the lexicons do not have a corresponding translation, thus leading to a lower in-domain coverage. Having a high in-domain coverage is critical for DALI as this method produces a word-for-word translation of the in-domain words and otherwise copies the target side word.

Table 5: Vocabulary sizes of the different lexicons and the percentage of words in the in-domain data for which these lexicons have a direct translation.

                     Vocabulary size   In-domain coverage
En-Ru unsupervised   100,324           69.24
En-De unsupervised   208,099           79.83
En-De supervised     273,312           99.81

A subset of the extracted lexicon is created using the modified cross-entropy difference. Figure 3 shows the ordered scores and the threshold is highlighted in orange. This threshold corresponds to words that appear five times in the in-domain data and zero times in the out-of-domain data. Selecting words beneath the threshold results in final lexicon sizes of 2360 and 4907 for German-English and Russian-English, respectively. Selecting words beneath the threshold from the supervised lexicon results in a final lexicon size of 5671.

Figure 3: The modified cross-entropy difference score of in-domain words for (a) Russian-English and (b) German-English. The scores are ranked from low to high and the threshold is highlighted in orange.

6.2 Data augmentation

A random selection of augmented out-of-domain sentences is shown in Table 6 and highlights some qualitative aspects of the cBiLSTM. In general, the model substitutes words at syntactically correct places but has trouble creating sentences that are semantically plausible. This is also apparent in Table 6, where the first, second and fourth example all make sense syntactically but not semantically. The all-capital word ’JIMMY’ occurs often in subtitles before a colon and therefore makes sense at that position. However, it is followed by German words, and these are cast to unknown tokens by the language model since it is only trained on in-domain data, resulting in a semantically implausible sentence. Only the third example is syntactically and semantically plausible while the last example is neither.

Implausible sentences are a recurring problem when the cBiLSTM is trained on only in-domain data. Ideally, training on the in-domain data leads to a model that is able to estimate where specific in-domain words are a good fit in the out-of-domain sentence. However, having lots of unknown words because of the mismatch between the data at training and testing time makes this difficult. Looking at the out-of-domain data, more than 70 percent of the sentences have at least one unknown word and close to 9 percent of the total words do not occur in the cBiLSTM vocabulary. Training on the joined set of in-domain and out-of-domain data would lead to a reduced amount of unknown words but makes it harder to estimate how well the context and in-domain word match the true in-domain data.


Table 6: Random samples from the German-English augmented data, shown as pairs of an original sentence and its augmented version. The replaced word in each original sentence and the substituted word in each augmented sentence are the ones in which the two sentences of a pair differ.

Original: During the ninth week, the pups venture out of the pouch and onto the mother’s back, where they remain for six weeks.
Augmented: During the ninth week, the Ferengi venture out of the pouch and onto the mother’s back, where they remain for six weeks.

Original: ” in: Jahrbuch der Karl May-Gesellschaft 1978, S. 9 - 36.
Augmented: ” JIMMY: Jahrbuch der Karl May-Gesellschaft 1978, S. 9 - 36.

Down! Hurry!

Original: He was cremated, and his ashes were scattered.
Augmented: He was cremated, and his eyeballs were scattered.

Original: He subsequently returned to Switzerland, where he married Elisabeth de Reynold (a daughter of Gonzague de Reynold) and pursued an academic career.
Augmented: He subsequently returned to Switzerland, where he married Elisabeth de Reynold (a Beeps of Gonzague de Reynold) and pursued an academic career.

6.3 Machine translation

6.3.1 Unsupervised lexicon

Results of the machine translation experiments for Russian-English and German-English are shown in Table 7 and we first discuss the results relevant to RQ2.

Looking at the results of the models trained on the original sentences, it becomes clear that this data is already beneficial for domain adaptation purposes. It is known that neural language models trained on in-domain data can be used to select in-domain sentences from a general corpus (Duh et al., 2013). Since the cBiLSTM is trained on in-domain data, it should favor in-domain words within context that is representative of the in-domain data. Finding plausible context by the cBiLSTM therefore suggests that the model is selecting sentences that are also representative of the in-domain data, thereby doing a form of data selection. However, further experimentation is needed to determine whether this is true as the data used for augmentation is only a subset of the complete out-of-domain corpus.

For Russian-English, the score of the model trained on the augmented sentences is lower than that of the model trained on the original sentences, whereas the opposite holds for German-English. This indicates that adding the in-domain words hurts performance for Russian-English but benefits German-English. Looking at DALI,


a similar pattern can be observed: the gain for Russian-English is minimal, while there is a larger gain for German-English. These results suggest that the quality of the lexicon is critical for the augmentation strategy and DALI to be beneficial. Specifically, the accuracy on the ground truth dictionary (see Table 4) is around 35 percent for Russian-English and 76 percent for German-English. Introducing faulty in-domain word translations is therefore much more likely for Russian-English, which can explain the reduced performance.

Also, Braune et al. (2018) show that bilingual lexicon induction performs dramatically worse on rare and infrequent words than on more frequent words. Having only a limited amount of context for a word results in a word embedding that is likely to end up in a relatively random position in the cross-lingual mapping space. As a consequence, the obtained in-domain lexicon probably contains more incorrect translations than the ground truth dictionary suggests, which also helps explain the lack of a larger improvement for both Russian-English and German-English.

The contribution of the in-domain words seems relatively low for the German-English data, even though the difference between the original and augmented model scores is significant. It turns out that approximately 5.4 and 3.8 percent of all words in the test set are identical to the words that were substituted when augmenting the out-of-domain data for Russian-English and German-English, respectively. The overall BLEU gain achievable through the substitutions may therefore be inherently limited, since these in-domain words occur infrequently in the test set. It would be interesting to expand the lexicon to include more in-domain words and see whether this leads to larger gains.
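The overlap figure can be obtained with a simple count, sketched below under assumed file names and plain whitespace tokenization; it is not the exact script behind the reported 5.4 and 3.8 percent.

    # Sketch: what fraction of the test-set tokens is identical to one of the
    # substituted in-domain words?
    def substituted_word_share(test_path, substituted_words_path):
        with open(substituted_words_path, encoding="utf-8") as f:
            in_domain = {w.strip() for w in f if w.strip()}
        matched, total = 0, 0
        with open(test_path, encoding="utf-8") as f:
            for line in f:
                tokens = line.split()
                total += len(tokens)
                matched += sum(1 for t in tokens if t in in_domain)
        return matched / total if total else 0.0

    # Assumed file names for the English test set and the list of substituted words.
    print(f"{substituted_word_share('test.en', 'substituted_words.txt'):.1%}")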

Regarding RQ4 and the German-English language pair, both the original data and the augmentation data in combination with backtranslation data provide small but significant benefits compared to backtranslation data alone. However, the difference between the two methods in combination with backtranslation is not significant. This suggests that the gain might be due to the out-of-domain data selected by the language model rather than the substituted in-domain words; Section 6.5 provides additional support for this claim.


Table 7: BLEU scores on the in-domain data. Here ’AUG’ refers to the augmented sentences, ’ORI’ to the original unaugmented sentences and ’BT’ to backtranslation. Best scores are in boldface. Significance testing results can be found in appendix II. **Models that are fine-tuned on a machine with three GPUs instead of ten due to hardware availability. Mentioned for completeness, as this leads to a smaller effective batch size and can affect final scores (Popel and Bojar, 2018).

Model type          Russian-English   German-English
Baseline            10.97             13.34
BT                  13.32             17.21
DALI                11.12             14.42
ORI                 12.51             14.14
AUG                 12.08             14.51
AUG + ORI           12.28             14.71
DALI + BT           12.83             16.13**
AUG + BT            13.33             17.49**
ORI + BT            13.36             17.52**
AUG + ORI + BT      13.43             17.15**

6.3.2 Supervised lexicon

A lexicon is extracted in a supervised manner in order to measure the influence of a larger and more accurate lexicon on DALI and the augmentation strategy. The results are shown in Table 8 and indicate performance similar to the unsupervised result for the augmentation strategy but an improvement for DALI. The improvement of DALI is in line with expectations, given that increased in-domain coverage (see Table 5) is known to improve performance (Hu et al., 2019). That the augmentation strategy does not seem to profit is surprising, since a slightly larger and more accurate lexicon should only be beneficial.

Table 8: German-English results with the supervised lexicon (BLEU).

Model type   German-English
Baseline     13.34
DALI         15.10
ORI          14.35
AUG          14.49


6.4 DALI performance

DALI does improve over the baseline method for German-English, which is in line with the results of Hu et al. (2019), but does not help for Russian-English. Multiple factors can contribute to this failure. One important factor seems to be the quality of the Russian-English lexicon. An accuracy of close to 35 percent (see Table 4) suggests that in many cases a word-for-word translation is wrong, which leads to low-quality synthetic data. Also, the size of the lexicon becomes smaller due to the removal of translations that are not mutual nearest neighbours. As a consequence, the coverage of the in-domain data is also reduced, to close to 70 percent (see Table 5). DALI copies the word from the target sentence when it is not present in the lexicon, meaning that almost 30 percent of the source-side words are English instead of Russian. This is in line with Hu et al. (2019), who show that reducing the in-domain coverage reduces DALI's performance.
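A minimal sketch of this word-for-word synthesis with copy fallback is given below. It only illustrates the mechanism of Hu et al. (2019) as summarized here; the file names and the tab-separated lexicon format are assumptions, and the actual DALI implementation differs in its details.

    # Sketch: build synthetic source sentences by mapping each English in-domain
    # token through an English-to-Russian lexicon, copying tokens that are missing.
    def load_lexicon(path):
        lex = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                parts = line.rstrip("\n").split("\t")
                if len(parts) == 2:
                    lex[parts[0]] = parts[1]
        return lex

    def word_for_word(sentence, lexicon):
        # Copying the target word is what inflates the share of English tokens
        # on the synthetic source side when lexicon coverage is low.
        return " ".join(lexicon.get(tok, tok) for tok in sentence.split())

    lexicon = load_lexicon("lexicon.en-ru.tsv")                  # assumed file name
    with open("in_domain.mono.en", encoding="utf-8") as fin, \
         open("synthetic.ru", "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(word_for_word(line.strip(), lexicon) + "\n")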

Word alignments were created using Fastalign on the WMT 2018 news articles for Russian-English and German-English in order to measure the number of one-to-one word alignments. This turns out to be 50 percent and 58 percent on average per sentence for Russian-English and German-English, respectively. A low number of one-to-one word alignments means that the resulting word-for-word translations are more likely to be syntactically incorrect, which negatively influences the quality of the synthetic data. However, for German-English the percentage of one-to-one alignments is not much higher, and DALI does prove to be beneficial for that language pair. This suggests that one-to-one alignments are less important than the other factors, i.e. lexicon accuracy and in-domain coverage, in explaining the performance difference between the two language pairs. This is also in line with the results of the supervised lexicon, where increased in-domain coverage and accuracy do lead to a substantial improvement.
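The one-to-one statistic can be computed directly from the alignment files, as sketched below for the standard "i-j" output format produced by Fastalign. The file name and the exact definition of a one-to-one link (both its source and its target index occur in exactly one link of the sentence) are assumptions about how the reported numbers were obtained.

    # Sketch: average per-sentence share of one-to-one alignment links.
    from collections import Counter

    def one_to_one_share(alignment_path):
        per_sentence = []
        with open(alignment_path, encoding="utf-8") as f:
            for line in f:
                links = [tuple(map(int, pair.split("-"))) for pair in line.split()]
                if not links:
                    continue
                src_counts = Counter(s for s, _ in links)
                tgt_counts = Counter(t for _, t in links)
                one_to_one = sum(1 for s, t in links
                                 if src_counts[s] == 1 and tgt_counts[t] == 1)
                per_sentence.append(one_to_one / len(links))
        return sum(per_sentence) / max(len(per_sentence), 1)

    print(f"average one-to-one links per sentence: {one_to_one_share('ru-en.align'):.1%}")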

Looking at the combination of DALI and backtranslation, it becomes clear that DALI reduces the performance compared to backtranslation alone, which contradicts the results obtained by Hu et al. (2019). They asserted that DALI and backtranslation have orthogonal benefits, since DALI provides explicit translations of in-domain words while backtranslation ”utilizes the highly related contexts to translate unseen in-domain words”. An important difference between this work and the work of Hu et al. (2019) is the size of the corpora used to train the baseline and backtranslation models. They adapt a baseline model that is trained on a corpus of approximately 300,000 sentences, while this work uses more than ten million. This means that there is a large difference in baseline performance, which is highlighted by the respective BLEU scores of 5.49 and 13.34. Similarly, their backtranslation model is trained on the same small corpus in the reverse direction, leading to weaker backtranslation data, and respective BLEU scores of 11.48 and 17.21 for the adapted models. As a consequence, the adapted model of Hu et al. (2019) that uses DALI and backtranslation data has a score similar to the unadapted model in this work. The lack of improvement therefore suggests that the orthogonal benefits do not necessarily extend to a high-resource scenario.


6.5 Accuracy on in-domain words

Figure 4 illustrates the prediction accuracies on the subset of words that were substituted using the augmentation strategy, for the two language pairs and the different model types. Specifically, the English ground truth test set sentences and the model predictions are tokenized and checked for the presence of in-domain words. An in-domain word in the ground truth sentence then counts as correctly predicted when it is also present in the corresponding model prediction. The figure therefore illustrates the improved translation of in-domain words by the different model types and can help answer RQ3.

Looking at the Russian-English results, there is barely an increase in performance for the different methods, except for the backtranslation model. The lack of increase can again be due to the quality of the lexicon, as many incorrect translations are substituted into the Russian sentences. This means that an explicit mapping between the true Russian translation and the English in-domain word is not learned, which lowers the likelihood that it is output by the model. For German-English, the improvement is more apparent for all the methods that introduce in-domain words, although backtranslation still performs best.
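The accuracy computation described above can be sketched as follows; the file names are assumptions and whitespace splitting stands in for the tokenizer used in the thesis.

    # Sketch: an in-domain word in a reference sentence counts as correctly
    # predicted when it also appears in the corresponding model output.
    def in_domain_accuracy(reference_path, hypothesis_path, in_domain_words):
        correct, total = 0, 0
        with open(reference_path, encoding="utf-8") as ref_f, \
             open(hypothesis_path, encoding="utf-8") as hyp_f:
            for ref, hyp in zip(ref_f, hyp_f):
                hyp_tokens = set(hyp.split())
                for token in ref.split():
                    if token in in_domain_words:
                        total += 1
                        correct += int(token in hyp_tokens)
        return correct / total if total else 0.0

    with open("substituted_words.txt", encoding="utf-8") as f:   # assumed file name
        in_domain_words = {w.strip() for w in f if w.strip()}
    print(f"{in_domain_accuracy('test.ref.en', 'model.hyp.en', in_domain_words):.1%}")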

Interestingly, there seem to be no significant complementary benefits when backtranslation is combined with the other methods. The hypothesis at the beginning of this thesis was that the lexicon extracted using cross-lingual word embeddings would be helpful for in-domain words. This seems to be true since, for the single methods, there is a clear increase in the accuracy on these words that is disproportionate to the increase on non-in-domain words, see appendix I. However, backtranslation also correctly predicts many of these in-domain words without the need for explicit supervision, and the benefit of combining these methods is negligible.

Conjecturing about the possible cause, Fadaee and Monz (2018) show that backtranslation data is mainly beneficial for words that have high prediction losses in the baseline and that these are generally the more infrequent words. Since the in-domain words selected using the cross-entropy metric are also generally infrequent in the baseline data, this suggests that the augmentation strategy and DALI are less complementary to backtranslation than conceived in advance.


Figure 4: Accuracy on substituted in-domain words in the test set. (a) Single methods; (b) Mixed methods.


7 Conclusion and Future Work

This section reflects on the four research questions that were outlined at the end of the introduction and discusses possible directions for future research.

RQ1: To what extent can a lexicon be extracted from in-domain monolingual data using unsupervised bilingual lexicon induction?

The results in section 6.1 show that the in-domain data alone is too small to induce a high-quality lexicon and that larger corpora are needed. Also, the results of the domain adaptation methods, DALI and the augmentation strategy, suggest that a high-coverage and accurate lexicon is critical and that this outweighs the importance of having the word embeddings represent the in-domain data only. As a consequence, training the word embeddings on just the in-domain data should only be preferred over the joint set when it provides similar coverage and accuracy.

RQ2: To what extent does fine-tuning on the augmented data improve the performance of a neural machine translation model on in-domain data?

The results in section 6.3.1 show that the augmentation strategy is beneficial for German-English but not for Russian-English, and that the lack of improvement for Russian-English seems to be due to the quality of the induced lexicon. Also, a large part of the observed improvement is due to the out-of-domain sentences selected by the cBiLSTM, as the additional improvement from augmenting the sentences is relatively small. For future research, it would be interesting to increase the number of in-domain words in the lexicon and see whether this leads to more substantial improvements. The results of the data augmentation model also indicate that it is easier to obtain syntactically correct than semantically correct sentences. An obvious next step would be to train the cBiLSTM on the joint corpus of in-domain and out-of-domain data, as was done for the word embedding training. This could improve the syntactic and possibly the semantic performance, although it is unclear whether the cBiLSTM would still be useful for in-domain data selection.

RQ3: To what extent is the translation performance on the in-domain words influenced by adapting the model using the augmented data?

Closer inspection of the improvement on the German-English corpus shows that the error on in-domain words is indeed disproportionately reduced by the augmentation strategy proposed in this thesis, see section 6.5. For Russian-English, a similar improvement is not observed, and this is possibly due to the quality of the lexicon, since an explicit mapping between the correct source word and target word is not learned when the lexicon has an incorrect translation.

RQ4: To what extent are the benefits of the augmented data complementary to back-translation?

For German-English, the augmentation data in combination with backtranslation data is significantly better than backtranslation data alone, and this also holds for the original sentences in combination with backtranslation data. However, since the augmentation data combined with backtranslation data is not better than the original sentences combined with backtranslation data, this suggests that the improvement is not due to the substitution of in-domain words but to the original sentences themselves. This is in line with the lack of improvement on in-domain words when augmented data is combined with backtranslation data.

A similar pattern holds for DALI and backtranslation, which contradicts earlier results by Hu et al. (2019), and this might be due to the difference in the corpora sizes used to train the baseline and backtranslation models. Still, Hu et al. (2019) do show that DALI outperforms backtranslation when translating OOV words, which suggests a practical application of the data augmentation methods in combination with backtranslation. Further experimentation is necessary to show whether these benefits extend to a high-resource scenario.


References

A. Antonova and A. Misyurev. Building a web-based parallel corpus and filtering out machine-translated text. In Proceedings of the 4th Workshop on Building and Using Comparable Corpora: Comparable Corpora and the Web, pages 136–144. Association for Computational Linguistics, 2011.

M. Artetxe, G. Labaka, E. Agirre, and K. Cho. Unsupervised neural machine translation. arXiv preprint arXiv:1710.11041, 2017.

M. Artetxe, G. Labaka, and E. Agirre. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 789–798, 2018.

P. Arthur, G. Neubig, and S. Nakamura. Incorporating discrete translation lexicons into neural machine translation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1557–1567, 2016.

J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146, 2017.

F. Braune, V. Hangya, T. Eder, and A. Fraser. Evaluating bilingual word embeddings on the long tail. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 188–193, 2018.

P. Brown, J. Cocke, S. D. Pietra, V. D. Pietra, F. Jelinek, R. Mercer, and P. Roossin. A statistical approach to language translation. In Proceedings of the 12th conference on Computational linguistics-Volume 1, pages 71–76. Association for Computational Linguistics, 1988.

K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
