
MSc Artificial Intelligence

Master Thesis

The Use of Semantics

for Text Style Transfer

by

Fije van Overeem

10373535

June 19, 2019

36 ECTS October 2018 - June 2019

Supervisors:

Prof. Dr. H. Haned

MSc. A. Lucic

Assessor:

Prof. Dr. M. de Rijke


Abstract

Recently, semantic representation languages have been shown to be useful for many NLP tasks: for example, they have improved the BLEU score on a Neural Machine Translation (NMT) task and can alleviate the data sparsity problem (Song et al. [53], Marcheggiani et al. [40]). Semantic representations can provide a model with extra information about the content of a text, and we therefore hypothesize that they can be particularly useful for the task of Text Style Transfer (TST), whose objective is to transfer stylistic aspects to a text while preserving its content. In this thesis, we explore the use of Abstract Meaning Representation (AMR) for a TST task on different Bible versions in English and their corresponding AMR graphs. We treat the TST task as a monolingual translation problem with parallel data, which allows us to use the architecture of a sequence-to-sequence model, a currently successful model type in NMT. We use the Dual2Seq model from Song et al. [53], which encodes the AMR graphs of the source sentences with a Graph Recurrent Network encoder and the source sentences themselves with an LSTM encoder. The decoder has an attention mechanism over both the source sentence and the graph. We evaluate our model quantitatively using BLEU, and qualitatively by conducting a human judgement study where respondents evaluate our results on content preservation and style transfer strength. We compare our results with the Seq2Seq baseline model from Song et al. [53], which is a standard attention-based encoder-decoder model without AMR graph encoding. We find that 1) the Dual2Seq model does not outperform our baseline model in terms of BLEU score, 2) the Dual2Seq model is evaluated to be better at content preservation than our baseline, and 3) our baseline model is evaluated to be better at style transfer strength. In conclusion, the Dual2Seq model does not make the same improvements over the baseline model as in the NMT experiments of Song et al. [53]. To the best of our knowledge, this is the first work to use AMR or any other semantic representation for a Text Style Transfer task.


Acknowledgements

First and foremost I wish to thank my supervisors Ana Lucic and Hinda Haned, for the great guidance and supervision they have given me over the past months. Thank you for the many discussions, suggestions and encouragement, and for letting me get to know and follow my own research interests.

Furthermore, my gratitude goes out to Ahold Delhaize and its HR Analytics team, for providing me with a work environment filled with smart and kind people.

I also would like to thank Maarten de Rijke for accepting the role as my thesis examiner. Many thanks as well to Marzieh Fadaee, Mostafa Dehghani and Maurits Bleeker, for offering me their help with the technical side of my project.

Lastly, I’m very thankful for the support of my parents and friends throughout my studies. In particular I would like to thank Daan van Stigt, who has been enormously supportive during the writing of this thesis.


Contents

1 Introduction
  1.1 Text Style Transfer
  1.2 Abstract Meaning Representation
  1.3 Contributions

2 Preliminaries
  2.1 Text Style Transfer
    2.1.1 Defining the Text Style Transfer task
    2.1.2 Evaluating Text Style Transfer
  2.2 Abstract Meaning Representation
    2.2.1 Introduction to AMR
    2.2.2 AMR Parsing
  2.3 Encoding Sequence vs Graph Structures

3 Related Work
  3.1 NMT with semantics
  3.2 Text Style Transfer on Bibles with Seq2Seq
  3.3 Other work in Text Style Transfer

4 Method
  4.1 Task Description
  4.2 Model

5 Experimental setup
  5.1 Data
    5.1.1 Bible texts
    5.1.2 AMR graphs
    5.1.3 Vocabularies
  5.2 Model hyperparameters
  5.3 Evaluation setup
    5.3.1 Quantitative measures
    5.3.2 Qualitative measures

6 Results
  6.1 Quantitative results
  6.2 Qualitative results
  6.3 Case study on data from another domain

7 Conclusion
  7.1 Contributions
  7.2 Future Work

Appendix A: Processing AMR graphs


1 Introduction

1.1 Text Style Transfer

There are innumerable ways in which we can communicate the same information using different formulations. Consider the following sentences, which are equivalent in meaning:

(1) There’s a lot of stuff to do in Amsterdam.

(2) The city of Amsterdam offers a wide range of cultural activities.

Sentence (1) is shorter, less formal, less descriptive and arguably easier to read than sentence (2). These are all stylistic aspects of how the content is presented. Text Style Transfer (TST) is the task of automatically rewriting a text in a certain style, while preserving its content. TST can contribute to many real-world applications: companies could write one advertisement text and automatically rephrase it to suit their different target groups, chatbots could be configured to generate text with assigned stylistic aspects, and a text simplifier could be built that makes more complex texts, such as legal documents or archaic prose, comprehensible to a broader public. In that context, Sadeh et al. [51] have argued there is a need for simplification of privacy policies, to which an appropriate TST model could offer a solution.

If we can disentangle style from meaning, we can detect text type, sentiment or bias in a text, and therefore also add or remove these aspects. This would open the door for automatic creation or complementation of datasets, for example complementing a highly biased movie review dataset for the training of a sentiment classifier algorithm (Jin et al. [33]).

The concept of style transfer has gained popularity due to its success with images. An example of this work is Gatys et al. [22], where the painting styles of different artists have been modelled and used for style transfer; see figure 1.

Figure 1: Example results of style transfer on a picture, where different paintings have been used as the style component. Published by Gatys et al. [22]. A: A photo by Andreas Praefcke. B: The Shipwreck of the Minotaur by J.M.W. Turner, 1805. C: The Starry Night by Vincent van Gogh, 1889. D: Der Schrei by Edvard Munch, 1893.

Although style transfer results for text have advanced greatly over the last few years, they still lag behind the results for images, both for supervised and unsupervised learning methods. Tikhonov and Yamshchikov [56] explain that this discrepancy is mostly due to the format of the data itself. Images consist of continuous pixel values, whereas text consists of discrete tokens. With images, one could alter a pixel to be ‘a little more blue’. With text however, one can generally not make a word ‘a little more formal’. Furthermore, the text used for style transfer is often a limited sequence of tokens, whereas an image can consist of thousands of pixels. It is therefore less expressive data with which to represent the style to be modelled.

In recent years, researchers have taken up multiple strategies for tackling the TST task. One of them is to treat the task as a monolingual Machine Translation problem, where we translate between two styles in one language instead of translating between two languages (Carlson et al. [9]). Automatic translation research often makes use of encoder-decoder sequence-to-sequence models, which have proven to be very powerful (Sutskever et al. [55], Cho et al. [11]). However, these models are learned in a supervised manner, making it necessary to possess parallel data. Having parallel data means having pairs of sentences, where each sentence in a pair represents the same content but in a different language, or in our case, a different style. Parallel data is scarce for regular language translation, and for TST in particular: there are few datasets in which written text is produced in multiple styles with an alignment between sentences from both styles available. Consequently, most work in TST either explores unsupervised methods (see Section 3), or investigates how supervised models can be improved such that we can take more advantage of the parallel data that does exist.

In this thesis, we take up the latter approach and therefore need to obtain one of the few available parallel datasets. One possible resource for such a dataset is the Bible (Carlson et al. [9]): a highly parallel dataset, because each of the ~30,000 verses is numbered. Besides the many languages in which the Bible is published, there exist dozens of different Bible versions in English. The Young’s Literal Translation (YLT) Bible, for example, was published in 1862 and has old-fashioned stylistic aspects, while the Bible in Basic English (BBE) was written to be comprehensible to the wider public and therefore has a simpler, more modern writing style. See table 1 for a sample comparing the two Bibles.

We will use different Bible versions as data for an existing supervised machine translation model, and investigate improving this model by adding semantic structure derived from our data, namely Abstract Meaning Representation (AMR). Encoder-decoder models in NMT are often a variant of LSTMs (Hochreiter and Schmidhuber [28]), because of their ability to capture word order and co-occurrence, which are important factors for language modelling. However, they are not explicitly designed to capture other dependencies, such as subject-verb-object relations or adjective-noun pairs, because those are sensitive to (implicit) syntactic structure and semantic roles (Chomsky [12], Everaert et al. [18]). Recent work on making these structures explicit, by leveraging syntactic trees or Semantic Role Labels (SRL), suggests doing so is in fact beneficial for numerous Natural Language Processing tasks, including Statistical Machine Translation (SMT) (Bazrafshan and Gildea [6]), Neural Machine Translation (NMT) (Bastings et al. [4]) and Abstractive Summarization (Liu et al. [38]).

One of the recently developed semantic representations is Abstract Meaning Representation. In this thesis, we will investigate whether the use of AMR improves the performance of TST with different Bible styles as parallel data.


Ezekiel 25:15
YLT  Thus said the Lord Jehovah: Because of the doings of the Philistines in vengeance, And they take vengeance with despite in soul, To destroy – the enmity age-during!
BBE  This is what the Lord has said: Because the Philistines have taken payment, with the purpose of causing shame and destruction with unending hate;

Ezekiel 25:16
YLT  Therefore, thus said the Lord Jehovah: Lo, I am stretching out My hand against the Philistines, And I have cut off the Cherethim, And destroyed the remnant of the haven of the sea,
BBE  The Lord has said, See, my hand will be stretched out against the Philistines, cutting off the Cherethites and sending destruction on the rest of the sea-land.

Ezekiel 25:17
YLT  And done upon them great vengeance with furious reproofs, And they have known that I Jehovah, In My giving out My vengeance on them!
BBE  And I will take great payment from them with acts of wrath; and they will be certain that I am the Lord when I send my punishment on them.

Table 1: Sample from two different Bible versions: Ezekiel 25:15-25:17, from The Young’s Literal Translation Bible and The Bible in Basic English. The texts are collected from the published datasets from Carlson et al. [9].

1.2 Abstract Meaning Representation

AMR is a semantic representation language in which the meaning of a sentence is described as a graph structure: nodes represent the sentence’s concepts, and directed edges represent relations between these concepts (Banarescu et al. [3]). By representing a sentence as an AMR graph, one abstracts away from some of the syntactic structure and the specific surface form of a sentence. AMR is agnostic to grammatical number (singular or plural) and tense, and articles are left out of the representation. Consider the AMR graph in figure 2, representing the sentence The man is eating cheese¹:

Figure 2: AMR graph of The man is eating cheese.

The verb to eat is here recognized as part of a proposition bank², hence we’re given some semantic information about this verb: the verb to eat can have two arguments, someone who eats (ARG0) and something that gets eaten (ARG1). This AMR graph could also represent the sentences The man eats or A man ate. We will give a more profound explanation of AMR in section 2.

¹ All graphs in this thesis are visualised with the AMR reader developed by Pan et al. [45].
² AMR makes use of the PropBank proposition bank (Palmer et al. [44]), which can be viewed as an extension of the Penn Treebank, a database providing among others a corpus of syntactic structures. The Proposition Bank adds semantic role labels to these syntactic structures.

When we have explicit information on semantic roles, we can provide NLP models with more information to accomplish their tasks. Intuitively, the more semantic information, the more constraints there are during the prediction of sentences and thus the easier the task is. For example, if an AMR graph says there is one object and one subject concept in a sentence, a model could infer that in its prediction, there will be one object and one subject concept as well.

Preferably, a model makes use of the information that the predicted sentence should have an AMR graph equivalent to that of its source sentence, albeit with separate vocabularies for the source and target side, since different styles make use of different vocabularies. We will further describe this in section 2.2. Research on NLP tasks with AMR includes Automatic Summarization (Liu et al. [38], Dohare et al. [16]), Question-Answering (Mitra and Baral [42]), Text Generation (Konstas et al. [37], Song et al. [54]) and Machine Translation (Song et al. [53]).

Song et al. [53] have developed a neural encoder-decoder model in which they adopt a sequence-to-sequence model and add Graph Recurrent Neural Networks to encode the AMR graphs. They demonstrated that including AMR in an NMT task improves results in terms of BLEU score, not only over models without semantic graphs, but also over models making use of other semantic representations such as Semantic Role Labeling (Gildea and Jurafsky [24]). They also found AMR to help alleviate the argument-switching problem in translation (Isabelle et al. [29]). This provides evidence that the model in fact learns from the semantics the AMR graphs provide.

1.3 Contributions

In this thesis, we will apply the NMT model of Song et al. [53], named Dual2Seq, for our TST task and investigate whether including AMR is beneficial for the quality of our TST results. We will do so by training Dual2Seq models with pairs of Bible datasets published by Carlson et al. [9]. To obtain the AMR graphs for the Bible verses, we use the JAMR parser (Flanigan et al. [20]). We train the sequence-to-sequence model of Song et al. [53] (Seq2Seq) as a baseline.

We compare the results of the Dual2Seq model with those of this baseline. Our evaluation has both objective and subjective components: we evaluate objectively using the BLEU score and subjectively using a human judgement study. By doing so, we aim to answer the following research questions:

RQ1: Does the Dual2Seq model improve the BLEU score of a sequence-to-sequence model without the use of AMR, when translating between Bible styles?

RQ2: Does the Dual2Seq model improve the BLEU score of a sequence-to-sequence model without the use of AMR, when translating from an unseen Bible style?

RQ3:
• Does the content of the Dual2Seq model predictions match their target sentences, as evaluated in a human judgement study?
• Does the style of the Dual2Seq model predictions match the target style more than the source style, as classified in a human judgement study?
• Is the style of the Dual2Seq model predictions distinctive enough to be classified as belonging to the target style domain, as classified in a human judgement study?

Consequently, we make the following contributions:

• We investigate the use of semantics for a TST task in general.

• We expand on the work of Carlson et al. [9] on using Bible versions for a TST task.

• We propose a qualitative method for evaluating a TST task.

To the best of our knowledge, we are the first to combine semantic graphs with sequence-to-sequence modeling for the TST task. We continue with a preliminaries section in section 2, where we expand our descriptions of the TST task and AMR. In section 3 we describe the NMT model with semantics by Song et al. [53], as well as the TST task with Bibles done by Carlson et al. [9]. We also describe some other work on the TST task, both with and without parallel datasets. In section 4 we describe our method, followed by a description of the experimental setup in section 5. In section 6 we report our results. We conclude and outline some ideas for future work in section 7.


2 Preliminaries

This thesis combines two topics within the NLP research field: Text Style Transfer and the use of semantics for NLP tasks. On both topics we will provide some background. Because TST is (currently) heavily reliant on techniques normally used for Machine Translation, this topic too will be covered. We start with a section on the TST task, followed by a section on AMR. We conclude with a section on the encoding of graph structures.

2.1 Text Style Transfer

2.1.1 Defining the Text Style Transfer task

Although we formally define the TST task in the method section, in this section we discuss a more conceptual definition of what TST should aim to do. This is not trivial: in fact, some previous work on TST has not followed a definition per se. Instead, the authors reverse-engineered their definition from the dataset they had under consideration, arguing that TST is defined by certain stylistic aspects, such as sentiment (Ghosh et al. [23]) or formality (Rao and Tetreault [48]).

The Machine Translation task does not suffer from the same definition problem: for translation, it is clear whether a sentence is written in a certain language, which is not the case for writing styles. For example, when translating from French to English, it is relatively easy to at least verify that the predicted sentence is written in the target language. The same does not hold for different styles: style domains can overlap, which makes this difficult. For example, it is obvious that the YLT Bible and the BBE Bible are written in different styles, but when looking at an individual sentence, it is not always easy to classify it as belonging to one or the other.

Tikhonov and Yamshchikov [56] have identified the problem of the lack of a style transfer definition and have proposed a solution. First of all, they argue not to call the work on the transferal of defined stylistic aspects ‘style transfer’, because although effective, the approach is not holistic and does not conform to the intuitive notion of style. Instead, they state that a style should meet at least the following two criteria:

• Style is an integral component of text that allows it to be assigned to a certain category or a sub-corpus. The choice of such categories or sub-corpora is to be task-dependent.

• Style has to be ‘orthogonal’ to semantics in the following way: any semantically relevant information could be expressed in any style.

We argue that some of our Bible source-target pairs do meet these criteria as complete texts; however, not every sentence pair independently meets the criteria, especially when the sentences are short. The shorter sentences are, the less expressive power they have to represent a certain writing style. Furthermore, some Bible pairs are very similar, for example the American Standard Version (ASV) and the King James Version (KJV), which differ only in some grammatical alterations, or the use of Jehovah instead of Lord. In addition, some sentences are identical. A sample is given in Table 2.


Ezekiel 1:3

ASV the word of Jehovah came expressly unto Ezekiel the priest, the son of Buzi, in the land of the Chaldeans by the river Chebar; and the hand of Jehovah was there upon him.

KJV The word of the Lord came expressly unto Ezekiel the priest, the son of Buzi, in the land of the Chaldeans by the river Chebar; and the hand of the Lord was there upon him.

BBE The word of the Lord came to me, Ezekiel the priest, the son of Buzi, in the land of the Chaldaeans by the river Chebar; and the hand of the Lord was on me there.

Table 2: Sample of the ASV and KJV Bible to demonstrate they are very alike in writing style, whereas the BBE Bible is different.

Consequently, for a fair comparison it is necessary to take the similarity between source and target text into account during evaluation. In a quantitative measure, this can be done by comparing the BLEU scores (Papineni et al. [46]) between each original source and target data pair.

2.1.2 Evaluating Text Style Transfer

To evaluate any TST task with parallel data, we make use of both quantitative and qualitative evaluation measures. In Machine Translation, the BLEU score is widely used to measure the quality of the model’s predicted sentences automatically. BLEU evaluates whether words (and n-gram combinations of them) from the target sentence also occur in the predicted sentence (Papineni et al. [46]).

Note that this is not a completely comprehensive method for estimating a model’s translation quality. For example, BLEU does not score for grammaticality, and counts (correctly) predicted synonyms of target words as incorrect. However, the BLEU score does correlate with human evaluation, provided the source and target pair are sufficiently different (Wubben et al. [58]), which makes it a good indication of the translation quality of our models trained on a Bible style pair with a relatively low BLEU score between the source and target side.

Another quantitative measure for the TST task is the PINC score (Chen and Dolan [10]), which measures the dissimilarity between the source and prediction sentences. We will report the PINC score as well, because it is an indication of the style transfer strength of the model. A model could score low on BLEU while the change in stylistic properties is large. However, a high PINC score might indicate many alterations to the source sentences, but that does not mean those alterations are in fact in the target’s style domain. We therefore argue that it is insightful to report the PINC score, but we do not directly infer information about the TST quality from it. Other methods for TST evaluation have been proposed as well, such as using a style-specific language model as a style classifier (Yang et al. [60]) or cosine similarity to measure content preservation (Fu et al. [21]). We leave it to future work to evaluate our TST models with such alternative measures.
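To make the two automatic measures concrete, the sketch below computes a simple n-gram overlap in the spirit of BLEU and a PINC-like dissimilarity score. It is an illustrative simplification (no brevity penalty, no smoothing), not the multi-bleu.perl or PINC.perl scripts used later in the thesis.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_precision(prediction, reference, n):
    """Fraction of prediction n-grams that also occur in the reference (clipped counts)."""
    pred, ref = Counter(ngrams(prediction, n)), Counter(ngrams(reference, n))
    overlap = sum(min(count, ref[gram]) for gram, count in pred.items())
    return overlap / max(sum(pred.values()), 1)

def pinc_like(prediction, source, max_n=4):
    """PINC-style score: average dissimilarity of the prediction w.r.t. the source over n-grams."""
    scores = [1.0 - ngram_precision(prediction, source, n) for n in range(1, max_n + 1)]
    return sum(scores) / len(scores)

src = "there is a lot of stuff to do in amsterdam".split()
tgt = "the city of amsterdam offers a wide range of activities".split()
pred = "the city of amsterdam offers a lot of activities".split()

print("BLEU-like bigram precision vs target:", round(ngram_precision(pred, tgt, 2), 3))
print("PINC-like dissimilarity vs source:", round(pinc_like(pred, src), 3))
```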


2.2 Abstract Meaning Representation

2.2.1 Introduction to AMR

In this section we describe AMR. Abstract Meaning Representation was first introduced by Banarescu et al. [3], although some form of it already appeared in Dorr et al. [17]. The goal of Banarescu et al. [3] has been to obtain a large-scale semantic bank with semantic annotations, just as the Penn Treebank has been a resource for syntactic annotation for many NLP tasks. We have previously explained some of the fundamentals of AMR, mostly following the guidelines for AMR annotation³ as published by Banarescu et al. [3]. We will continue to do so in this section.

An AMR graph G consists of (V, E), with the nodes V representing the concepts and the edges E representing the relations between these concepts. An AMR graph is a rooted, directed, acyclic graph. It is rooted because the graph has one top concept from which all other concepts are (indirectly) linked. It is directed because the edges point in one direction, and it is acyclic, meaning no cycles in the graph are allowed. Each AMR graph represents one sentence. AMR has an unorthodox way of representing negation and makes use of recurrent variables, both of which we will illustrate by extending our simple example The man is eating cheese from section 1. We will represent our AMR graph in PENMAN notation (Bateman [5]), which is how an AMR graph is represented for processing.

(e / eat-01 :ARG0 (m / man) :ARG1 (c / cheese))

Figure 3: The AMR graph of The man is eating cheese, visualized in a graph and in PENMAN notation.

In this graph we have three events/concepts:
• (to) eat
• man
• cheese

Furthermore,
• ":ARG0" and ":ARG1" are roles.
• e, m and c are variables.
• m / man indicates that m is an instance of the concept man.

Concepts are either English words, PropBank framesets (e.g. the 01 in eat-01 means the verb is recognized in PropBank) or special keywords, like dates, places, quantities or logical conjunctions (Banarescu et al. [3]). The root concept should be the main assertion of the sentence. This is often the head of the sentence (Corbett et al. [14]), which in turn is often its main verb.
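As an illustration of the structure described above, the sketch below represents the example AMR as a rooted, directed graph with concept nodes and labelled edges. It is a minimal hand-rolled data structure for exposition only, not the representation used by the JAMR parser or the Dual2Seq code.

```python
from dataclasses import dataclass, field

@dataclass
class AMRGraph:
    root: str                                        # variable of the top concept
    instances: dict = field(default_factory=dict)    # variable -> concept, e.g. "e" -> "eat-01"
    edges: list = field(default_factory=list)        # (source variable, relation, target variable)

# (e / eat-01 :ARG0 (m / man) :ARG1 (c / cheese))
graph = AMRGraph(
    root="e",
    instances={"e": "eat-01", "m": "man", "c": "cheese"},
    edges=[("e", ":ARG0", "m"), ("e", ":ARG1", "c")],
)

# Reusing a variable as an edge target is how AMR expresses recurrence,
# e.g. the recurring "m" in the "cheese that he wants" example below.
for src, rel, tgt in graph.edges:
    print(f"{graph.instances[src]} -{rel}-> {graph.instances[tgt]}")
```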


If we want to represent The man is not eating cheese, we get the following AMR graph, where a polarity sign is added to the graph:

(e / eat-01 :polarity - :ARG0 (m / man) :ARG1 (c / cheese))

If we want to represent a sentence with a recurrent variable, for example the sentence The man is not eating the cheese that he wants, we get the following graph:

(e / eat-01 :polarity - :ARG0 (m / man) :ARG1 (c / cheese :ARG0 (w / want-01) :ARG1 m))

Here we see that the variable m is recurring in the subordinate clause. It is intuitive that text generation models can be helped by using this explicit recurrence information for correctly predicting who does what to whom in a text.

Furthermore, AMR uses the proposition bank to recognize named entities, which are represented as constants. For example, consider this more complex sentence with its corresponding AMR graph, illustrated in figure 4, where B-612 is recognized as the name of an asteroid:

Figure 4: AMR graph of the sentence "I have serious reason to believe that the planet from which the little prince came is the asteroid known as B-612."

This sentence is part of the translation of Le Petit Prince, a 1,564-sentence long book that has been completely annotated into AMR graphs⁴ and is estimated to be the second-most translated work in the world⁵, after the Bible.

⁴ The complete annotated Little Prince and many more resources can be found at https://amr.isi.edu/download.html.
⁵ Many of these translations can be found at http://www.petit-prince-collection.com/lang/collection.php?lang=nl

We can read the meaning of this AMR graph as follows: the root node of the graph is a ‘cause’ event: there is a ‘serious reason’ causing ‘I’ to believe something, defined in another subtree. This subtree consists of a ‘little prince’ ‘coming’ from/of/on a ‘planet’, which is an ‘asteroid’ named/called/being/known as ‘B-612’. B-612 is recognized as a named entity by the Proposition Bank, hence the blue square. We end up with a sentence equivalent to There is serious reason causing I to believe the little prince to come from a planet, which is an asteroid named B-612.

An AMR graph can be viewed as a representation of multiple sentences and, in fact, AMRs have proven useful for a paraphrase detection task (Issa et al. [30]). In other words, Marcheggiani et al. [40] argue that “semantic representations provide an abstraction which can generalize over different surface realizations of the same underlying ‘meaning’”. In the light of our TST task, we could view these ‘surface realizations’ as different stylistic texts with the same meaning. In our task with parallel data, this would mean that the source and the target sentences are represented by the same AMR graph. This however is not exactly the case. First, existing AMR parsers do not produce perfect graphs and often make mistakes when unseen synonyms or ambiguous words are involved. Second, the meaning of a sentence is always subject to context, and since we create AMR graphs per sentence, this context cannot always be taken into account (Issa et al. [30]). Lastly, even after abstracting, AMR graphs still capture some of the original sentence structure, while a paraphrase is allowed to have a different sentence structure (compare for example the sentences All the 30 students were sitting in my classroom and My class of 30 students was complete today).

2.2.2 AMR Parsing

For our task we need AMR graphs of our source sentences. Rather than annotating our dataset by hand, a reasonable alternative for obtaining these is to make use of a parser that automatically parses our sentences into AMR graphs. Like dependency parsing, AMR parsing is a research field of its own. AMR parsers are usually trained on hand-annotated data. The quality of the parsed AMR graphs is important for the performance of the NLP task at hand. To illustrate this, Song et al. [53] have performed an experiment with their translation task on hand-annotated AMR graphs, yielding a BLEU score improvement of 4% over the model trained on parsed AMR graphs.

Work on AMR parsing includes that of Flanigan et al. [20], Lyu and Titov [39], Konstas et al. [37] and Damonte et al. [15]. This work has led to multiple published AMR parsers, including CAMR (Wang et al. [57]), AMR-eager (Damonte et al. [15]), NeuralAMR (Konstas et al. [37]) and JAMR (Flanigan et al. [20]), of which the latter is the most popular.

Song et al. [53], who developed the model we use in this thesis, use the JAMR parser⁶ and we will use it as well. The parser obtains state-of-the-art SMATCH⁷ results, and is relatively user-friendly. Guo and Lu [26] discuss some limitations of the alignment part (the phase where the parser first matches the sentences’ words to their AMR graph’s nodes and edges) of the JAMR parser. Firstly, the parser often makes mistakes when a word in a sentence occurs more than once. Secondly, the JAMR aligner performs worse when corpora get larger, yielding empty alignment information.

⁶ https://github.com/jflanigan/jamr
⁷ SMATCH is an evaluation metric specifically designed for the evaluation of AMR parsing quality (Cai and Knight).

Not only the parser itself is of importance for the quality of the parsed AMR graphs, but also the domain the parser is trained on. An example that highlights this importance is given in Appendix A.

2.3 Encoding Sequence vs Graph Structures

Advances in graph encoding have allowed deep learning models to take advantage of graph structures in addition to purely sequential data (Kipf and Welling [35]). Therefore, we briefly discuss graph structure encoding versus the more common sequence encoding.

In earlier deep learning models with graphs, such as that of Konstas et al. [37], an RNN encoder is used to encode the graph information. However, RNNs require the graph to be linearized because they can only process sequential data, which loses much of the structural information of the graph. For example, closely related graph nodes can end up far away from each other in a linearized graph, making it hard for an RNN to model their relation.

For a more appropriate method to encode graph structures, Kipf and Welling [35] have proposed Graph Convolutional Networks (GCNs), with which they successfully demonstrated that their encoding method improves results for a link prediction task. The advantage of these models is that this type of encoder only requires relative ordering information, making it general and flexible across many data structures. Consequently, Marcheggiani et al. [40] have adopted GCNs for the encoding of Semantic Role Labels and Bastings et al. [4] have used GCNs for the encoding of dependency trees. Both have achieved improved BLEU scores on their NMT tasks.
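To make the contrast with sequence encoding concrete, here is a minimal sketch of one GCN-style propagation step over an adjacency matrix, in the spirit of Kipf and Welling [35]. The simple row normalisation and single-layer setup are assumptions made for illustration, not the exact formulation used in the cited work.

```python
import numpy as np

def gcn_layer(adjacency, node_features, weights):
    """One graph-convolution step: aggregate neighbour features, then a linear map + ReLU."""
    a_hat = adjacency + np.eye(adjacency.shape[0])      # add self-loops
    deg_inv = np.diag(1.0 / a_hat.sum(axis=1))          # simple row normalisation
    hidden = deg_inv @ a_hat @ node_features @ weights  # neighbourhood aggregation
    return np.maximum(hidden, 0.0)                      # ReLU

# Toy graph for (e / eat-01 :ARG0 (m / man) :ARG1 (c / cheese)): eat-01 -> man, eat-01 -> cheese
adjacency = np.array([[0, 1, 1],
                      [0, 0, 0],
                      [0, 0, 0]], dtype=float)
node_features = np.random.randn(3, 8)   # one 8-dimensional embedding per concept node
weights = np.random.randn(8, 4)

print(gcn_layer(adjacency, node_features, weights).shape)   # (3, 4)
```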

Inspired by GCN’s, Song et al. [54] have proposed Graph Recurrent Neural Networks (GRN’s) for encoding a graph for their AMR-to-text model, yielding outperforming BLEU scores of Konstas et al. [37]’s linear encoder model on the same dataset. Our model from Song et al. [53], Dual2Seq, makes use of the same GRN’s to encode our AMR graphs. A formal description of this model follows in section 4.


3 Related Work

To the best of our knowledge, there is no work done yet on TST with AMR, or any other semantic formalism. The model we use in this thesis, Dual2Seq (Song et al. [53]), was originally used for an NMT task with AMR, which we will therefore discuss in this section. Furthermore, we follow the experimental setup of the TST task done by Carlson et al. [9] as much as possible and therefore also view this as related work. Consequently, we will first describe both their work in this section, followed by a brief summary of other work on TST. In the latter we will make a distinction between models trained on parallel data and models that are trained on non-parallel data.

3.1 NMT with semantics

As previously discussed, Song et al. [53] make use of GRNs to encode AMR graphs for a Machine Translation task. They compare their model, Dual2Seq, with multiple baselines. These baselines include (1) the use of semantic role labels instead of AMR, (2) the use of dependency trees instead of AMR and (3) a model without any additional semantic or syntactic information: a plain, attention-based sequence-to-sequence model. In all experiments, the Dual2Seq model outperformed the other models in terms of BLEU score, both on a small dataset (234k sentences) and a large dataset (4.5M sentences). The performance gain was relatively larger when training on the small dataset, suggesting that the AMR graphs can help alleviate data sparsity. They have also demonstrated that in some cases, the semantic structure helped create a more grammatical translation. The Dual2Seq model also outperformed the model of Konstas et al. [37] on their NMT task, who also used AMR graphs as semantics, but linearized the graph structures for a regular sequence encoder instead of making use of a graph encoder like the GRN.

3.2 Text Style Transfer on Bibles with Seq2Seq

The idea to use the many existing Bible versions as a source of parallel datasets for NLP tasks is not new (Nida [43], Resnik et al. [50], Christodouloupoulos and Steedman [13]): given the 1000+ languages in which the Bible exists, it is a great source for translation research. However, Carlson et al. [9] were the first to identify these Bible versions as a resource for a TST task.

In their research they aim to demonstrate the usefulness of these Bible versions for TST, and train the SMT model Moses (Hoang and Koehn [27]) and TensorFlow’s NMT model Seq2Seq (Britz et al. [7], Abadi et al. [1]) on dozens of different Bible versions in English. This has resulted in a baseline to encourage the development of TST research. They have reported the BLEU and PINC scores achieved by the models trained on different Bible pairs and have published⁸ a part of their pre-processed datasets as well. The models they use are state-of-the-art, as are their methods of preprocessing the data (tokenization and byte-pair encoding).

However, Carlson et al. [9] only publish one-to-one pair results from the Moses model, which is an SMT model instead of an NMT model. For a fair comparison, we therefore use the sequence-to-sequence model provided by Song et al. [53] as a baseline model.

⁸ https://github.com/keithecarlson/StyleTransferBibleData

3.3 Other work in Text Style Transfer

We dedicate this section to describing other work done on TST, and make a distinction between supervised and unsupervised work. We note that when parallel data is used, TST work overlaps with work on text generation while controlling style or stylistic aspects. Apart from Carlson et al. [9]’s work with Bible versions, work on TST with parallel datasets includes work with Shakespeare texts (using an original and a modern version) by Xu et al. [59] and Jhamtani et al. [32]. Jang et al. [31] also experimented with TST on Shakespeare texts, and they proposed a new method for acquiring parallel datasets as well: they obtained a database of rap lyrics and translated them to French and back to English. Ficler and Goldberg [19] have done work on language generation with control over linguistic style aspects. They view style as a set of stylistic aspects such as professionalism, length, descriptiveness and sentiment. They created their own dataset by labeling movie reviews automatically based on whether the author was a professional reviewer, sentence length, the percentage of adjectives used, et cetera.

In unsupervised work on TST, no parallel data is used and a model is obtained through other methods. One example of such work is Shen et al. [52], who assume a dataset with one type of content, but in two different styles, where the sentences need not be parallel; for example, a collection of movie reviews separated into positive and negative reviews. Assuming a shared latent space that is style-independent, they learn style-dependent encoders and decoders to and from this space. The model is trained so that the data distributions of the decoders match the data distributions of the style they decode to, which is done by adversarial training (Goodfellow et al. [25]).


4 Method

4.1 Task Description

We formalize the problem of TST as a translation task between parallel, aligned datasets. We follow the notation of Prabhumoye et al. [47]: assume datasets $X_{sent} = \{x^{(1)}, \ldots, x^{(n)}\}$ and $Y_{sent} = \{y^{(1)}, \ldots, y^{(n)}\}$, where $x^{(i)}$ and $y^{(i)}$ are sequences of words and are assumed to entail the same intent. The two datasets $X_{sent}$ and $Y_{sent}$ represent two different styles $s_1$ and $s_2$, respectively, but their meaning is the same.

We train a model to generate $Y_{sent}$, given $X_{sent}$. We denote the output of dataset $X_{sent}$ transferred to style $s_2$ as $\hat{X}_{sent} = \{\hat{x}^{(1)}, \ldots, \hat{x}^{(n)}\}$. We define a source sentence of length $N$ as $x^{(i)} = (x^{(i)}_1, \ldots, x^{(i)}_N)$ and a target sentence of length $M$ as $y^{(i)} = (y^{(i)}_1, \ldots, y^{(i)}_M)$.

During training, our objective is to minimize the cross-entropy loss over the target set, which is equal to the negative log-likelihood of the target sentence:

$$\mathcal{L}(\theta) = -\sum_{m=1}^{M} \log p_\theta\left(y^{(i)}_m \mid y^{(i)}_{m-1}, \ldots, y^{(i)}_1, x^{(i)}_{sent}\right), \qquad (1)$$

with $\theta$ representing the parameters of our model.

In our model, we add AMR graphs to the task. We assume $Z_{AMR}$ to be the set of AMR graphs, where $z^{(i)}$ is the AMR graph of $x^{(i)}$. Consequently, our loss function becomes:

$$\mathcal{L}(\theta) = -\sum_{m=1}^{M} \log p_\theta\left(y^{(i)}_m \mid y^{(i)}_{m-1}, \ldots, y^{(i)}_1, x^{(i)}_{sent}, z^{(i)}_{AMR}\right). \qquad (2)$$
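As a concrete, framework-agnostic illustration of the objective in equations (1) and (2), the sketch below computes the negative log-likelihood of one target sentence given per-step model probabilities. The toy probability table stands in for $p_\theta$, which in the actual model is produced by the decoder described in section 4.2.

```python
import math

def negative_log_likelihood(target_tokens, step_probabilities):
    """Cross-entropy loss of one target sentence.

    step_probabilities[m] is assumed to be the model's distribution over the
    vocabulary at step m, conditioned on the gold prefix y_1..y_{m-1} and the
    source sentence (and, for Dual2Seq, also the source AMR graph).
    """
    loss = 0.0
    for token, distribution in zip(target_tokens, step_probabilities):
        loss -= math.log(distribution[token])
    return loss

# Toy example: a three-token target sentence and made-up per-step distributions.
target = ["the", "man", "eats"]
probs = [
    {"the": 0.7, "a": 0.3},
    {"man": 0.6, "cheese": 0.4},
    {"eats": 0.5, "sleeps": 0.5},
]
print(round(negative_log_likelihood(target, probs), 3))
```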

4.2 Model

We use the model from Song et al. [53] without alterations9. This is an encoder-decoder architecture that consists of two types of encoders: a BiLSTM encoder (Bahdanau et al. [2]) to encode the sentences, and a GRN (Song et al. [54]) to encode the AMR graphs. The decoder has access to the representations from both encoders. The decoder predicts the sentence on the target side, and is a doubly-attentive LSTM, meaning that it has an attention mechanism over both the sequence and the graph. In figure 5 the overall architecture of the model is displayed. We will describe our complete model in the following sections.


Figure 5: The overall architecture of the model, with English-to-German example data instead of different Bible versions. Picture originally from Song et al. [53]. The blue lines show the attention over the words in the sequence, the red line shows the attention over the nodes from the graph.

BiLSTM encoder

For the encoding of the source side sentences, a bi-directional LSTM with attention (Bahdanau et al. [2]) is used. We mostly follow Britz et al. [7]’s description of their model. In a BiLSTM, an encoder reads the input sentence, a sequence of vectors $x^{(i)} = (x^{(i)}_1, \ldots, x^{(i)}_N)$, into a sequence of hidden states. This is done with both a forward and a backward encoder function, $\overrightarrow{f}$ and $\overleftarrow{f}$. The forward function $\overrightarrow{f}$ reads the input sequence in its original order $(x^{(i)}_1, \ldots, x^{(i)}_N)$ and calculates a sequence of forward hidden states $(\overrightarrow{h}_1, \ldots, \overrightarrow{h}_N)$. The backward function $\overleftarrow{f}$ reads the sequence $x^{(i)}$ in the reverse order to calculate a sequence of backward hidden states. The forward and backward sequences of hidden states are passed on to the decoder.
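The following PyTorch sketch shows what such a bidirectional sentence encoder can look like. It is a minimal stand-in for illustration (the embedding size, hidden size and the absence of padding/packing are assumptions), not the actual implementation released by Song et al. [53].

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Encodes a token-id sequence into concatenated forward+backward hidden states."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=500):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        hidden_states, _ = self.lstm(embedded)    # (batch, seq_len, 2 * hidden_dim)
        return hidden_states

encoder = BiLSTMEncoder(vocab_size=1000)
dummy_batch = torch.randint(0, 1000, (2, 7))      # two sentences of length 7
print(encoder(dummy_batch).shape)                 # torch.Size([2, 7, 1000])
```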

Graph Recurrent Network encoder

Song et al. [54] propose a GRN encoder for encoding the AMR part of the sentence. Let $G = (V, E)$ be an AMR graph, with $V$ representing the nodes and $E$ the edges. A hidden state vector $a^j$ is used to represent each node $v_j \in V$. Consequently, the whole graph $g$ can be represented as $g = \{a^j\}|_{v_j \in V}$.

The GRN encoder does not encode the graph as a sequence of nodes, but instead encodes the entire structure of the graph. The encoding is obtained as the final state of a sequence of ‘graph transitions’, in which messages are passed between the nodes. This allows for the encoding of non-local information, as well as the encoding of the edges in the graph.

The intuition is as follows: at each graph transition, information is passed from each direct neighbour to each node. Consequently, the number of graph transitions influences the distance over which nodes can communicate. For example, with 5 graph transitions, a node receives information from (indirect) neighbour nodes that are at most 5 nodes away. This is how the GRN models global information.

More formally, the sequence of state transitions is denoted as $\{g_0, g_1, \ldots, g_t, \ldots, g_T\}$, with $g_t = \{a^j_t\}|_{v_j \in V}$. At each graph state transition (from $g_{t-1}$ to $g_t$), each node's hidden state vector $a^j_t$ shares its information with its directly connected neighbours.


This is done with LSTM cells (Hochreiter and Schmidhuber [28]), hence via input, output and forget gates ($i^j_t$, $o^j_t$, $f^j_t$ respectively). Each hidden state vector $a^j_t$ has a cell $c^j_t$ to store memory.

Figure 6: Graph state LSTM, picture originally from Song et al. [53].

Furthermore, the edges between the nodes are represented as $x^l_{i,j}$, where $l$ is the edge label, $i$ is the source node index and $j$ is the target node index (together the triple $(i, j, l)$). More specifically, the representation of an edge $(i, j, l)$ is defined as:

$$x^l_{i,j} = W_e([e_l; e_{v_i}]) + b_e,$$

where $e_l$ is the embedding of edge label $l$ and $e_{v_i}$ the embedding of $v_i$, the source node. $W_e$ and $b_e$ are trainable parameters.

The representations of input nodes, output nodes, incoming edges and outgoing edges are each first summed per group before they go through the cell and gate nodes:

$$\phi^i_j = \sum_{(i,j,l) \in E_{in}(j)} x^l_{i,j} \qquad \phi^o_j = \sum_{(j,k,l) \in E_{out}(j)} x^l_{j,k} \qquad \psi^i_j = \sum_{(i,j,l) \in E_{in}(j)} a^i_{t-1} \qquad \psi^o_j = \sum_{(j,k,l) \in E_{out}(j)} a^k_{t-1},$$

with $E_{in}(j)$ and $E_{out}(j)$ denoting the sets of incoming and outgoing edges of $v_j$. A state transition is then computed as:

$$i^j_t = \sigma(W_i \phi^i_j + \hat{W}_i \phi^o_j + U_i \psi^i_j + \hat{U}_i \psi^o_j + b_i) \qquad (3)$$
$$o^j_t = \sigma(W_o \phi^i_j + \hat{W}_o \phi^o_j + U_o \psi^i_j + \hat{U}_o \psi^o_j + b_o) \qquad (4)$$
$$f^j_t = \sigma(W_f \phi^i_j + \hat{W}_f \phi^o_j + U_f \psi^i_j + \hat{U}_f \psi^o_j + b_f) \qquad (5)$$
$$u^j_t = \tanh(W_u \phi^i_j + \hat{W}_u \phi^o_j + U_u \psi^i_j + \hat{U}_u \psi^o_j + b_u) \qquad (6)$$
$$c^j_t = f^j_t \odot c^j_{t-1} + i^j_t \odot u^j_t \qquad (7)$$
$$a^j_t = o^j_t \odot \tanh(c^j_t), \qquad (8)$$

with $W_*$, $\hat{W}_*$, $U_*$, $\hat{U}_*$ and $b_*$ model parameters and $\sigma$ the sigmoid function. These hidden states, from both the BiLSTM and the GRN encoder, are passed on to the decoder. In figure 6 the architecture of the GRN is displayed.
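To make one ‘graph transition’ tangible, the sketch below implements a simplified message-passing step in NumPy: each node sums its neighbours’ previous states and mixes them into its own state. It deliberately drops the LSTM gating, the edge-label embeddings and the incoming/outgoing distinction of equations (3)-(8), so it illustrates the recurrence over graph states rather than the Dual2Seq GRN itself.

```python
import numpy as np

def graph_transition(states, edges, w_self, w_neigh):
    """One simplified graph state transition: a_t^j = tanh(W_s a_{t-1}^j + W_n * sum of neighbour states)."""
    new_states = np.zeros_like(states)
    for j in range(states.shape[0]):
        neighbours = [i for (i, k) in edges if k == j] + [k for (i, k) in edges if i == j]
        message = sum((states[i] for i in neighbours), np.zeros(states.shape[1]))
        new_states[j] = np.tanh(states[j] @ w_self + message @ w_neigh)
    return new_states

# Toy AMR graph: eat-01 (node 0) -> man (node 1), eat-01 (node 0) -> cheese (node 2)
edges = [(0, 1), (0, 2)]
dim = 8
states = np.random.randn(3, dim)
w_self, w_neigh = np.random.randn(dim, dim), np.random.randn(dim, dim)

# After T transitions, a node has received information from nodes up to T hops away.
for _ in range(5):
    states = graph_transition(states, edges, w_self, w_neigh)
print(states.shape)   # (3, 8)
```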

Doubly-attentive LSTM decoder

The hidden states from our BiLSTM encoder are passed to an attention-based decoder (Bahdanau et al. [2]). The attention mechanism allows the decoder to give more ‘attention’ to some words in the source sentence. More specifically, the attention is an additional layer which is computed from the current hidden state of the decoder and the hidden states of the source sentence. This attention vector has the same size as the source sentence. The vector is normalised and provides a distribution over the words in the source sentence that are relevant for the current target word prediction.

In this model, the attention-based decoder is extended such that there is an attention mechanism over the graph part as well. This is done by defining a context vector based on the last graph state from the GRN:

$$g_T = \{a^j_T\}|_{v_j \in V} \qquad (9)$$
$$\tilde{\epsilon}_{m,i} = \tilde{v}_2^{T} \tanh(W_a a^i_T + \tilde{W}_s s_m + \tilde{b}_2) \qquad (10)$$
$$\tilde{\alpha}_{m,i} = \frac{\exp(\tilde{\epsilon}_{m,i})}{\sum_{j=1}^{N} \exp(\tilde{\epsilon}_{m,j})} \qquad (11)$$
$$\tilde{\zeta}_m = \sum_{i=1}^{N} \tilde{\alpha}_{m,i} \, a^i_T, \qquad (12)$$

with $W_a$, $\tilde{W}_s$, $\tilde{v}_2$ and $\tilde{b}_2$ model parameters and $\tilde{\zeta}_m$ the context vector for the graph side. With $\zeta_m$ the context vector for the source sentence, the probability distribution over the target vocabulary becomes:

$$P_{vocab} = \mathrm{softmax}(V_3 [s_m, \zeta_m, \tilde{\zeta}_m] + b_3). \qquad (13)$$

After training the model, we can use it to predict $\hat{X}_{sent}$ for a given input $X_{sent}$.
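The sketch below shows the core of such a doubly-attentive step in NumPy: one attention distribution over sequence states, one over graph states, and a softmax over the vocabulary that conditions on both context vectors. Dimensions and initialisation are made up for illustration; the real decoder is the LSTM-based implementation of Song et al. [53].

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(states, decoder_state, w, v):
    """Additive attention: score each state against the decoder state, return the context vector."""
    scores = np.array([v @ np.tanh(w @ np.concatenate([s, decoder_state])) for s in states])
    weights = softmax(scores)
    return weights @ states

dim, vocab = 8, 20
seq_states = np.random.randn(7, dim)     # BiLSTM states of the source sentence
graph_states = np.random.randn(3, dim)   # final GRN states of the AMR nodes
s_m = np.random.randn(dim)               # current decoder state

w_seq, v_seq = np.random.randn(dim, 2 * dim), np.random.randn(dim)
w_graph, v_graph = np.random.randn(dim, 2 * dim), np.random.randn(dim)
v3 = np.random.randn(vocab, 3 * dim)

zeta = attend(seq_states, s_m, w_seq, v_seq)               # sequence context vector
zeta_tilde = attend(graph_states, s_m, w_graph, v_graph)   # graph context vector
p_vocab = softmax(v3 @ np.concatenate([s_m, zeta, zeta_tilde]))
print(p_vocab.shape, round(p_vocab.sum(), 3))              # (20,) 1.0
```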


5 Experimental setup

A flowchart for our complete experimental setup is given in figure 7.

Figure 7: Flowchart of the experimental setup. The main steps are: collect the Bible datasets; parse the Bible verses with the JAMR parser and simplify the resulting AMR graphs; split the tokenized and byte-pair-encoded Bible verses with their AMRs into train, development and test sets; extract a vocabulary over the AMR nodes plus all training sets from each Bible; pair up Bibles into separate train/dev/test datasets containing verse ID + source AMR + source verse + target verse; train the Seq2Seq and Dual2Seq models; decode the test set with the trained model; undo tokenization and BPE on the test predictions; and evaluate the results with the BLEU and PINC measures and with human judgement.


5.1 Data

5.1.1 Bible texts

For our experiments, we use a part of the data published by Carlson et al. [9]. We perform experiments with each of the published, pre-processed Bible pairs in an A-to-B experiment as well as in a B-to-A experiment. The Bibles we use are the Young’s Literal Translation Bible (YLT), the King James Version (KJV), the American Standard Version (ASV) and the Bible in Basic English (BBE). For these Bibles, the verse numbers have been removed, and they were split into train, development and test parts. Each split contains complete Bible books: the different sections in the Bible named from ‘Acts’ to ‘Zephaniah’. Before training, the data has been pre-processed by tokenization with the Moses tokenizer¹⁰ and byte-pair encoding¹¹, to alleviate the problem of rare words. See table 3 for the Bible styles we use. The Bible pairs have been chosen, based on the BLEU scores between the source and target Bibles, such that a variety of expected style transfer difficulty was created. For example, the ASV and the KJV Bible are very alike (BLEU score 69.09) and thus make for a simple TST task, whereas YLT and BBE (BLEU 9.42) are very distinct in style and make for a more difficult task. For the BLEU scores between all the different Bible styles, see table 4.

¹⁰ http://www.statmt.org/moses/
¹¹ https://github.com/rsennrich/subword-nmt

Bible pair     Train/Dev/Test sentences
YLT & BBE      27,595 / 1,642 / 1,835
ASV & KJV      27,608 / 1,637 / 1,835
ASV & BBE      27,585 / 1,637 / 1,835

Table 3: The number of sentences in each Bible pair.

        YLT      ASV      BBE      KJV
YLT     100      25.87    9.42     23.61
ASV     26.48    100      18.72    69.09
BBE     11.72    22.75    100      21.80
KJV     23.89    68.72    17.76    100

Table 4: BLEU scores between Bible pairs, as reported by Carlson et al. [9]. The vertical axis represents the source text and the horizontal the target text.
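The pairing step from figure 7 can be pictured as in the sketch below: verses from two Bible versions are aligned on their verse ID and written out together with the source-side AMR. The tab-separated layout and helper names are assumptions for illustration; the released datasets of Carlson et al. [9] and the Dual2Seq input format may differ.

```python
def pair_up(source_verses, target_verses, source_amrs):
    """Align two Bible versions on verse ID and attach the source AMR to each pair.

    All three arguments are assumed to be dicts keyed by verse ID,
    e.g. source_verses["Ezekiel 25:15"] = "Thus said the Lord Jehovah: ..."
    """
    records = []
    for verse_id in sorted(source_verses.keys() & target_verses.keys() & source_amrs.keys()):
        records.append((verse_id, source_amrs[verse_id],
                        source_verses[verse_id], target_verses[verse_id]))
    return records

ylt = {"Ezekiel 25:17": "And done upon them great vengeance with furious reproofs , ..."}
bbe = {"Ezekiel 25:17": "And I will take great payment from them with acts of wrath ; ..."}
amrs = {"Ezekiel 25:17": "( vengeance :mod ( great ) ... )"}   # simplified AMR, placeholder

for verse_id, amr, src, tgt in pair_up(ylt, bbe, amrs):
    print("\t".join([verse_id, amr, src, tgt]))
```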

5.1.2 AMR graphs

For each Bible sentence in our experiments, we have obtained its corresponding AMR graph with the JAMR parser, trained on the LDC2015E68 corpus¹². For parsing we have followed the JAMR guidelines: the parser takes cased, untokenized sentences. For pre-processing the AMR graphs into the right format for the Dual2Seq model, the AMR graphs have been simplified with the AMR simplifier released by Song et al. [54]¹³, which follows NeuralAMR’s simplifier (Konstas et al. [37]). Among other things, the simplifier removes variables and parentheses from leaves, and creates a simpler representation for time and date stamps. Therefore, at least some information from the AMR graph has been lost, but it is hard to say exactly what the impact of this is. For an example of what a simplification looks like, see Appendix A.

¹² This is a dataset with sentences from the News/Forum domain, with their hand-annotated AMR graphs.
¹³ https://www.cs.rochester.edu/~lsong10/downloads/amr_simplifier.tgz
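A rough idea of what such a simplification does is sketched below: variables and slashes are stripped so that only concepts, relations and brackets remain. This is a regex-based approximation written for illustration only; the actual simplifier released by Song et al. [54] handles more cases (dates, quantities, named entities) and may produce different output.

```python
import re

def simplify_amr(penman_string):
    """Very rough AMR simplification: drop 'variable /' prefixes and collapse leaf parentheses."""
    no_vars = re.sub(r"\(\s*\w+\s*/\s*", "( ", penman_string)          # "(e / eat-01" -> "( eat-01"
    no_leaf_parens = re.sub(r"\(\s*([\w-]+)\s*\)", r"\1", no_vars)     # "( man)" -> "man"
    return re.sub(r"\s+", " ", no_leaf_parens).strip()

amr = "(e / eat-01 :ARG0 (m / man) :ARG1 (c / cheese))"
print(simplify_amr(amr))   # ( eat-01 :ARG0 man :ARG1 cheese)
```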

5.1.3 Vocabularies

In each experiment, we use a shared vocabulary; this means that our vocabulary for encoding the sentences and AMRs is the same as the one used for decoding. Our development experiments have demonstrated this to give the best results. In addition, Carlson et al. [9], who have done experiments on the same datasets, use a shared vocabulary as well. In fact, for our vocabularies, we build on the vocabulary provided by Carlson et al. [9], which consists of the 28,812 most frequent words from all train sets of our Bible styles. Furthermore, we have created separate vocabulary files for each Bible style, where we merged the AMR node tokens from that Bible style with the general vocabulary. The edge labels are extracted into a separate vocabulary file during training. After creating the vocabulary files, we used the 840B GloVe pretrained word embeddings¹⁴ to create a word embedding file for each vocabulary. For words outside the GloVe database, a random initial embedding was set. For details, see Table 5. For our Dual2Seq model, we use the vocabulary of the Bible style that has the source role in that experiment, for both the source and the target side. For our baseline experiments with the Seq2Seq model, we use the general vocabulary, again for both the source and the target side.

Vocabulary   General vocab   Extra AMR words   GloVe embeddings / total vocab
ASV          28,812          5,356             20,947 / 34,168
BBE          28,812          4,533             20,431 / 33,345
KJV          28,812          5,797             20,996 / 34,609
YLT          28,812          4,396             20,858 / 33,208
Baseline     28,812          0                 19,728 / 28,812

Table 5: Description of the vocabularies used in the experiments.
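A minimal sketch of this embedding-file construction is given below: known words receive their GloVe vector, out-of-vocabulary words a random one. The GloVe file name and the simple whitespace parsing are assumptions for illustration, not the exact scripts used in the thesis.

```python
import numpy as np

def load_glove(path, dim=300):
    """Read a GloVe text file into {word: vector}; each line is 'word v1 v2 ... v300'."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            if len(parts) == dim + 1:
                vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def build_embeddings(vocab, glove, dim=300, seed=0):
    """One vector per vocabulary entry: GloVe if available, random initialisation otherwise."""
    rng = np.random.default_rng(seed)
    return {word: glove.get(word, rng.normal(scale=0.1, size=dim).astype(np.float32))
            for word in vocab}

# Hypothetical usage (file name is an assumption):
# glove = load_glove("glove.840B.300d.txt")
# embeddings = build_embeddings(["jehovah", "eat-01", "cheese"], glove)
```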

5.2 Model hyperparameters

We aimed to make the sequence encoder and decoder parts of Dual2Seq and Seq2Seq as similar as possible. Most of our final model hyperparameters are displayed in Table 6. Furthermore, for both models, we use the Adam optimizer (Kingma and Ba [34]) and a beam decoder with a beam size of 32. During training, the model is evaluated on the cross-entropy loss on the development dataset. For Seq2Seq, we trained for 40 epochs instead of Dual2Seq's 50 epochs, because development experiments demonstrated that this model's development loss converged faster. We fixed our word embeddings, so they are not updated during training.

Our BLEU score results kept improving slightly while enlarging our models' dimensions; therefore, it is expected that both Dual2Seq and Seq2Seq could benefit from bigger dimensions and multiple hidden layers. However, we hypothesize that the models cannot take full advantage of larger dimensions, because the datasets are small¹⁵. Additionally, the memory of our available hardware was not sufficient to perform experiments with larger models.

Model hyperparameter                      Dual2Seq   Seq2Seq
Batch size                                64         64
|V|, the vocabulary size                  33k        28k
Attention vector size                     500        500
Min/max nr of hypothesis words            2/50       2/50
Nr of epochs                              50         40
Learning rate                             0.001      0.001
L2 regularization value                   0.001      0.001
Dropout rate                              0.1        0.1
Hidden layer dimension                    500        500
Word embedding dimension                  300        300
Context vector size                       500        -
Max nr of ingoing/outgoing neighbours     2/20       -
T, the number of state transitions        15         -
Neighbour vector dimension                500        -

Table 6: List of model hyperparameters.
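For readability, the settings in Table 6 can be collected into a single configuration object, as in the sketch below. The dictionary layout and key names are our own illustration; the actual training scripts of Song et al. [53] read their options differently.

```python
# Shared settings for both models, taken from Table 6 and the text above.
shared = {
    "batch_size": 64,
    "attention_vector_size": 500,
    "min_max_hypothesis_words": (2, 50),
    "learning_rate": 0.001,
    "l2_regularization": 0.001,
    "dropout_rate": 0.1,
    "hidden_layer_dim": 500,
    "word_embedding_dim": 300,
    "optimizer": "adam",
    "beam_size": 32,
}

dual2seq_config = {**shared, "vocab_size": "33k", "epochs": 50,
                   "context_vector_size": 500, "max_in_out_neighbours": (2, 20),
                   "graph_state_transitions": 15, "neighbour_vector_dim": 500}

seq2seq_config = {**shared, "vocab_size": "28k", "epochs": 40}

print(sorted(set(dual2seq_config) - set(seq2seq_config)))   # the GRN-specific settings
```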

Our final Dual2Seq model has around 30M trainable parameters, whilst Seq2Seq has 20M. Given that Dual2Seq's hyperparameters that are not part of the GRN graph encoder are similar to those of Seq2Seq, it is likely that the extra trainable parameters all come from the graph encoder part of our model. The Dual2Seq model takes approximately 8 hours to train for 50 epochs on GPUs, whereas the Seq2Seq model takes around 3 hours to train for 40 epochs on GPUs.

5.3 Evaluation setup

5.3.1 Quantitative measures

We evaluate our Dual2Seq and Seq2Seq models with the BLEU and PINC score. To obtain the BLEU and PINC scores, we use scripts from Moses¹⁶, the same ones as Carlson et al. [9]: for BLEU we use multi-bleu.perl on our prediction and target sentences, and for PINC we use the PINC.perl script on our prediction and source sentences.

¹⁵ Note that in Song et al. [53]'s NMT experiments, they used similar parameter settings with a dataset almost 8 times as big. On the other hand, their experiments involved translation between different languages instead of solely two styles of English.

5.3.2 Qualitative measures

For the qualitative assessment of our TST experiment, we perform a human judgement study on the sentences predicted by the Dual2Seq model and the Seq2Seq baseline trained on our YLT-to-BBE data pair. YLT-to-BBE was the most difficult data pair for any model to train on, based on the low BLEU score between the two Bible styles (9.42). We have collected the respondents' answers via a Google Form, asking users to evaluate individual sentences. The complete survey layout can be found in Appendix B.

The questions that were asked were inspired by Fu et al. [21], who proposed to evaluate the TST task by separately evaluating 'content preservation' and 'transfer strength'. The human judgement study was divided into three parts:

A. Content preservation
B. Style evaluation
C. Style matching

Note that parts B and C both evaluate the style transfer strength. In part A, users were asked to evaluate whether the prediction sentence had the same intent¹⁷ as its corresponding target sentence. The users could choose between Yes, Partly and No.

In part B, the users were asked if the style of the prediction sentence was more like the original source sentence or the original target sentence. The source and target sentences were randomly swapped in the question. The possible answers consisted of:

• Style 1 (strong preference)

• Style 1 (weak preference)

• Style 2 (weak preference)

• Style 2 (strong preference)

In part C, the users were presented with four sentences from the source Bible (YLT) and the target Bible (BBE), although they were not told which text belonged to which Bible. For the prediction sentence, they had to indicate to which text it belonged. The possible answers were the same as in Part B.

We chose 10 Bible verse numbers from our YLT-to-BBE test set and gathered the accompanying predictions from both Dual2Seq and Seq2Seq. The users were not specifically told that the prediction sentences were modeled after either of the two given reference sentences. For the selection of Bible verse numbers, we selected sentences based on their length (more than 10 words, such that the sentences would be stylistically distinct enough) and on whether or not they contained apostrophes. Although we consider punctuation a valid part of a text style, we found that the YLT sentences have a very distinct way of using apostrophes, such that if all samples contained apostrophes, the task would become trivial. Therefore, we selected five sentences originally containing apostrophes and five sentences without them.

To assess both models fairly, we created two versions, survey A and survey B, in which the Dual2Seq and Seq2Seq predictions were mixed randomly. The users were asked to do either survey A or survey B, such that each user would evaluate five sentences from both models. Because it is not a straightforward task to evaluate the predicted sentences, the users were given some example questions and answers for part A, which can be found in Appendix B. There were 18 respondents in total, of which 6 are male and 12 are female. 9 respondents took survey A and 9 took survey B. Both survey A and survey B were filled out by one native English speaker. The respondents have an average age of 29.9 years with a standard deviation of 10.98.

¹⁷ We used the word intent in the user study instead of content, as we found that this way the question was more


6 Results

6.1 Quantitative results

The results of our quantitative experiments are given in table 7. We observe that for the experiments where the test sets are decoded with models trained on the same data pair, the baseline model outperforms our Dual2Seq model by between 0.67 and 2.5 BLEU points. As in the work of Carlson et al. [9], our neural models are outperformed by the statistical model Moses.

The Dual2Seq models yield a higher PINC score than the Seq2Seq model, except on the BBE-to-ASV train and test set. The difference ranges from 0.09 to 5.05 PINC points. The differences with the Moses model are larger (from 7.64 to 20.6 PINC points), which is in line with the observation from Carlson et al. [9] that Moses uses more words from the source sentence ('conservative prediction') than their neural model.
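For reference, PINC measures the fraction of n-grams in the prediction that do not occur in the source sentence, averaged over n-gram orders. The sketch below is a minimal implementation of this idea (up to 4-grams, with simple whitespace tokenization as an assumption); it may differ in details from the evaluation script we used.

```python
# Minimal sketch of the PINC metric: the average fraction of prediction n-grams
# (n = 1..4) that do not occur in the source sentence, scaled to 0-100.
# Whitespace tokenization is an assumption of this sketch.
def ngrams(tokens, n):
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def pinc(source, prediction, max_n=4):
    src, pred = source.split(), prediction.split()
    scores = []
    for n in range(1, max_n + 1):
        pred_ngrams = ngrams(pred, n)
        if not pred_ngrams:          # prediction shorter than n tokens
            continue
        overlap = len(pred_ngrams & ngrams(src, n))
        scores.append(1.0 - overlap / len(pred_ngrams))
    return 100.0 * sum(scores) / len(scores) if scores else 0.0

# Higher PINC = more n-grams changed with respect to the source sentence.
print(pinc("and Jonathan addeth to cause David to swear",
           "And Jonathan again took an oath to David"))
```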

Overall, the differences in BLEU and PINC scores among Bible style pairs are as expected: data pairs that are very similar in style (KJV and ASV) yield higher BLEU scores and lower PINC scores than data pairs that are relatively distinct in style (YLT and BBE).

Train set     Test set        BLEU       BLEU      BLEU     PINC       PINC      PINC
                              Dual2Seq   Seq2Seq   Moses*   Dual2Seq   Seq2Seq   Moses*
ASV-to-BBE    ASV-to-BBE      32.05      34.01†    -        67.36      62.68     -
BBE-to-ASV    BBE-to-ASV      27.84      28.74†    31.28    65.40      67.63     47.03
ASV-to-KJV    ASV-to-KJV      69.55      71.95†    -        22.58      18.62     -
KJV-to-ASV    KJV-to-ASV      70.06      71.54†    76.77    20.56      16.96     9.32
BBE-to-YLT    BBE-to-YLT      19.41      20.22†    -        81.75      76.70     -
YLT-to-BBE    YLT-to-BBE      22.32      22.99†    24.01    79.23      77.00     66.47

Table 7: BLEU and PINC scores of our models. *Moses is not a fair comparison model, but we report its results for perspective on the state of the art for the specific data pair. † represents a significant result (Koehn [36], p < 0.01).

We have also experimented with decoding test sets with a different source Bible style than the model was trained on. The results of these experiments are given in table 8. Interestingly, our Dual2Seq model yields a higher BLEU score than Seq2Seq for the model trained on KJV-to-ASV and tested on BBE-to-ASV, although only by a small margin of 0.31 BLEU points, which is not significant. The Seq2Seq model outperforms the Dual2Seq model on the other two Bible pairs, with a BLEU score improvement of 0.50 for the YLT-to-BBE test set and 3.18 for the KJV-to-ASV test set.

We conclude that Dual2Seq does not outperform our baseline in terms of BLEU score on any of the Bible pairs we have trained the TST models on. Lastly, for all Bible pairs except one, the Dual2Seq model outperforms Seq2Seq in terms of PINC score, meaning that Dual2Seq makes more alterations to the source sentence than Seq2Seq. We do not know whether these alterations are in the target style domain.


Train set     Test set        BLEU       BLEU      PINC       PINC
                              Dual2Seq   Seq2Seq   Dual2Seq   Seq2Seq
ASV-to-BBE    YLT-to-BBE      17.61      18.11     72.06      67.79
BBE-to-ASV    KJV-to-ASV      33.31      36.49     61.58      57.11
KJV-to-ASV    BBE-to-ASV      22.58      22.27     73.43      73.34

Table 8: BLEU and PINC scores for models tested on a different source Bible style than they were trained on. † represents a significant result (Koehn [36], p < 0.01).
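The significance marks (†) in tables 7 and 8 refer to the paired bootstrap resampling test of Koehn [36]. Below is a minimal sketch of such a test; the use of sacrebleu for corpus-level BLEU and the number of resamples are illustrative assumptions and may differ from our actual setup.

```python
# Sketch of paired bootstrap resampling (Koehn, 2004): repeatedly resample the
# test set with replacement and count how often system A beats system B in BLEU.
# sacrebleu and 1000 resamples are illustrative choices, not our exact setup.
import random
import sacrebleu

def paired_bootstrap(sys_a, sys_b, refs, n_samples=1000, seed=0):
    rng = random.Random(seed)
    ids = range(len(refs))
    wins_a = 0
    for _ in range(n_samples):
        sample = [rng.choice(ids) for _ in ids]   # resample sentence ids with replacement
        bleu_a = sacrebleu.corpus_bleu([sys_a[i] for i in sample],
                                       [[refs[i] for i in sample]]).score
        bleu_b = sacrebleu.corpus_bleu([sys_b[i] for i in sample],
                                       [[refs[i] for i in sample]]).score
        if bleu_a > bleu_b:
            wins_a += 1
    return wins_a / n_samples

# System A is considered significantly better than system B at p < 0.01 if it
# wins on at least 99% of the resampled test sets.
```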

6.2 Qualitative results

In this section we discuss the results of the human judgement evaluation. More detailed insights into the complete results can be found in Appendix B, together with the 10 sentences that were evaluated by the respondents and the example questions that respondents were given before filling in their answers. First, we summarize the aggregated results of the human judgement evaluation in table 9. Afterwards, we report the results for each separate part of the study.

              Correct answers for Dual2Seq    Correct answers for Seq2Seq
Part A        64/90                           53/90
Part B        74/90                           85/90
Part C        70/90                           73/90

Table 9: Summary of the human judgement evaluation study. A 'correct answer' for part A means the answer was Partly or Yes. A 'correct answer' for parts B and C means the answer was a weak preference or strong preference for the target Bible style: the BBE Bible. For each part, a model could get 90 correct answers: 9 responses per sentence.
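To make the aggregation in table 9 explicit, the sketch below shows how raw survey answers could be turned into these tallies. The data structure and the answer labels (after mapping the randomized 'Style 1'/'Style 2' options back to the source and target styles) are assumptions for illustration, not our exact processing script.

```python
# Sketch of how raw survey answers are aggregated into the tallies of table 9.
# Each response is a (part, model, answer) tuple; the answer labels are assumed
# to have been mapped back from the randomized "Style 1"/"Style 2" options.
from collections import Counter

def is_correct(part, answer):
    if part == "A":                                   # content preservation
        return answer in {"Yes", "Partly"}
    # parts B and C: any weak or strong preference for the target style (BBE)
    return answer in {"target (weak preference)", "target (strong preference)"}

def tally(responses):
    counts = Counter()
    for part, model, answer in responses:
        if is_correct(part, answer):
            counts[(part, model)] += 1
    return counts

example = [("A", "Dual2Seq", "Partly"),
           ("B", "Seq2Seq", "target (strong preference)"),
           ("C", "Dual2Seq", "source (weak preference)")]
print(tally(example))   # Counter({('A', 'Dual2Seq'): 1, ('B', 'Seq2Seq'): 1})
```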

Part A: Content preservation

In this part of the survey, respondents were asked if the studied sentence had the same intent as its equivalent original target sentence from the BBE Bible. A visualization of the results is given in figure 8.

Figure 8: Answers to part A of the study. On the x-axis: sentences 1-10.

We investigate for how many predictions the majority of the users answered either Partly or Yes. For Dual2Seq, this was the case for 8 out of 10 predictions; for the Seq2Seq baseline, for 6 out of 10 predictions.


For both Dual2Seq and Seq2Seq, the prediction's content for sentence 3 was considered the least similar to the original version. The sentence where both models were evaluated as best at matching the intent was sentence 8. Both are given in table 10. We hypothesize that the phrases that occur in sentence 3 are rarely seen in the training data, whereas sentence 8 might be easy to classify because of the difference in using Lord or Jehovah.

Model Sentences

3 YLT (source) Not yet unto blood did ye resist – with the sin striving;

BBE (target) Till now you have not given your blood in your fight against sin:

Dual2Seq But you have need of blood for blood, and have no part in the wrong.

Seq2Seq For it is not right to come to blood to be moved with the sinner.

8 YLT (source) Because thou hast not hearkened to the voice of Jehovah, nor didst the fierceness of His anger on Amalek – therefore this thing hath Jehovah done to thee this day;

BBE (target) Because you did not do what the Lord said, and did not give effect to his burning wrath against Amalek. So the Lord has done this thing to you today.

Dual2Seq Because you have not given ear to the voice of the Lord, or the heat of his wrath on Amalek; for this is the Lord has given you this day.

Seq2Seq Because you did not give ear to the voice of the Lord, and did not let his wrath go on Amalek, for this cause the Lord has done this thing.

Table 10: Sentences 3 and 8, evaluated for both models as respectively the worst and best at matching the target's content.

Furthermore, both models receive an equivalent evaluation on the rest of the sentences, except for sentences 1 and 2, where the Dual2Seq predictions were evaluated to preserve the target's meaning more often than the predictions from Seq2Seq. These sentences are given in table 11. We argue that in sentence 1, a large part of the style transfer by Seq2Seq went well, but some phrases altered the meaning awkwardly. For sentence 2, we suspect users answered that Seq2Seq's prediction did not have the same intent because the model translated Jonathan to David. It is possible that the Dual2Seq model was helped here by the source sentence's AMR graph to translate the different entities correctly; however, this remains speculation.


Model Sentences

1 YLT (source) And David smiteth them from the twilight even unto the evening of the morrow, and there hath not escaped of them a man, except four hundred young men who have ridden on the camels, and are fled.

BBE (target) And David went on fighting them from evening till the evening of the day after; and not one of them got away but only four hundred young men who went in flight on camels.

Dual2Seq So David overcame them from the evening, to the evening of the day after, and the four hundred young men who had been stretched out on the camels and came to flight.

Seq2Seq And David sent them from the morning till the evening of the day; and there was no one of them, but only two hundred young men who were seated on the camels and went in flight.

2 YLT (source) and Jonathan addeth to cause David to swear, because he loveth him, for with the love of his own soul he hath loved him.

BBE (target) And Jonathan again took an oath to David, because of his love for him: for David was as dear to him as his very soul.

Dual2Seq And Jonathan would put David to his oath, because he was dear to him; for he had love for him.

Seq2Seq And David made a request to David that he was dear to him, because he had love for his soul.

Table 11: Sentences 1 and 2: two sentences where Dual2Seq preserved the meaning and Seq2Seq did not, as evaluated in the human judgement study.

Part B: Style evaluation

In part B, we asked users to consider the predicted sentence and choose which of the two reference sentences (the source and target equivalent of the sentence) matched the style of the predicted sentence the most. We visualize the results of the answers for this part in figure 9.

Figure 9: The results for Part B of the human judgement study. On the x-axis: sentences 1-10; on the y-axis: the number of correct answers per model (Dual2Seq and Seq2Seq).

We interpret the results as follows: a prediction was classified as the target style if at least 7 out of 9 respondents answered that they had a preference for the target style over the source style. For Dual2Seq, this is the case for 7 out of 10 sentences. Following this interpretation, for Seq2Seq all 10 predictions were classified as the target style. Two of the sentences where Dual2Seq did not transfer the style well are displayed in table 12.

Model Sentences

6 YLT (source) (for nothing did the law perfect) and the bringing in of a better hope, through which we draw nigh to God.

BBE (target) (Because the law made nothing complete), and in its place there is a better hope, through which we come near to God.

Dual2Seq ( For it is only the law that the law is true, and the making of being true hope, through which we are near God.

Seq2Seq ( Because we did the law complete; and the coming part of a better hope, through which we are to God.

9 YLT (source) looking diligently over lest any one be failing of the grace of God, lest any root of bitterness springing up may give trouble, and through this many may be defiled;

BBE (target) Looking with care to see that no man among you in his behaviour comes short of the grace of God; for fear that some bitter root may come up to be a trouble to you, and that some of you may be made unclean by it;

Dual2Seq Let no one who is able to keep away from the grace of God, so that the root of the earth may be moved to wrath, and through this number may be made unclean.

Seq2Seq And all who have been made responsible for fear of the grace of God, for fear that the root of a bitter root may come on them, and that through this great number may be made unclean.

Table 12: Sentences 6 and 9: for these sentences, the style of the Dual2Seq predictions is not distinctive enough to be classified as the target style.

For these sentences, there were relatively few specific stylistic differences between the source and target sentence in the first place. However, Seq2Seq mimicked the first part of sentence 6 better than Dual2Seq, which probably made users choose the target style as an answer more often.

Part C: Style matching

In part C, we first present the user with a snippet of text in both the source and the target style. We ask them to evaluate the predicted sentence on style again, and to choose to which snippet's style the prediction belongs the most. A summary of the results is given in figure 10. We interpret the results in the same way as in Part B: a prediction was classified as the target style if at least 7 out of 9 respondents answered that they had a preference for the target style over the source style.

Following this interpretation, for Dual2Seq the transfer strength was again not strong enough for sentences 6, 7 and 9, as well as for sentence 3. For Seq2Seq, the style transfer strength was debatable for sentences 3, 6, 8 and 9. There is a clear overlap between the models, which is an indication that for those sentences it is hard for the models to transfer the target style to the prediction.
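The at-least-7-out-of-9 rule and the overlap between the models can be made concrete with the small sketch below. The per-sentence counts are read off figure 10 (they sum to the Part C totals in table 9); treat them as approximate, since they were recovered from the plot.

```python
# The >= 7-out-of-9 rule from parts B and C, applied to the per-sentence counts
# of respondents preferring the target style in part C (read off figure 10;
# approximate). The overlap gives the sentences neither model transferred well.
def classified_as_target(counts, threshold=7):
    return {sid for sid, c in counts.items() if c >= threshold}

dual2seq = {1: 9, 2: 8, 3: 6, 4: 9, 5: 8, 6: 5, 7: 5, 8: 7, 9: 5, 10: 8}
seq2seq  = {1: 9, 2: 8, 3: 6, 4: 8, 5: 8, 6: 5, 7: 9, 8: 6, 9: 6, 10: 8}

sentences = set(dual2seq)
hard_for_both = ((sentences - classified_as_target(dual2seq))
                 & (sentences - classified_as_target(seq2seq)))
print(sorted(hard_for_both))   # sentences neither model transferred convincingly
```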


Figure 10: The results for Part C of the human judgement study. On the x-axis: sentences 1-10; on the y-axis: the number of correct answers per model (Dual2Seq and Seq2Seq).

6.3 Case study on data from another domain

To investigate how our models would perform on data from another domain (where probably many words are outside of our models' vocabulary), we decode sentences from The Little Prince together with their hand-annotated AMR graphs. We decode with Dual2Seq and Seq2Seq models trained on the BBE-to-YLT data pair. Of all the Bibles in our experiments, we argue that BBE is probably the closest to The Little Prince's writing style.

In table 13 we show some of the resulting predictions. We observe that the model has in fact transferred some stylistic aspects of the YLT Bible onto the original sentences, e.g. ye instead of you and hath instead of was. Although the sentences are mostly nonsensical, we hypothesize that, if we wanted to, we could match each prediction to the right original source sentence. A last observation is that in sentence 3, Dual2Seq incorrectly translates little prince to little Philip. We are confident that this odd mistake is caused by faulty AMR parsing: as in our example in Appendix A, the JAMR parser incorrectly parsed 'Prince' to the named entity 'Prince Philip, Duke of Edinburgh', which occurs 10 times in our BBE-to-YLT training set.
