
Integrating a Language Style Matching Objective into Deep Neural Networks for Dialogue Response Generation

B.W. van Vulpen
11865210

Bachelor thesis
Credits: 18 EC

Bachelor Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor:
dr. Svitlana Vakulenko
Informatics Institute
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam

June 26th, 2020


Acknowledgements

Throughout this research and the writing of this thesis, I have received a great deal of support and computational resources.

First, I would like to thank my supervisor Svitlana Vakulenko for the support and inspiration during this research and the writing of this thesis.

I would also like to thank SURFsara for providing their Lisa GPU cluster, which made training and evaluating models a lot faster.


Abstract

In social interactions, people tend to mimic each other's use of natural language, which benefits both participants. Language style matching is a measure for analyzing such mimicry in dialogues, based on the synchronized use of specific function words.

In an attempt to integrate language style matching into a dialogue response generation model, this research explores and implements loss integration and weighted decoding in a state-of-the-art Transformer model called GPT-2.

The results of this research indicate that the loss integration approach failed to make the model inherently learn to mimic the conversation partner's language style, whereas the weighted decoding approach did show some improvements in terms of language style matching.


Contents

1 Introduction
2 Dialogue Response Generation Task
3 Background
  3.1 Mimicry in social interaction
  3.2 Function words as language style
  3.3 Language style matching
4 Related Work
  4.1 Sequence to sequence models
  4.2 Transformer architectures for Seq2Seq modeling
    4.2.1 Attention
    4.2.2 State-of-the-art architectures
5 Methods
  5.1 LSM score
  5.2 LSM integration methods
    5.2.1 Loss integration
    5.2.2 Weighted decoding
6 Experimental Evaluation
  6.1 Dataset: ConvAI2 PersonaChat
  6.2 Baseline model
  6.3 Experimental setup
    6.3.1 Model setup
    6.3.2 Evaluation
7 Results
  7.1 Quantitative results
  7.2 Dialogue response examples
8 Conclusion and Discussion
References

1 Introduction

Mimicry in a dialogue has bidirectional influences and benefits both participants of that dialogue (Stel & Vonk, 2009). In general, nonverbal and verbal mimicry can occur between friends (McFarland, 2001), strangers (Chartrand & Bargh, 1999) and members of a group (Lakin et al., 2008). The basic tendency to mimic others also occurs in natural language. Function words tend to play a role in natural language mimicry, and they are also related to several emotions. Language style matching, or LSM, is a term to describe the synchronized use of function words during verbal social interaction. Previous research has established that across social interactions, people engaging in a dialogue instantly matched, and continued to match, each other's use of function words throughout the conversation (Niederhoffer & Pennebaker, 2002).

Dialogues can be created by people, but also by machines or neural networks. In Natural Language Processing (NLP), one of the main research fields of artificial intelligence, a large number of models have managed to generate natural language. In NLP's subfield of neural dialogue response generation, approaches using Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) architectures were the state of the art for a long time (Wen et al., 2015). In recent years, however, a time-independent generative architecture has emerged. This model architecture, called the Transformer (Vaswani et al., 2017), mainly uses attention mechanisms to accurately generate text and dialogue responses. The main advantages of these Transformer models are that they are flexible, applicable to transfer learning (Tan et al., 2018) and outperform traditional LSTM-based language generation models.

The aim of this research is to establish a language style matching objective in a state-of-the-art Transformer model for dialogue response generation. Therefore, the research question will be as follows:

What technique should be used to explicitly integrate a language style matching objective into a dialogue response generation model?

The Transformer model will be implemented and fine-tuned on a chit-chat dialogue dataset. The ultimate goal is to make this model adapt to and mimic the language style of the dialogue partner, in a way that could realistically be adopted by today's most widely used chatbots.

The exploration of the possibilities of integrating LSM into a dialogue response generation model starts by describing the dialogue response generation task in Section 2. The background and related work in the fields of language style and neural dialogue response generation are assessed in Sections 3 and 4. The method and experimental setup are covered in Sections 5 and 6. Section 7 assesses the results, followed by the conclusion, discussion and future work in Section 8.

2 Dialogue Response Generation Task

In the dialogue response generation task, the goal is to generate a response to the utterance of the dialogue partner, based on the dialogue history and any additional context. Consider a dialogue training sample x, containing a sequence of N utterances by two speakers.

$$ x = \{x_1, x_2, x_3, \dots, x_N\} \qquad (2.1) $$

The sequence x consists of utterances from both dialogue participants. Each utterance x_t is a sequence of word tokens at timestep t, of variable length S_t, such that:

$$ x_t = \{x_t^1, x_t^2, x_t^3, \dots, x_t^{S_t}\} \qquad (2.2) $$

Each token of x_t is in the vocabulary V, so x_t^j ∈ V. The history of the dialogue x_th at timestep t contains all utterances that have been made up to and including t. Any additional, time-independent context is represented by c.

$$ x_{th} = \{x_1, x_2, x_3, x_4, \dots, x_t\} \qquad (2.3) $$

The goal of the dialogue response generation task is, given history x_th and context c, to generate response y_t at timestep t, which is a sequence of word tokens with variable length U_t, where each token y_t^j ∈ V.

$$ y_t = \{y_t^1, y_t^2, y_t^3, \dots, y_t^{U_t}\} \qquad (2.4) $$

During training, the goal is to minimize the loss between the distribution of the predicted response y_t and the target response x_{t+1}.
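As a concrete illustration, the sketch below lays out these objects for a toy dialogue. The utterances and the simple word-level tokenization are invented for illustration only.

```python
# A toy illustration of the notation above; the utterances and the
# simple word-level tokenization are invented examples.
dialogue = [                                              # x = {x_1, ..., x_N}
    ["hi", ",", "how", "are", "you", "?"],                # x_1 (speaker two)
    ["i", "am", "fine", ",", "thanks", "!"],              # x_2 (speaker one)
    ["what", "do", "you", "do", "for", "a", "living", "?"],  # x_3
    ["i", "work", "at", "a", "smoothie", "shop", "."],    # x_4
]
t = 3
history = dialogue[:t]  # x_th: all utterances up to and including x_t
target = dialogue[t]    # x_{t+1}: what the predicted y_t is trained to match
```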

3 Background

In this section, the background of language style matching is discussed. The purpose of this section is to give a solid grasp of the topic and to provide a better understanding of the integration methods proposed in this research.

3.1 Mimicry in social interaction

In dialogues and other verbal or interpersonal interactions, people tend to mimic each other's behavior (Chartrand & van Baaren, 2009). This can take the form of verbal mimicry and nonverbal mimicry. Verbal mimicry refers to the synchronized use of language and intonation between two conversation partners. For example, when someone uses a loud voice with much swearing, the dialogue partner will tend to speak in such a way as well. The same applies to nonverbal mimicry, for example when people synchronize their sitting posture when talking to each other. Most of this verbal and nonverbal mimicry happens unconsciously (Chartrand & van Baaren, 2009).

Verbal and nonverbal mimicry occur not only between friends (McFarland, 2001), but also in groups of strangers (Yabar et al., 2006). While mimicry functions as a pipeline for several psychological elements like emotions (Hatfield et al., 1993), attitudes (Ramanathan & McGill, 2007), rapport and affiliation (Chartrand & Bargh, 1999), it also benefits social interaction. The synchronized use of nonverbal elements in social interactions has social benefits such as liking, rapport, affiliation, safety and cohesion, also referred to as the chameleon effect (Chartrand & Bargh, 1999). For verbal mimicry, a less studied phenomenon, such an effect exists as well: using the same words also stimulates social benefits like rapport, affiliation and cohesion between the participants (Kulesza et al., 2014).

3.2 Function words as language style

In verbal mimicry, which carries over into natural language, an important distinction should be made between language content and language style. The language content of a (spoken) text describes what the person is semantically saying or writing. It includes all kinds of words that give semantic meaning to the text, which can be nouns and verbs, but also adverbs and adjectives. Language style determines how the person is characterizing the content of the text. The style of a text can be described with function words. Function words can be narrowed down into several word categories like articles, prepositions, adverbs, conjunctions and quantifiers. Each of those word categories contributes a specific function to the characterization and style of (a part of) the text. Previous research has shown that function words are related to several types of social emotions or states, such as depression (Rude et al., 2004), leadership (Chung & Pennebaker, 2007) and status (Kacewicz et al., 2014). Honesty has also been found to be related to the use of function words (Newman et al., 2003). For instance, when someone is lying about something, he or she will make different use of function words than when being honest. Another example is that angry people will tend to use different function words than happy people.

Different people have different emotions, which are reflected in their language style. Since the use of function words is inherently emotional and social, it can be used as a powerful representation of a person's language style (Ireland & Pennebaker, 2010).

3.3 Language style matching

Language style matching, or LSM, refers to a technique that analyzes the harmonized use of function words in social interaction, introduced by Niederhoffer & Pennebaker (2002). LSM measures the similarity in the use of different function words between conversation partners in a dialogue. Ireland & Pennebaker (2010) designed a measure to calculate the 'matching' of a specific function word category using the percentage of words from that category used throughout the conversation. This measure will be used to integrate language style matching into a dialogue response generation model. More details on this measure and its integration are given in Section 5.

In addition, language style matching has been explored in substantially more areas. For example, a previous study has shown that LSM leads to greater perceptions of empathy during therapeutic sessions (Lord et al., 2014). Previous studies also contend that LSM affects cooperation. For instance, it has been shown that LSM can predict the level of agreement in diplomatic negotiations (Bayram & Ta, 2018). LSM also seems to magnify the positive or negative tenor of cooperative interactions (Bowen et al., 2017; Yilmaz, 2016). Lastly, previous studies suggest extended LSM measures, such as incorporating turn-taking (Müller-Frommeyer et al., 2018).

4 Related Work

Chatbots are the most commonly known application of dialogue response generation models. In a chatbot, a real person talks to a 'robot' or 'bot'. Using the context (contextual information given beforehand) and the previous utterances in the conversation, a dialogue response generation model (or bot) is capable of responding to the real person with a human-like response. Several ways exist to implement such a model, but the most widely used type is the sequence to sequence model.

4.1 Sequence to sequence models

Sequence to sequence, or Seq2Seq, models are designed to map an input sequence to an output sequence, where the input length and output length may differ (Sutskever et al., 2014). Seq2Seq models are built on neural network architectures and can be used for a great number of NLP tasks like machine translation, text generation, image captioning, dialogue response generation and text summarization. For example, given a sequence of dialogue utterances as input, a well-trained Seq2Seq model is able to generate a human-like response as output.

The Seq2Seq architecture mainly consists of two parts: an encoder and a decoder. In its simplest form, the encoder consists of a chain of RNN or LSTM units that each take a single element of the input sequence. Such an input element can be, for instance, a token or word of an English sentence that has to be translated to Dutch, or an utterance in a dialogue that needs a reply. Each input element at timestep t is represented as x_t, and each hidden state h_t, which is the output of a unit at timestep t, is calculated as follows:

$$ h_t = f(W_{hh} h_{t-1} + W_{hx} x_t) \qquad (4.1) $$

Equation 4.1 in fact represents the hidden state calculation of an ordinary RNN, because the corresponding weight matrices W are applied to the previous hidden state h_{t-1} and the input vector x_t. The hidden state h_0 represents the initial hidden state of the encoder. As can be seen in Figure 1, the encoder first embeds each word x_t using embedding techniques like GloVe (Pennington et al., 2014) or Word2Vec (Mikolov et al., 2013) to map the words to a vector representation. Eventually, the final hidden state vector of the encoder, h_enc, is calculated using Equation 4.1 as well. This vector provides an approximate encapsulation of all information about the input elements. Subsequently, the vector is used as the initial hidden state during decoding, in order to let the decoder produce accurate predictions based on the information about the input elements.

Figure 1: The encoder-decoder structure in a sequence to sequence model

The decoder likewise consists of RNN or LSTM units that each take the previous hidden state d_{t-1} and the previous output y_{t-1} as input and produce an output y_t and a new hidden state d_t at timestep t. The hidden state d_t is also calculated based on the previous hidden state:

$$ d_t = f(W_{hh} d_{t-1}) \qquad (4.2) $$

The final output y_t is the softmax of the current hidden state with weight matrix W_S (see Equation 4.3). The softmax produces a probability vector that can, for example, be used in greedy inference, where the word with the highest probability is chosen as the output word.

$$ y_t = \mathrm{softmax}(W_S d_t) \qquad (4.3) $$

Usually, the initial y_0 is the begin-of-sentence (<BOS>) token and, as previously indicated, the decoder's d_0 is equal to h_enc (see Figure 1). For a dialogue response generation task, the output sequence consists of all words y_i from the response, where i represents the order.

Using a Seq2Seq model, uncorrelated sequences of different lengths can be mapped to each other, which makes it a powerful architecture. Additionally, the Seq2Seq formulation for dialogue response generation assumes (additional) context and history-response pairs in the training set, which differs from the original Seq2Seq paper, which focused on machine translation (Sutskever et al., 2014).
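To make Equations 4.1-4.3 concrete, the following is a minimal sketch of the recurrent encoder-decoder updates with random toy weights. The dimensions and the tanh nonlinearity are illustrative assumptions, and the decoder mirrors the simplified Equation 4.2, which omits feeding y_{t-1} back in.

```python
import numpy as np

# A minimal sketch of the encoder-decoder updates in Equations 4.1-4.3,
# with toy dimensions and random weights; not a trained Seq2Seq model.
hidden, vocab = 8, 20
W_hh = np.random.randn(hidden, hidden) * 0.1  # recurrent weights
W_hx = np.random.randn(hidden, hidden) * 0.1  # input weights
W_S = np.random.randn(vocab, hidden) * 0.1    # output projection

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Encoder: fold the embedded input tokens into one state (Eq. 4.1).
x = [np.random.randn(hidden) for _ in range(5)]  # embedded input tokens
h = np.zeros(hidden)                             # initial hidden state h_0
for x_t in x:
    h = np.tanh(W_hh @ h + W_hx @ x_t)
h_enc = h  # final encoder state, used to initialize the decoder

# Decoder: update the state (Eq. 4.2) and emit a distribution (Eq. 4.3).
d = h_enc
for _ in range(3):
    d = np.tanh(W_hh @ d)            # d_t = f(W_hh d_{t-1})
    y_t = softmax(W_S @ d)           # probability vector over the vocabulary
    next_word = int(np.argmax(y_t))  # greedy inference
```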

4.2 Transformer architectures for Seq2Seq modeling

For many years, popular methods for Seq2Seq modeling have been based on RNNs and LSTMs. These models use sequential processing, which means that sentences are passed in word by word. They also retain past information through past hidden states, following the Markov assumption, in which each state depends only on the previously seen state. Due to the Markov assumption, RNNs and even LSTMs sometimes fail to capture long-term dependencies in and between sequences (Bengio et al., 1993). Furthermore, due to the use of sequential processing, these models cannot be trained in parallel, which increases the training time.

In 2017, Vaswani et al. introduced the Transformer, a new type of Seq2Seq architecture. Transformers do not use sequential processing and recurrent mechanisms (as explained in Section 4.1), but process sentences as a whole instead of word by word. The non-sequential aspect allows the model to be trained in parallel, which reduces training time. Furthermore, the avoidance of recurrent mechanisms and the use of attention allow the model to strongly capture long-term dependencies. Additionally, Transformers use positional encoding to encode the sentences in such a way that each token is encoded relative to a specific position. Lastly, the Transformer introduced multi-head attention and self-attention, which are the powerful units behind its architecture.

4.2.1 Attention

In neuroscience, attention has been a widely studied topic. These studies also involved visual attention, in which people focus on different parts of their visual inputs and subsequently generate a response (Summerfield et al., 2006; Borji & Itti, 2012).

Attention in deep neural Seq2Seq networks addresses the same idea of focusing on specific parts of the input, and identifies relevant (cor)relations between two sentences. The mechanism takes two sentences and turns them into a matrix where the rows correspond to the words in the first sentence and the columns to the words in the second sentence. Each element in the matrix identifies a relevancy measure between two words (see Figure 2 for an example with neural machine translation).

Figure 2: Example of an attention relevance matrix in machine translation (Bahdanau et al., 2014)


The matrix as seen above can be created with two different sentences, but also with a single sentence. This is known as self-attention, in which the rows and columns of the relevance matrix correspond to the words of the same sentence. Self-attention helps a neural network find correlations within a sentence itself. For example, the word 'he' in the sentence 'I like Peter when he laughs.' will have a high relevancy measure with the word 'Peter'. Another important form of attention in Transformer models is multi-head attention, which consists of several single attention heads, where each head applies the attention mechanism in a different representation subspace. Additionally, multi-head attention expands the model's ability to focus on different positions (Vaswani et al., 2017).
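As an illustration of the relevance matrix described above, the following is a minimal sketch of self-attention in its common scaled dot-product form, using random toy embeddings; the projection dimensions are illustrative assumptions.

```python
import numpy as np

# A minimal sketch of a self-attention relevance matrix in its common
# scaled dot-product form, using random toy embeddings.
tokens = ["i", "like", "peter", "when", "he", "laughs"]
d_k = 16
E = np.random.randn(len(tokens), d_k)  # toy token embeddings

# Queries, keys and values are linear projections of the embeddings.
W_q, W_k, W_v = (np.random.randn(d_k, d_k) * 0.1 for _ in range(3))
Q, K, V = E @ W_q, E @ W_k, E @ W_v

scores = Q @ K.T / np.sqrt(d_k)  # token-by-token relevance matrix
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
attended = weights @ V  # each row: a token re-expressed via related tokens
```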

A detailed mathematical explanation of attention and its forms in a Transformer model is beyond the scope of this thesis; for more details, see Attention is all you need by Vaswani et al. (2017).

4.2.2 State-of-the-art architectures

Today, different state-of-the-art Transformer architectures exist for neural text generation. Most of these architectures are also suitable for transfer learning. For example, a model is first trained on a large dataset of text, which makes it learn to generate text. Subsequently, starting from these trained weights, it can be fine-tuned for a specific task like dialogue response generation.

Common well-performing models known today are GPT-2 (Radford et al., 2019) and BERT (Devlin et al., 2018). GPT-2 is constructed with Transformer decoder blocks only, while BERT is built with Transformer encoder blocks only. Whereas GPT-2 is auto-regressive in nature, meaning that each token is predicted based on the previously produced tokens, BERT is not auto-regressive and uses the entire context of the surrounding tokens at once.

Both models are released as models pre-trained on large internet datasets, which can be fine-tuned for several tasks like machine translation, text generation, text summarization, question answering and dialogue response generation.

5 Methods

5.1 LSM score

In order to calculate the LSM score between two sequences of utterances from two conversing speakers, the number of function words in both sequences needs to be counted. The function words were split up into eight categories: articles, prepositions, adverbs, auxiliary verbs, personal pronouns, impersonal pronouns, conjunctions and quantifiers. The function words were loaded into a dictionary where the keys correspond to the function word categories and the values contain the words belonging to each category (see Appendix A).

For each function word category, the number of occurrences of words belonging to that category was counted in both sequences. Using these counts, the percentage representing the use of that specific function word category was calculated. Subsequently, the LSM score between speaker one and speaker two for that specific function word category was calculated using these percentages. Equation 5.1 shows the formula for this LSM score, as proposed by Ireland & Pennebaker (2010), where p_1 corresponds to the function word category percentage of speaker one and p_2 to that of speaker two. Furthermore, 0.00000001 was added to the denominator in order to avoid division by zero.

$$ \mathrm{LSM\ score} = 1 - \frac{|p_1 - p_2|}{p_1 + p_2 + 0.00000001} \qquad (5.1) $$

Eventually, the mean of the LSM scores over all categories was taken as the total LSM score. The lower the LSM score, the worse both dialogue participants match in terms of language style; when the LSM score equals 1.0, their language style is identical. For the integration of LSM into a dialogue response generation model, a different formulation of the LSM score was used, which is assessed in Section 5.2.
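Concretely, Equation 5.1 and the category averaging can be sketched in Python as follows. The trimmed two-category dictionary and the simple regex tokenization are assumptions for brevity; the full eight-category dictionary is listed in Appendix A.

```python
import re

# A minimal sketch of the LSM score in Equation 5.1, using a trimmed
# two-category function word dictionary (full dictionary: Appendix A)
# and simple regex tokenization.
FUNCTION_WORDS = {
    "articles": {"a", "an", "the"},
    "conjunctions": {"and", "but", "or", "if", "because"},
}

def category_percentage(text, category_words):
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = sum(1 for token in tokens if token in category_words)
    return 100 * hits / max(len(tokens), 1)

def lsm_score(speaker1_text, speaker2_text):
    scores = []
    for category_words in FUNCTION_WORDS.values():
        p1 = category_percentage(speaker1_text, category_words)
        p2 = category_percentage(speaker2_text, category_words)
        scores.append(1 - abs(p1 - p2) / (p1 + p2 + 0.00000001))  # Eq. 5.1
    return sum(scores) / len(scores)  # mean over all categories

print(lsm_score("i like the beach and the sun",
                "the sea is nice but it is cold"))
```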

5.2 LSM integration methods

The integration of the LSM objective was addressed with two approaches. The first approach was an integration into the loss function during training, and the second approach was implemented using weighted decoding. Both methods can be applied simultaneously, because the first operates at training time and the second at validation time and inference.

5.2.1 Loss integration

When training the baseline model, a multi-task loss combining language modeling with next-sentence prediction was implemented. For the language modeling (LM) loss, cross-entropy was applied to the model's predicted output logits y and the target response distribution ŷ (see Equation 5.2).

$$ \mathrm{Loss}_{lm}(y, \hat{y}) = -\sum_i^V \hat{y}_i \log(y_i) \qquad (5.2) $$

For the next-sentence prediction (MC) loss, cross-entropy loss was used as well, but now within a classification task, as can be seen in Equation 5.3, where x corresponds to the model's output probability distribution of the classification among N = 4 possible next sentences and c corresponds to the number of the target next sentence (or class).

$$ \mathrm{Loss}_{mc}(x, c) = -\log\left(\frac{\exp(x[c])}{\sum_j^N \exp(x[j])}\right) \qquad (5.3) $$

Finally, these losses were combined into a total multi-task loss, weighted with hyperparameters α_lm and α_mc, as in Equation 5.4.

$$ \mathrm{Loss}_{total} = \alpha_{lm} \mathrm{Loss}_{lm} + \alpha_{mc} \mathrm{Loss}_{mc} \qquad (5.4) $$

In this first approach, the multi-task loss was extended by adding an LSM-based loss. Consider a training sample with a context c and dialogue history x_h, which contains utterances of speaker one (the model) and speaker two (the conversation partner). Feeding such a sample into the model produces a logit word distribution y over the response sentence of speaker one, where each y_i has the size of the vocabulary.

The history was encoded into a distribution using the tokenizer, as was the dictionary containing the function words. The history was split up into two parts, x_h1 and x_h2, each belonging to one speaker. Due to the use of distributions, it was not possible to simply count the number of function words. Instead, each word distribution in the histories x_h1 and x_h2 and output y was compared with the distribution of a specific function word category using KL divergence, which eventually assigned to each word a value of how 'likely' it was to belong to a specific function word category. Consider a word distribution y_i from the output's logit distribution. In order to measure how likely this word was to be, for instance, an article, it was compared to the distribution of articles d_a from the encoded dictionary. The comparison was done following Equation 5.5, where eventually the inverse of the KL divergence was taken, so that a low KL divergence corresponds to a high 'chance' of being an article.

$$ \mathrm{inverse\ KLdiv}(y_i, d_a) = \left( \sum_j^V d_a^j \log\frac{d_a^j}{y_i^j} \right)^{-1} \qquad (5.5) $$

The KL divergence measure was calculated for every word distribution in the output and history parts. Eventually, the mean was taken for each function word category, in order to approximate how 'much' of each function word category the output and history parts contained. This resulted in a mean vector representing the usage of each function word category in the output y and histories x_h1 and x_h2, which replaced the percentage used in the original LSM equation, Equation 5.1. The mean vectors of y and x_h1 were combined into a new mean vector representing the chatbot's (speaker one's) usage of function words. The higher this inverse KL divergence mean of a specific function word category, the more that function word category was represented in the text.


Subsequently, using the mean vectors m_1 and m_2 for speaker one and speaker two respectively, the LSM score for function word category d was calculated with a modification of Equation 5.1, where p_1 and p_2 were replaced by the speakers' corresponding means for that category. Additionally, the '1 −' part was removed in order to make it a loss instead of a score, see Equation 5.6. A multiplication by 100 was applied to map the output of the loss function from the range [0, 1] to the range [0, 100].

$$ \mathrm{Loss}^d_{lsm} = 100 \cdot \frac{|m_1^d - m_2^d|}{m_1^d + m_2^d + 0.00000001} \qquad (5.6) $$

Finally, the mean of all categorized losses was calculated, which resulted in a final LSM loss that was added to Equation 5.4 with hyperparameter α_lsm as follows:

$$ \mathrm{Loss}_{total} = \alpha_{lm} \mathrm{Loss}_{lm} + \alpha_{mc} \mathrm{Loss}_{mc} + \alpha_{lsm} \mathrm{Loss}_{lsm} \qquad (5.7) $$

The alpha hyperparameters were tunable and given as input in the training command. The goal of this method was to let the model inherently learn to adapt to the language style of the conversation partner.
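A hedged PyTorch sketch of Equations 5.5-5.7 is given below. The tensor shapes, the construction of the category distributions (`cat_dists`) and all variable names are illustrative assumptions, not the exact project code.

```python
import torch
import torch.nn.functional as F

# A hedged sketch of the LSM loss (Eqs. 5.5-5.7); shapes and names are
# illustrative assumptions. `cat_dists` holds one distribution over the
# vocabulary per function word category.

def inverse_kl_means(word_logits, cat_dists, eps=1e-8):
    """Mean inverse KL divergence per category (Eq. 5.5).

    word_logits: (seq_len, vocab) logits for a stretch of text.
    cat_dists:   (n_categories, vocab) category distributions.
    """
    log_y = F.log_softmax(word_logits, dim=-1)                       # (S, V)
    # KL(d_a || y_i) = sum_j d_a^j * (log d_a^j - log y_i^j)
    entropy_term = (cat_dists * torch.log(cat_dists + eps)).sum(-1)  # (C,)
    kl = entropy_term[:, None] - cat_dists @ log_y.T                 # (C, S)
    return (1.0 / (kl + eps)).mean(dim=1)                            # (C,)

def lsm_loss(m1, m2, eps=1e-8):
    """Categorized LSM loss (Eq. 5.6), averaged over all categories."""
    return (100 * (m1 - m2).abs() / (m1 + m2 + eps)).mean()

# Combining with the multi-task loss (Eq. 5.7):
# m1 = inverse_kl_means(speaker_one_logits, cat_dists)  # output y and x_h1
# m2 = inverse_kl_means(speaker_two_logits, cat_dists)  # history x_h2
# loss = a_lm * loss_lm + a_mc * loss_mc + a_lsm * lsm_loss(m1, m2)
```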

5.2.2 Weighted decoding

Weighted decoding, introduced by Ghazvininejad et al. (2017), is a method to increase or decrease probabilities based on specifically calculated features during decoding. This method was also studied by See et al. (2019), who applied it to address several problems in dialogue response generation like repetition, specificity and response-relatedness.

As mentioned before, this approach was implemented at validation time (or inference) and did not need any implementation during training, which made it possible to effortlessly combine it with the loss integration approach addressed in Section 5.2.1. During decoding, the next sentence is sampled word by word using a specific sampling method like greedy search, beam search or top-k sampling. At timestep t during decoding, y_t is the word that has to be sampled and y_<t represents the words that have already been sampled. So for each possible next word y_t in the vocabulary, the log probability log P_model(y_t | y_<t, x_h, c) is calculated by the model based on the dialogue history x_h, additional context c and the previously sampled words y_<t. With weighted decoding, the log probabilities of each possible next word in the vocabulary at timestep t are modified with a specific feature, which is usually calculated based on the dialogue history x_h or sampling history y_<t. Such features can be measures like the cosine similarity between responses of speakers in the history, but also real values like the Normalized Inverse Document Frequency (See et al., 2019).

In this approach, LSM was used as the only feature in weighted decoding. Based on the history of the dialogue x_h and the partial hypothesis y_<t of the model, an LSM measure increased or decreased the probabilities of specific function words. As in the loss integration approach, the dialogue history x_h was split up into the history of the chatbot, x_h1, and the history of the conversation partner, x_h2. Instead of calculating the LSM score using Equation 5.1, only the percentages p representing the use of specific function words were used. The chatbot's history x_h1 and the chatbot's partial hypothesis y_<t were combined, and for each function word category the percentage of usage, p_1, was calculated. The same was applied to just the history x_h2, which resulted in the percentage p_2 for each function word category. Eventually, the difference between those percentages was added to the log probability of the words corresponding to a specific function word category d as follows:

$$ \log P_{model}(y_t^d \mid y_{<t}, x_h, c) + w \cdot (p_2 - p_1) \qquad (5.8) $$

where w is a tunable weight parameter for the LSM feature and y_t^d corresponds to a word from function word category d.

For example, if the chatbot used 11% adverbs in (x_h1, y_<t) and the conversation partner used 20% adverbs in x_h2, the increase of the log probabilities of all adverbs in the vocabulary at timestep t was 5.0 × (0.20 − 0.11) = 0.45, assuming w = 5.0.
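A hedged sketch of this adjustment is shown below. The helpers `usage_percentage` (a per-category usage counter as in Section 5.1, returning fractions to match the worked example above) and `category_token_ids` are hypothetical names introduced for illustration.

```python
import torch

# A hedged sketch of the LSM feature in weighted decoding (Eq. 5.8).
# `category_token_ids` (category -> vocabulary ids of its words) and
# `usage_percentage` (per-category usage, returned as a fraction such
# as 0.20 for 20%) are hypothetical helpers used for illustration.

def apply_lsm_weighting(log_probs, chatbot_text, partner_text,
                        category_token_ids, usage_percentage, w=5.0):
    """Shift next-token log probabilities toward the partner's style.

    log_probs:    (vocab,) model log probabilities at timestep t.
    chatbot_text: chatbot history x_h1 joined with the hypothesis y_<t.
    partner_text: conversation partner history x_h2.
    """
    adjusted = log_probs.clone()
    for category, token_ids in category_token_ids.items():
        p1 = usage_percentage(chatbot_text, category)  # chatbot, e.g. 0.11
        p2 = usage_percentage(partner_text, category)  # partner, e.g. 0.20
        # Raise or lower every word of this category by w * (p2 - p1).
        adjusted[token_ids] += w * (p2 - p1)
    return adjusted
```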

6 Experimental Evaluation

6.1 Dataset: ConvAI2 PersonaChat

The dataset that was used is the PersonaChat dialogue dataset (Zhang et al., 2018) from the NeurIPS ConvAI2 Challenge in 2018 (Dinan et al., 2019). The PersonaChat dataset consists of conversations between randomly paired crowdworkers. Each crowdworker was provided with a randomly assigned persona, which is a characterization of a fictional person. An example of a persona could be: 'I am 25 years old and like to play tennis. I also love to read biographies and magazines. I am feeling happy right now. I want an icecream. I like to listen to jazz music.'. The crowdworkers were asked to act out their personas and chat naturally in order to get to know each other over six to eight turns. This produces chit-chat dialogues that aim to mimic a natural conversation between two different people when they first meet and get to know each other. The produced dialogues mainly consist of asking and answering questions about each other's persona. An example of such a dialogue can be seen in Figure 3.

Figure 3: An example of a dialogue in the PersonaChat dataset, including personas (Dinan et al., 2019)

The PersonaChat dataset also tried to address common problems of chit-chat models, like an inconsistent personality (Li et al., 2016), the production of non-specific answers like 'I don't know' (Li et al., 2015) and an incompetent long-term memory (Vinyals & Le, 2015).

The entire dataset contains 162,064 utterances across 10,907 dialogues. Each speaker pair was assigned personas from a set of 1155 personas, each consisting of at least four sentences. The dataset was split up into a training set containing 131,438 utterances (8939 dialogues), a validation set containing 15,602 utterances (1000 dialogues) and a test set with 15,024 utterances (968 dialogues) (Zhang et al., 2018).

6.2 Baseline model

The model used for the integration of the language style matching objectives was a pre-trained Transformer-based GPT-2 model. The model was pre-trained by OpenAI on 40GB of internet text collected via Reddit links. The original model contains 1.5 billion parameters, but in this project the medium-sized GPT2-medium model with 345 million parameters was used. Additionally, a pre-trained GPT-2 tokenizer was used for encoding and decoding sequences.

To implement and fine-tune the baseline model, the PyTorch implementation of the GPT-2 Double-Heads model and the GPT-2 tokenizer from HuggingFace's Transformers library (https://github.com/huggingface/transformers) were used. This Double-Heads model uses two heads with different tasks and losses. As mentioned before, one head computes the language model predictions and its cross-entropy loss; the other head predicts the next sentence using a classification task and also computes its cross-entropy loss. This next-sentence prediction helped the model to take global meaning into account as well, instead of only the local contexts.

The Double-Heads GPT-2 model was fine-tuned on the ConvAI2 PersonaChat dataset following the implementation of HuggingFace's Conversational AI. The personas in PersonaChat were used as additional context c, and α_lm and α_mc were set to 2.0 and 1.0 respectively. This fine-tuned model served as the baseline model for this research. The attempt was to improve over this baseline in terms of language style matching, using the methods proposed in Section 5.2.
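As a minimal sketch, the baseline components can be loaded from HuggingFace's Transformers library as follows; the fine-tuning loop itself (persona formatting, batching, optimization) follows HuggingFace's Conversational AI example and is omitted here.

```python
# A minimal sketch of loading the baseline components; the fine-tuning
# loop (persona formatting, batching, optimization) is omitted.
from transformers import GPT2DoubleHeadsModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-medium")
model = GPT2DoubleHeadsModel.from_pretrained("gpt2-medium")
# The model exposes both heads: language modeling logits for Eq. 5.2
# and multiple-choice (next-sentence) logits for Eq. 5.3.
```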

6.3 Experimental setup

6.3.1 Model setup

The baseline GPT-2 Double-Heads model was fine-tuned on the PersonaChat dataset for one epoch. The baseline models with the integrated LSM loss were fine-tuned for one epoch as well. The fine-tuning was done in parallel using four external NVIDIA Titan RTX 24GB GPUs from the SURFsara Lisa GPU cluster.

This resulted in five model files: one containing the weights of the baseline model, and the others containing the weights of the models with the LSM loss integration for each value of the hyperparameter α_lsm. The hyperparameters α_lm and α_mc for the multi-task loss were set to 2.0 and 1.0 respectively for all models.

Eventually, four types of models were compared: the baseline model, the baseline model with LSM loss integration, the baseline model with weighted decoding and the baseline model with both LSM loss integration and weighted decoding.

When sampling during evaluation on the validation set of the ConvAI2 PersonaChat dataset, top-k sampling in combination with top-p (nucleus) sampling was used, which has been found to be the most promising sampling method for this kind of high-entropy task (Holtzman et al., 2019). This combined sampling method samples from the next-word distribution after filtering this distribution to the top-k words, and then keeps only the smallest set of these top-k tokens whose cumulative probability exceeds a specific threshold (top-p). The top-k value was set to 20 during evaluation, the top-p value to 0.9 and the temperature to 0.7. Furthermore, the maximum history length used during sampling was set to 2 and the maximum sentence length of the produced output was set to 20.
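The following is a hedged sketch of this combined filtering, adapted from the pattern commonly used in HuggingFace's examples rather than the exact project code; the function name and defaults mirror the settings above.

```python
import torch

# A hedged sketch of combined top-k / top-p (nucleus) filtering with
# temperature; adapted from the common pattern in HuggingFace's
# examples, not the exact project code.
def sample_next_token(logits, top_k=20, top_p=0.9, temperature=0.7):
    logits = logits / temperature  # (vocab,) logits for the next word
    # Keep only the k most likely tokens.
    kth_best = torch.topk(logits, top_k).values[-1]
    logits[logits < kth_best] = -float("inf")
    # Of those, keep the smallest set whose cumulative probability
    # exceeds top_p; mask everything after the crossing point.
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cum_probs = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cum_probs > top_p
    remove[1:] = remove[:-1].clone()  # shift: keep the crossing token
    remove[0] = False                 # always keep the best token
    logits[sorted_idx[remove]] = -float("inf")
    return torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
```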

6.3.2 Evaluation

Several model configurations were evaluated. Every configuration was set with different values for the hyperparameters α_lsm and w of the LSM integration approaches.

For each model configuration, a metric evaluation with F1 and Hits@1 scores was carried out using ParlAI's ConvAI2 evaluation script. The perplexity of each model was already calculated in a post-training validation. Additionally, two LSM metrics using Equation 5.1 were calculated for each model configuration. The first metric calculated the total LSM score between the utterances of speaker one, containing only the model's (chatbot) responses, and the utterances of speaker two (conversation partner) in the validation set. This metric enabled comparing all model configurations based on language style matching.

The second metric calculated the total LSM score between the utterances of speaker one, containing only the (human) target responses, and the utterances of speaker two (conversation partner) in the validation set. This second LSM metric enabled judging how human-like the model's LSM-integrated responses were, by comparing it to the first LSM metric.

Lastly, some dialogue examples were generated by the best- and worst-performing models using the interaction script, to attempt to show the influence of the hyperparameters of each LSM objective. During the interactive evaluation, the top-k value was set to 40, the top-p to 0.9, the temperature to 0.7, and the maximum history length and maximum output sentence length were set to 2 and 20 respectively.

The implementation of the LSM objectives and the scripts to reproduce the results can be found in the public GitHub repository of this research.

7 Results

7.1 Quantitative results

The evaluation results of each model configuration for the perplexity, F1, Hits@1 and LSM metrics can be seen in Table 1.

Model configuration       PPL     F1      Hits@1  LSM_model  LSM_human
Baseline                  11.673  0.1736  0.8427  0.6745     0.6690
α_lsm = 1.0               11.691  0.1728  0.8401  0.6737     0.6690
α_lsm = 2.0               11.658  0.1730  0.8467  0.6714     0.6690
α_lsm = 3.0               11.648  0.1752  0.8418  0.6744     0.6690
α_lsm = 4.0               11.685  0.1714  0.8445  0.6733     0.6690
w = 2.5                   11.673  0.1733  0.8427  0.6799     0.6690
w = 5.0                   11.673  0.1750  0.8427  0.6874     0.6690
w = 7.5                   11.673  0.1719  0.8427  0.6923     0.6690
w = 10.0                  11.673  0.1711  0.8427  0.6991     0.6690
α_lsm = 1.0, w = 2.5      11.691  0.1722  0.8401  0.6799     0.6690
α_lsm = 2.0, w = 2.5      11.658  0.1740  0.8467  0.6799     0.6690
α_lsm = 3.0, w = 2.5      11.648  0.1766  0.8418  0.6801     0.6690
α_lsm = 4.0, w = 2.5      11.685  0.1731  0.8445  0.6791     0.6690
α_lsm = 1.0, w = 5.0      11.691  0.1755  0.8401  0.6862     0.6690
α_lsm = 2.0, w = 5.0      11.658  0.1733  0.8467  0.6866     0.6690
α_lsm = 3.0, w = 5.0      11.648  0.1747  0.8418  0.6871     0.6690
α_lsm = 4.0, w = 5.0      11.685  0.1725  0.8445  0.6858     0.6690
α_lsm = 1.0, w = 7.5      11.691  0.1717  0.8401  0.6916     0.6690
α_lsm = 2.0, w = 7.5      11.658  0.1742  0.8467  0.6935     0.6690
α_lsm = 3.0, w = 7.5      11.648  0.1730  0.8418  0.6922     0.6690
α_lsm = 4.0, w = 7.5      11.685  0.1723  0.8445  0.6919     0.6690
α_lsm = 1.0, w = 10.0     11.691  0.1714  0.8401  0.6982     0.6690
α_lsm = 2.0, w = 10.0     11.658  0.1701  0.8467  0.6988     0.6690
α_lsm = 3.0, w = 10.0     11.648  0.1741  0.8418  0.6977     0.6690
α_lsm = 4.0, w = 10.0     11.685  0.1695  0.8445  0.6981     0.6690

Table 1: Evaluation metrics for each model configuration.

All models that use LSM-based weighted decoding outperform the baseline model in terms of the LSM score. The model configurations with α_lsm = 2.0, w = 2.5 and α_lsm = 2.0, w = 7.5 outperform the baseline model on every metric. The model configuration with w = 10.0 has the highest LSM, with an LSM_model of 0.6991, whereas the combined model with α_lsm = 3.0, w = 2.5 performs best in terms of accuracy, with an F1 score of 0.1766. The best-performing models in terms of perplexity are the models with an α_lsm of 3.0, with the lowest perplexity of 11.648. The model configurations that perform best on the Hits@1 score are the models with α_lsm set to 2.0.

The model-independent LSM_human is logically the same for every model. The model with α_lsm = 2.0 comes closest to the human LSM score from the validation set, with an LSM_model of 0.6714.

Lastly, the model configuration that performs best on all metrics together is the model with α_lsm = 2.0, w = 7.5.

7.2 Dialogue response examples

In this subsection, some interactive 6-turn dialogue examples of the baseline and the best-performing model configurations are shown. All interactive dialogue examples used the same persona: 'i watch basketball. i go to a local college. i work at a smoothie shop. i listen to classic rock.' Additionally, all examples tried to keep the inputs as similar as possible, to attempt to expose the differences between the model configurations.

In the figures, P corresponds to the real person talking and M to the model's responses. In Figures 4a and 4b, example dialogues are shown of the baseline model and the baseline model integrated with weighted decoding, where w = 10.0.

(a) Baseline

P: hello, how are you doing?
M: i'm good how are you?
P: i'm good, i'm listening to music.
M: what is your favorite type of music?
P: hip hop and yours?
M: i like classical rock.
P: cool! who is your favourite artist?
M: i like the blues, i'm a fan of the blues.
P: nice, what do you do for a living?
M: i'm a cashier at a smoothie shop.
P: do you have any hobbies?
M: i like to listen to rock music

(b) Baseline + w = 10.0

P: hello, how are you doing?
M: i am doing well and you?
P: i'm good, i'm listening to music.
M: what are you listening to?
P: Hip hop and you?
M: i love classic rock and classical music.
P: cool! who is your favourite artist?
M: i like the blues, i'm a fan of the blues.
P: nice, what do you do for a living?
M: i'm a cashier at a smoothie shop.
P: do you have any hobbies?
M: i do a lot of music, especially classic rock.

Figure 4: Dialogue examples of the baseline model (a) and the baseline model with weighted decoding, w = 10.0 (b)

Raising w to an extremely high value like 100.0 causes the model to become extremely inaccurate and lose control, as can be seen in Figure 5.

w = 100.0

P: hello, how are you doing?
M: was no no no no no no no no no no no no no no no no no no no
P: i'm good, i'm listening to music.
M: no no no no no no no no no no no no no no no no no no no no
P: Hip hop and you?
M: no no no no no no no no no no no no no no no no no no no no
P: cool! who is your favourite artist?
M: no no no no no no no no no no no no no no no no no no no no
P: nice, what do you do for a living?
M: no no no no no no no no no no no no no no no no no no no no
P: do you have any hobbies?
M: no no no no no no no no no no no no no no no no no no no no

Figure 5: Dialogue example for a model configuration with an extremely high w

In Figures 6a and 6b, the example dialogues for the best combined model configurations are shown.

(a) α_lsm = 2.0, w = 2.5

P: hello, how are you doing?
M: i'm good just listening to classic rock, what about you?
P: i'm good, i'm listening to music too!
M: what kind of music do you like?
P: hip hop and you?
M: i like classic rock. do you work?
P: yes, i am a lawyer. what do you do for a living?
M: i work at a smoothie shop.
P: do you have any hobbies?
M: i do, i like to play basketball.
P: who is your favorite artist?
M: i like all artists. i listen to rock music.

(b) α_lsm = 2.0, w = 7.5

P: hello, how are you doing?
M: i am doing well, how are you?
P: i'm good, i'm listening to music.
M: that is good. i am watching basketball.
P: nice, what do you do for a living?
M: i work at a smoothie shop.
P: do you have any hobbies?
M: i like to watch the basketball.
P: do you have any more hobbies?
M: i like listening to classic rock.
P: cool! who is your favorite artist?
M: my favorite artist are the blues

Figure 6: Dialogue examples for the best combined model configurations: α_lsm = 2.0, w = 2.5 (a) and α_lsm = 2.0, w = 7.5 (b)

8 Conclusion and Discussion

This project was undertaken to explore, design and evaluate the possibilities of integrating language style matching into a dialogue response generation model. The integration of language style matching was implemented with two different approaches. The first approach was integrating an LSM objective into the loss function, attempting to let the model inherently learn to adapt to the conversation partner's language style when generating responses. The second approach was done by integrating an LSM objective into weighted decoding, where the model was 'forced' to increase or decrease the use of specific function words based on the dialogue history. Additionally, both methods were combined and evaluated as well.

The purpose of this study was to determine the best technique for integrating a language style matching objective into a state-of-the-art dialogue response generation model. The results of this study indicate that the best way to integrate LSM is by using weighted decoding. All model configurations that used weighted decoding showed an improvement over the baseline in terms of LSM, while the model configurations using only the loss integration failed to improve over the baseline in terms of the LSM score. The approach using loss integration only failed to inherently learn to match the conversation partner's language style. However, some configurations of this method did improve over the baseline in terms of perplexity and accuracy, while the configurations with only weighted decoding deteriorated in accuracy. These findings bring us to the third approach of combining both methods. When both methods were combined, a model configuration with slight improvements in terms of both accuracy and language style matching arose.

However, this leads to a trade-off. The question is whether one wants a dialogue response generation model with a stable accuracy or a high LSM score. On the one hand, when raising the weight in weighted decoding, the accuracy will drop. On the other hand, using a model with, for example, only an α_lsm of 3.0 improves accuracy, but raising α_lsm in general does not necessarily improve the accuracy and perplexity. Thus, the trade-off seems to essentially lie in the value of w, where raising it improves LSM but deteriorates accuracy, while lowering it decreases LSM but stabilizes accuracy.

Furthermore, all model configurations in the metric table on average generate responses with an LSM_model score higher than the corresponding LSM_human score. Since the differences between the LSM_model and LSM_human scores are relatively small, the models seem to generate responses with human-like language style matching. However, if the value of w is raised too much, the differences will probably become critical. Finally, considering both methods and their results compared to the baseline, the use of weighted decoding is the answer to the research question. Although it may cost some accuracy, a model trained with an α_lsm set to 3.0 can possibly neutralize a small deterioration in accuracy.

Although this research attempted to improve over a state-of-the-art baseline model, and although weighted decoding turned out to be the best approach, the metrical results do not show major improvements over the baseline in terms of language style matching, though they do remain human-like. The improvements do not seem significant enough to show practical LSM improvements when interactively chatting with the model (see Section 7.2). To create a more visible result in the generated dialogue examples, one could raise the weight of the weighted decoding even more, but then the models drop in accuracy, which again leads to the aforementioned trade-off. Furthermore, whereas this project focused on integrating a language style matching objective, it did not focus on the dialogue response generation model itself. The model used a state-of-the-art architecture with basic parameter values for sampling and the language modeling loss. However, the dialogue response generation model contained many hyperparameters that could be tuned, like the values for top-k, top-p and maximum history, but also the α_lm and α_mc parameters in the multi-task loss. Further research should be undertaken to investigate the influence of these hyperparameters and estimate their optimal values. Estimating such parameters will require a lot of computational GPU power, but eventually, when their optimal values are estimated, a model configuration arises that generates accurate responses with a stable accuracy and an LSM score as high as possible. Additionally, in future research an alternative measure to calculate LSM could be explored and integrated into the model. A more complex measure like rLSM, as proposed by Müller-Frommeyer et al. (2018), could be built into the models. This measure takes temporal reciprocity into account when measuring LSM, but still shows high conceptual similarity with the original LSM measure.

Furthermore, the possibilities of weighted decoding could be explored further by involving more features than only the LSM score. For example, the weighted decoding features proposed by See et al. (2019) could be used in order to improve the language model itself.

Finally, the findings of this research provide insights into the integration of language style matching into dialogue response generation models and suggest a functioning method to implement it. However, many hyperparameters need to be fine-tuned in order to get an optimally performing model that can be implemented in an operational chatbot.


References

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Bayram, A. B. & Ta, V. (2018). Diplomatic chameleons: Language style matching and agreement in international diplomatic negotiations. Negotiation and Conflict Management Research.

Bengio, Y., Frasconi, P., & Simard, P. (1993). The problem of learning long-term dependencies in recurrent networks. In IEEE International Conference on Neural Networks (pp. 1183–1188 vol. 3).

Borji, A. & Itti, L. (2012). State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35.

Bowen, J. D., Winczewski, L. A., & Collins, N. L. (2017). Language style matching in romantic partners’ conflict and support interactions. Journal of Language and Social Psychology, 36(3), 263–286.

Chartrand, T. & van Baaren, R. (2009). Human mimicry. Advances in Experimental Social Psychology, 41, 219–274.

Chartrand, T. L. & Bargh, J. A. (1999). The chameleon effect: the perception-behavior link and social interaction. Journal of Personality and Social Psychology, 76(6), 893–910.

Chung, C. K. & Pennebaker, J. W. (2007). The psychological functions of function words.

Devlin, J., Chang, M., Lee, K., & Toutanova, K. (2018). BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Dinan, E., Logacheva, V., Malykh, V., Miller, A. H., Shuster, K., Urbanek, J., Kiela, D., Szlam, A., Serban, I., Lowe, R., Prabhumoye, S., Black, A. W., Rudnicky, A. I., Williams, J., Pineau, J., Burtsev, M., & Weston, J. (2019). The second conversational intelligence challenge (ConvAI2). CoRR, abs/1902.00098.

Ghazvininejad, M., Shi, X., Priyadarshi, J., & Knight, K. (2017). Hafez: an interactive poetry generation system. In Proceedings of ACL 2017, System Demonstrations (pp. 43–48). Vancouver, Canada: Association for Computational Linguistics.

Hatfield, E., Cacioppo, J. T., & Rapson, R. L. (1993). Emotional contagion. Current Directions in Psychological Science, 2(3), 96–100.

Holtzman, A., Buys, J., Forbes, M., & Choi, Y. (2019). The curious case of neural text degeneration. CoRR, abs/1904.09751.

Ireland, M. & Pennebaker, J. (2010). Language style matching in writing: Synchrony in essays, correspondence, and poetry. Journal of Personality and Social Psychology, 99, 549–571.


Kacewicz, E., Pennebaker, J. W., Davis, M., Jeon, M., & Graesser, A. C. (2014). Pronoun use reflects standings in social hierarchies. Journal of Language and Social Psychology, 33(2), 125–143.

Kulesza, W., Dolinski, D., Majewski, R., & Huisman, A. (2014). The echo effect: The power of verbal mimicry to influence pro-social behavior. Journal of Language and Social Psychology, 33, 182–201.

Lakin, J., Chartrand, T., & Arkin, R. (2008). I am too just like you: Nonconscious mimicry as an automatic behavioral response to social exclusion. Psychological Science, 19, 816–822.

Li, J., Galley, M., Brockett, C., Gao, J., & Dolan, B. (2015). A diversity-promoting objective function for neural conversation models. CoRR, abs/1510.03055.

Li, J., Galley, M., Brockett, C., Gao, J., & Dolan, B. (2016). A persona-based neural conversation model. CoRR, abs/1603.06155.

Lord, S., Sheng, E., Imel, Z., Baer, J., & Atkins, D. (2014). More than reflections: Empathy in motivational interviewing includes language style synchrony between therapist and client. Behavior Therapy, 46.

McFarland, D. (2001). Respiratory markers of conversational interaction. Journal of Speech, Language, and Hearing Research, 44, 128–143.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. Proceedings of Workshop at ICLR, 2013.

Müller-Frommeyer, L., Frommeyer, N., & Kauffeld, S. (2018). Introducing rLSM: An integrated metric assessing language style matching in dyadic interaction. Behavior Research Methods, (pp. 1–17).

Newman, M. L., Pennebaker, J. W., Berry, D. S., & Richards, J. M. (2003). Lying words: Predicting deception from linguistic styles. Personality and Social Psychology Bulletin, 29(5), 665–675. PMID: 15272998.

Niederhoffer, K. G. & Pennebaker, J. W. (2002). Linguistic style matching in social interaction. Journal of Language and Social Psychology, 21(4), 337–360.

Noble, B. & Fernández, R. (2015). Centre stage: How social network position shapes linguistic coordination. In Proceedings of the 6th Workshop on Cognitive Modeling and Computational Linguistics (pp. 29–38).

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In EMNLP, volume 14 (pp. 1532–1543).

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners.


Ramanathan, S. & McGill, A. L. (2007). Consuming with Others: Social Influences on Moment-to-Moment and Retrospective Evaluations of an Experience. Journal of Consumer Research, 34(4), 506–524.

Rude, S., Gortner, E.-M., & Pennebaker, J. (2004). Language use of depressed and depression-vulnerable college students. Cognition & Emotion, 18, 1121–1133.

See, A., Roller, S., Kiela, D., & Weston, J. (2019). What makes a good conversation? how controllable attributes affect human judgments. CoRR, abs/1902.08654.

Stel, M. & Vonk, R. (2009). Mimicry in social interaction: Benefits for mimickers, mimickees, and their interaction. British Journal of Psychology, 101, 311–323.

Summerfield, J., Lepsien, J., Gitelman, D., Mesulam, M., & Nobre, A. (2006). Orienting attention based on long-term memory experience. Neuron, 49, 905–916.

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. CoRR, abs/1409.3215.

Tan, C., Sun, F., Kong, T., Zhang, W., Yang, C., & Liu, C. (2018). A survey on deep transfer learning. CoRR, abs/1808.01974.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L. u., & Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 5998–6008).

Vinyals, O. & Le, Q. V. (2015). A neural conversational model. CoRR, abs/1506.05869.

Wen, T., Gasic, M., Mrksic, N., Su, P., Vandyke, D., & Young, S. J. (2015). Semantically conditioned LSTM-based natural language generation for spoken dialogue systems. CoRR, abs/1508.01745.

Yabar, Y., Johnston, L., Miles, L., & Peace, V. (2006). Implicit behavioral mimicry: Investigating the impact of group membership. Journal of Nonverbal Behavior, 30(3), 97–113.

Yilmaz, G. (2016). What you do and how you speak matter: Behavioral and linguistic determinants of performance in virtual teams. Journal of Language and Social Psychology, 35(1), 76–97.

Zhang, S., Dinan, E., Urbanek, J., Szlam, A., Kiela, D., & Weston, J. (2018). Personal-izing dialogue agents: I have a dog, do you have pets too? CoRR, abs/1801.07243.

A Function word dictionary

Below, the function words belonging to each function word category are shown. The function words were compiled based on the BN Corpus, as used in work presented at NAACL'15 (Noble & Fernández, 2015).

{'articles': ['a', 'an', 'the'],

 'conjunctions': ['and', 'but', 'or', 'as', 'if', 'when', 'because', 'while',
    'where', 'although', 'whether', 'before', 'since', 'so', 'though', 'until',
    'after', 'cos', 'for', '&', 'nor', 'unless', 'once', 'whereas', 'whilst',
    'rather than', 'and/or', 'even when', 'albeit', 'given that', 'provided that'],

 'auxiliary_verbs': ['be', 'is', 'are', 'were', 'was', 'been', 'am', 'being',
    'have', 'has', 'was', 'were', 'would', 'will', 'do', 'can', 'could', 'dare',
    'does', 'did', 'had', 'having', 'may', 'might', 'must', 'need', 'ought',
    'shall', 'should', "'ll", "'d"],

 'personal_pronouns': ['i', 'you', 'he', 'they', 'she', 'we', 'who', 'them',
    'him', 'me', 'her', 'us', 'himself', 'themselves', 'someone', 'herself',
    'anyone', 'everyone', 'whom', 'myself', 'each other', 'yourself', 'no one',
    'somebody', 'nobody', 'everybody', 'anybody', 'his', 'mine', 'ourselves',
    'yours', 'one another', 'hers', 'no-one', 'ours', 'theirs', 'his', 'their',
    'her', 'my', 'your', 'our'],

 'impersonal_pronouns': ['it', 'its', 'they', 'that', 'this', 'them',
    'something', 'nothing', 'anything', 'itself', 'themselves', 'itself',
    'everything', "'em", 'each other', 'everything', 'something'],

 'quantifiers': ['all', 'some', 'any', 'many', 'more', 'another', 'much',
    'each', 'few', 'most', 'both', 'several', 'half', 'little', 'whatever',
    'less', 'enough', 'either', 'fewer', 'neither', 'a lot', 'least', 'a bit',
    'a great deal', 'plenty'],

 'prepositions': ['of', 'in', 'to', 'for', 'with', 'on', 'by', 'at', 'from',
    'as', 'into', 'about', 'like', 'after', 'between', 'through', 'over',
    'against', 'under', 'without', 'within', 'during', 'before', 'towards',
    'around', 'upon', 'including', 'among', 'across', 'off', 'behind', 'since',
    'rather than', 'until', 'according to', 'up to', 'despite', 'near', 'above',
    'per', 'along', 'away from', 'throughout', 'outside', 'round', 'beyond',
    'worth', 'down', 'on to', 'up', 'due to', 'inside', 'plus'],

 'adverbs': ['so', 'up', 'then', 'out', 'now', 'only', 'just', 'more', 'also',
    'very', 'well', 'how', 'down', 'back', 'on', 'there', 'still', 'even',
    'too', 'here', 'where', 'however', 'over', 'in', 'as', 'most', 'again',
    'never', 'why', 'off', 'really', 'always', 'about', 'when', 'quite', 'much',
    'both', 'often', 'away', 'perhaps', 'right', 'already', 'yet', 'later',
    'almost', 'of course', 'far', 'together', 'probably', 'today', 'actually',
    'ever', 'at least', 'enough', 'less', 'for example', 'therefore',
    'particularly', 'either', 'around', 'rather', 'else', 'sometimes', 'thus',
    'ago', 'yesterday', 'home', 'all', 'usually']}
