
Resolving references in memory-annotated dialogue through semantics


Academic year: 2021



Resolving references in memory-annotated dialogue through semantics

Claartje Barkhof (11035129)
Bachelor thesis, Credits: 18 EC
Bachelor Bèta-Gamma, Artificial Intelligence major

Supervisor: dhr. dr. T.O. Lentz
Institute for Language and Logic

Faculty of Science, University of Amsterdam
Science Park 904, 1098 XH Amsterdam


Contents

1 Introduction
1.1 Dialogue systems
1.2 From rule-based towards data-driven
1.3 Memory-enhanced dialogue systems
1.4 Resolving references: a task description
1.5 Deep learning with distributional semantics
1.6 Semantic evaluation
1.7 Siamese recurrent architecture
1.8 Approach

2 General methodology
2.1 Dataset
2.2 Preprocessing
2.2.1 Frame to vector
2.2.2 Text preprocessing

3 Experiment 1: template-based model making use of the MaLSTM model
3.1 Model description and motivation
3.2 Creating a frame history template
3.3 Network architecture, training protocol and evaluation
3.4 Results and discussion

4 Experiment 2: asymmetrical adaptation of the MaLSTM model
4.1 Motivation
4.2 Model description
4.3 Extra, hand-crafted features
4.4 From training to making predictions
4.5 Evaluation
4.6 Network configuration and learning protocol
4.7 Results and discussion

5 Conclusion

6 Limitations and future research

Appendices


Abstract

In human-to-human conversation, resolving references is a complex but intuitive process. Resolving references entails linking utterances to previously discussed topics. For a dialogue system to be able to exhibit such behaviour is an important step towards a computer conversing in a natural way about any topic. This thesis investigates whether deep learning architectures can approach such a task using distributional semantics as main input, which minimises the need for feature engineering and, therefore, maximises scalability. The Maluuba Frames corpus, annotated with semantic frames (i.e. formal terminology for topics, or user goals in goal-oriented dialogue) and references, is used as training data. A siamese recurrent architecture, referred to as the MaLSTM model, proposed by Mueller and Thyagarajan (2016) and trained for predicting semantic similarity between sentences, is used as a starting point for this exploration. The main reason to use this model is that resolving references can be interpreted as a semantic similarity evaluation, because referencing utterances carry direct or indirect semantic information about the frame that is referenced. In Experiment 1, a template-based model, making use of the MaLSTM for processing utterances, is presented. In Experiment 2, an asymmetrical adaptation of the MaLSTM is implemented. The model of Experiment 2 resolves references annotated in the Maluuba Frames corpus with an accuracy of 96%, a precision of 92% and a recall of 89%. All in all, this thesis presents a successful first step away from rule-based reference resolving systems towards data-driven systems requiring no feature engineering at all.


1 Introduction

1.1 Dialogue systems

Dialogue systems, also known as interactive conversational agents, are systems designed to converse with humans in natural language (Jurafsky & Martin, 2017; Serban, Lowe, Henderson, Charlin, & Pineau, 2015). Famous examples of dialogue systems are Apple’s SIRI, Google Assistant and Amazon’s Alexa. Although spoken dialogue systems are practical and have become popularised in hands-free environments (Asri et al., 2017), this thesis focuses on dialogue systems that converse in a textual manner only. This is valuable because dialogue systems often occur as text-based agents, generally referred to as chatbots, and because it bypasses issues connected to speech analysis and synthesis, such as errors that occur when recognising user input (Swerts, Litman, & Hirschberg, 2000).

Moreover, this thesis mainly deals with multi-turn dialogue. Organising a dialogue in a systematic way, a turn is defined as a conversational unit: a move from one agent in a dialogue. A multi-turn dialogue thus consists of multiple conversational moves from the involved agents. Turns in a chat dialogue, for instance, would be defined by the self-contained messages constituting the chat. In linguistics, the content of a turn, an uninterrupted sequence of text or speech, is called an utterance.

Dialogue systems can roughly be divided into two sub-classes: goal-oriented and non-goal-oriented systems. The former assists users with satisfying an information need or with performing a task. This kind of dialogue system often operates in a slot-filling manner: constraints are accumulated through conversation, in turn leading to a query which can be used to retrieve the desired information from a knowledge base or carry out a specific task (Asri et al., 2017). An illustrative example of a goal-oriented system is a chatbot assisting with booking a restaurant, filling slots such as ‘type of food’ and ‘price range’ until it has accumulated enough constraints to query a database and return a list of restaurants that meet those constraints. The latter of the two aforementioned classes is a class of dialogue systems that converse in a more open manner, without trying to reach a certain goal. An example of a non-goal-oriented system is ELIZA, a chatbot mimicking a Rogerian psychotherapist (Weizenbaum, 1966).
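The slot-filling loop described above can be sketched in a few lines. The slot names, the required-slot set and the function names are illustrative, not taken from any real system:

```python
def update_slots(slots, new_constraints):
    """Accumulate constraints from a user turn into the current slot set."""
    merged = dict(slots)
    merged.update(new_constraints)
    return merged

def ready_to_query(slots, required=("type_of_food", "price_range")):
    """The agent can query its database once all required slots are filled."""
    return all(slots.get(slot) is not None for slot in required)

slots = {}
slots = update_slots(slots, {"type_of_food": "Italian"})  # "I'd like Italian food"
assert not ready_to_query(slots)                          # price range still missing
slots = update_slots(slots, {"price_range": "cheap"})     # "Something cheap, please"
assert ready_to_query(slots)                              # enough constraints to query
```

Once `ready_to_query` holds, the accumulated constraints form the query against the knowledge base.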

An important, but complex task for goal-oriented dialogue systems is keeping track of topics discussed throughout a dialogue. A shared conception of context and topic is crucial for a natural conversational interaction (Grosz & Sidner, 1986). Dialogue systems therefore need a unit for tracking what has happened in the conversation, keeping in memory a context from preceding turns (Henderson, 2015). A desirable ability for dialogue systems related to keeping a notion of context is resolving references to previously discussed topics. References can appear in different forms, ranging from concrete textual anaphora (e.g. demonstrative pronouns) to more abstract references such as follow-up questions that relate to topics or entities that have been mentioned in an earlier stage of the dialogue. The task of resolving references is a sub-task of context resolution where user turns are interpreted in the context of the past dialogue, physical


space, time and knowledge about the world (Filisko & Seneff, 2003).

In human-to-human conversation, resolving references is a complex, but intuitive process. For a dialogue system to be able to exhibit such behaviour is an important step towards a computer conversing in a natural way with a human about any topic (Henderson, 2015; Lowe, Pow, Serban, & Pineau, 2015). A more specific application of dialogue systems with this ability can be found in the field of Information Retrieval. In practice, an information need is rarely expressed in just one question (Chai & Jin, 2004). A multi-turn dialogue can be helpful with meeting complex information needs. To be able to handle such a multi-turn dialogue, resolving references is crucial. Consider the example shown in Figure 1, which demonstrates the limitations of a system that is not able to resolve references.

Figure 1: An example of SIRI failing to handle references in a multi-turn dialogue (Jurafsky & Martin, 2017)

Another useful application of dialogue systems with this ability would be in situations where a Graphical User Interface would traditionally be used (Asri et al., 2017). Websites on which products or services (e.g. flights) can be explored are an example. A chatbot can overcome the limitations of visual space on a screen and would, therefore, be helpful with comparing different options. Going back and forth between options requires the ability to resolve references and is essential to satisfying more complex information needs.


1.2 From rule-based towards data-driven

The aforementioned example of a restaurant booking assistant is a conversational agent that functions within a narrow domain. Such agents are called frame-based agents. Due to the narrow domain, a logical representation of the conversation can in some cases be determined a priori and, if that is the case, the dialogue system can be programmed in a rule-based structure. This rule-based structure can then be seen as the outline of a decision-making process (Serban et al., 2015). Frame-based agents that are programmed in this way are often used for commercial purposes and are generally shown to work well (Jurafsky & Martin, 2017; Serban et al., 2015). The downside, however, is that a rule-based approach requires a lot of domain-specific handcrafting, which is labour intensive and hinders scaling up to new domains (Bordes, Boureau, & Weston, 2016).

Data-driven approaches, specifically machine learning techniques, can overcome these scaling challenges and have been shown to be highly successful in the field of natural language processing (NLP) (Hirschberg & Manning, 2015). This success is due to the increased availability of large corpora, computing power and improvement of machine learning techniques (Lowe et al., 2015). The most important class of machine learning techniques in the field of NLP is that of (artificial) neural networks (Hirschberg & Manning, 2015). This development towards data-driven techniques has reached the field of dialogue systems more recently and can develop further thanks to the increased availability of large corpora for this specific branch (Lowe et al., 2015) as well as the rise of task competitions (Hirschberg & Manning, 2015). Neural networks designed for dialogue systems vary from performing specific tasks which support dialogue systems, such as identification of named entities (Hirschberg & Manning, 2015), to systems that function in a fully data-driven manner, such as end-to-end dialogue systems that aim at generating responses to user turns with no pre-determined assumption of the dialogue structure (Bordes et al., 2016).

1.3 Memory-enhanced dialogue systems

As argued before, a dialogue system that has the ability to resolve references must have a notion of context. This notion of context can be viewed as a memory of topics that have been discussed. Hence, systems with this ability are called memory-enhanced dialogue systems (Schulz, Zumer, Asri, & Sharma, 2017). A task competition series that targets adding memory to goal-oriented dialogue systems is the Dialogue State Tracking (DST) Challenge presented by Microsoft. The concept of topic in non-goal-oriented dialogue is analogous to the user’s

goal in a goal-oriented dialogue. The DST Challenge asks participants to find a way to keep track of the user’s goal (dialogue state) at every moment within the conversation. More specifically, a system is to be built that tracks a distribution over constraints (i.e. slot-value pairs) to determine what the dialogue system should ask or say next to attain a query or task description. DST is a first step in the direction of memory-enhanced goal-oriented dialogue systems, but would not suffice to support resolving references in new turns, because the context


is continuously being overwritten instead of being built up sequentially. Said differently, one dialogue state is kept in memory at a time. To apply this to the previously described example of a restaurant booking agent: the user would not be able to compare restaurants of different kinds because a previous goal (e.g. restaurants that serve Indian food) is overwritten with a new one (e.g. restaurants that serve Italian food).

Asri et al. (2017) presented the frame tracking task in order to fill this gap. Frame tracking is a generalization of DST. With frame tracking, in contrast to DST, a full sequential history of context, or semantic frames, is accumulated. A semantic frame is defined by a set of constraints posed by a user at a certain point in the dialogue. Because the semantic frames are built up sequentially, this offers possibilities to develop techniques to resolve references. In Figure 1, in the first turn, a semantic frame is defined by restaurants that are close to the user. In the subsequent user turn, the user refers back to this semantic frame by asking “are any of them Italian”. This reference would be resolved by linking this question to the frame defined by restaurants in the user’s proximity. If the dialogue system were able to resolve such a reference, a subset of previously retrieved restaurants (i.e. the ones that are not only close by, but also serve Italian food) could be returned to satisfy the investigative information need of the user.
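The difference between DST’s single overwritten state and frame tracking’s growing history can be made concrete with a toy sketch. The slot names and values are fabricated for illustration only:

```python
# Frame tracking: each new user goal appends a frame, old frames stay reachable.
history = []
history.append({"location": "near user"})        # "find restaurants close to me"

# "Are any of them Italian?" references frame 1 (index 0). Resolving the
# reference lets the system refine that frame instead of replacing it:
referenced = 0
history.append({**history[referenced], "type_of_food": "Italian"})

assert history[0] == {"location": "near user"}   # the earlier goal is preserved
assert history[1]["type_of_food"] == "Italian"
```

Under DST, by contrast, the first dictionary would simply be overwritten, and comparing the two goals would no longer be possible.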

The frame tracking task is accompanied by an annotated corpus, the Maluuba Frames corpus. The corpus consists of dialogues which are recorded in a Wizard-of-Oz setting. In a Wizard-of-Oz setting, a human interacts with a dialogue system which the human believes to be an autonomous computer program, but which is in reality operated by a human. In this case, the dialogue system was believed to be a goal-oriented frame-based agent which assists with finding vacation packages based on the user’s constraints. The wizard had access to a database of vacation packages and the users were instructed to find packages matching certain conditions, such as destination cities and budgets, through a chat conversation with the wizard. For a more detailed description of the data collection, see the announcement paper by Asri et al. (2017). The Wizard-of-Oz method for data collection has two main advantages: the actual dialogue system does not have to be built and the dialogues consist of a more natural interaction than any dialogue system that is currently available would be able to exhibit (Asri et al., 2017).

In support of finding ways to design memory-enhanced dialogue systems, the most important feature of the Maluuba Frames corpus is the annotation of semantic frames in parallel with the dialogues. Per turn, a full history of semantic frames that have been created up until that point is stored. An example of such a frame history is shown in Figure 2.


Figure 2: An example of a frame history associated with a user turn

1.4 Resolving references: a task description

Next to the tracked frame histories, the Maluuba Frames corpus contains a variety of annotations, of which references are the most important for this thesis. References are annotated for user turns and refer to a semantic frame from the frame history of that dialogue. Notable about these references is that they are of varied kinds. Schulz et al. (2017) categorized references occurring in the Maluuba Frames corpus into three main categories:

1. A reference to a semantic frame is made by mentioning slot values directly:

   – “Ok, can we take another look at the Mannheim package then?”
   – “Leaving on the 17th is perfect! What options do I have there?”

2. A reference to a semantic frame is made not by mentioning slot values directly, but in an indirect way, making use of for example anaphora:

   – “That looks great. What are the hotels like?”
   – “What are the dates for this trip?”

3. A reference that is made implicitly, by for example accepting or declining an offer:

   – “No, everything’s ruined now. Thanks though.”
   – “Yes, please”

Because the corpus consists of dialogues collected in a Wizard-of-Oz setting, it is well suited to training data-driven models that attempt to overcome the limitations associated with rule-based systems. To be more specific, the dataset is fit


for training a reference resolving model, since the dataset exhibits exactly that behaviour. Because of this feature, supervised learning can be applied.

The main task that will be approached in this thesis is to resolve references in memory-annotated dialogue and is adapted from the frame tracking challenge (Asri et al., 2017) as follows:

Definition (reference resolving task). At each user turn t that is labelled with a reference to a frame f_i from an accessible frame history H = {f_1, . . . , f_{n_{t-1}}}, where n_{t-1} is the number of frames created so far in the dialogue, the goal is to predict which frame f_i is referenced by the user turn t. A user turn is defined by a textual utterance and a frame by a set of slot types and slot values (constraints).
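Under this definition, a resolver can be viewed as a scoring function over the frame history. The sketch below uses literal word overlap as a placeholder similarity; the learned, semantics-based models developed later replace exactly this component. Function names and the toy frames are illustrative:

```python
def similarity(utterance, frame):
    """Toy stand-in: literal word overlap between utterance and slot values."""
    tokens = set(utterance.lower().split())
    values = set(" ".join(str(v) for v in frame.values()).lower().split())
    return len(tokens & values)

def resolve_reference(utterance, frame_history):
    """Predict the index of the frame f_i referenced by the user turn."""
    scores = [similarity(utterance, frame) for frame in frame_history]
    return max(range(len(frame_history)), key=scores.__getitem__)

history = [{"dst_city": "Sendai"}, {"dst_city": "Mannheim"}]
assert resolve_reference("can we take another look at the mannheim package", history) == 1
```

This word-overlap baseline only handles references of type 1; types 2 and 3 are precisely where a semantic similarity model is needed.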

1.5 Deep learning with distributional semantics

Although neural networks were previously characterised as more suitable for overcoming scaling challenges than rule-based approaches, neural networks too can face scaling obstacles. Conventional neural networks often require careful feature engineering to make sure the objective can be achieved (Chollet, 2017; LeCun, Bengio, & Hinton, 2015). Put differently, the data has to be presented to the model in such a way that it can detect patterns to fulfil a certain task. Deep learning models have the potential to solve this problem. Deep learning models can learn complex functions from large amounts of raw data, due to their multi-level structure, and thus have the advantage of requiring little engineering by hand (LeCun et al., 2015). For that reason, this thesis will investigate whether deep learning architectures can be used to resolve references annotated in the Maluuba Frames dataset.

The Maluuba Frames dataset consists of textual dialogue and annotation. Many conventional neural network models treat words as a basic unit, mapping words to indices and using these indices as input features for the model that is to be trained. Mikolov, Chen, Corrado, and Dean (2013) argue that such an atomic treatment of text makes sense because of robustness and the fact that models can be simple, as long as enough data is used. This way of handling text is limited, however, in the sense that words bear more information than this atomic treatment is able to reflect. Mikolov, Chen, et al. (2013) show that it is now possible, with the current knowledge on how to train more complex models with huge amounts of data, to learn vector representations of words. These vectors are contained in distributional semantic models that carry high dimensional information about words. A trained version of such a model is called Word2Vec and is made publicly available.

Particularly interesting about vector representations of words is that they reflect multiple degrees of similarity between words. These similarities are in the first instance similarities in their syntactic usage, as one would expect, since these representations are learned from the contexts in which words occur. The second, less obvious, degree of similarity between words is a semantic similarity. A well-known example demonstrating this semantic degree is that the vector representations of ‘king’


and ‘queen’ differ approximately by the difference between the vector representations of ‘man’ and ‘woman’ (Le & Mikolov, 2014). The fact that vector representations of words reflect syntactic as well as semantic regularities makes them adequate as raw data for deep learning architectures to learn patterns at higher levels of abstraction. This has been demonstrated by successful NLP implementations working with vector representations, such as machine translation (Mikolov, Le, & Sutskever, 2013), information retrieval (Le & Mikolov, 2014), and question answering systems (Yu, Hermann, Blunsom, & Pulman, 2014). Applied to the main task of this thesis, the text from the Maluuba Frames corpus transformed into vector representations could be used as input data for models that predict which frames references refer to. It is important to note that this input data requires little engineering (but not none) and therefore can be seen as ‘raw’ data.
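The analogy can be reproduced with plain vector arithmetic. The four 4-dimensional vectors below are fabricated stand-ins for real 300-dimensional Word2Vec embeddings, chosen only so that the arithmetic is visible:

```python
import numpy as np

# Toy embeddings: the numbers are invented for illustration, not learned.
vec = {
    "king":  np.array([0.9, 0.8, 0.1, 0.0]),
    "queen": np.array([0.9, 0.1, 0.8, 0.0]),
    "man":   np.array([0.1, 0.8, 0.1, 0.3]),
    "woman": np.array([0.1, 0.1, 0.8, 0.3]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# 'king' - 'man' + 'woman' should land closest to 'queen'.
target = vec["king"] - vec["man"] + vec["woman"]
best = max(vec, key=lambda word: cosine(target, vec[word]))
assert best == "queen"
```

With a real Word2Vec model the same query is run over 300-dimensional vectors and a vocabulary of millions of words, but the mechanics are identical.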

1.6 Semantic evaluation

One could argue that resolving references can be seen as a semantic similarity evaluation, because referencing utterances carry direct or indirect semantic information about the frame that is referenced. Semantic similarity between bodies of text is of great interest in computational linguistics and remains a difficult task (Mueller & Thyagarajan, 2016). It is hard because both the semantic values of individual words and the syntactic structure of texts of varying length determine the semantic value of a text. Hence, it is not remarkable that a whole competition series, SemEval, is dedicated to evaluating computational systems

that analyse texts semantically.

An example of an utterance carrying direct semantic information about the frame it references is shown below in Figure 3. This is an example of a reference of type 1 (Section 1.4): there is literal overlap between text elements (i.e. ‘Sendai’ and ‘7’).

Figure 3: An example of a reference of type 1

An example of a reference carrying more indirect information about the frame it references is illustrated in Figure 4. This is an example of a type 2 reference (Section 1.4). These kinds of indirect references are more interesting to approach with distributional semantics, because atomic text treatment features would not suffice to resolve those. Looking at the frames in Figure 4 carefully, one can


determine that frame 2 is the frame most likely to be referenced by the utterance, because, unlike frame 1, it does not have an ‘origin city’ and has a ‘category’ slot which makes it more likely to be a hotel, while the other frame is more likely to be about a flight.

Figure 4: An example of a reference of type 2

The third reference type of Section 1.4 is expected not to be resolvable with just semantic input data. Other features may have to be added in order to resolve references of this type. Two main features will be examined: recency encoding and active frame encoding. Recency encoding entails giving information about the order of the semantic frames. Active frame encoding involves clarifying which frame is currently active (i.e. currently discussed). These features may also help resolve references of type 2: verbal structures like ‘the second package...’ or ‘this package...’ would possibly be more easily resolved with information on the order of the frames and information on which frame is currently being discussed.
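The exact form of these two features is not fixed at this point in the thesis, so the sketch below shows one plausible scheme, assuming a history of at most 9 frames: recency as a normalised position ramp and the active frame as a one-hot indicator.

```python
import numpy as np

MAX_FRAMES = 9  # assumed upper bound on the frame history length

def recency_encoding(n_frames):
    """Normalised positions: older frames get smaller values, unused slots stay 0."""
    enc = np.zeros(MAX_FRAMES)
    enc[:n_frames] = np.arange(1, n_frames + 1) / n_frames
    return enc

def active_frame_encoding(active_index):
    """One-hot vector marking the frame currently under discussion."""
    enc = np.zeros(MAX_FRAMES)
    enc[active_index] = 1.0
    return enc

assert recency_encoding(3)[2] == 1.0          # most recent frame scores highest
assert active_frame_encoding(4).sum() == 1.0  # exactly one active frame
```

Both vectors can simply be concatenated onto the semantic input so the network can learn when ordering or activity matters.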

Furthermore, although adding the above-mentioned features to the input for a deep learning architecture might increase the resolvability of some references, different types are not distinguished a priori and are to be learned simultaneously, to increase the robustness of the system.

1.7 Siamese recurrent architecture

To investigate this train of thought (interpreting the task of resolving references to frames as a semantic similarity evaluation), it is necessary to explore the applicability and fitness of models built for semantic similarity evaluation to the task.

An interesting architecture proposed by Mueller and Thyagarajan (2016) to evaluate semantic similarity between sentences is a siamese Recurrent Neural Network (RNN) architecture. RNNs are particularly suited for handling data that is inherently sequential (e.g. language). A traditional RNN is not adequate


for learning dependencies over long sequences. This deficiency was already reported in 1998 as the ‘vanishing gradient problem’ by Hochreiter. The siamese architecture of Mueller and Thyagarajan (2016), for that reason, implements the Long Short-Term Memory (LSTM) variant of RNNs, which solves this problem with an additional memory cell (Hochreiter & Schmidhuber, 1997). The two LSTMs both process a sentence, a sequence of words, producing a final hidden state that can be viewed as a summary vector of the sentence. The ℓ1-norm of the difference of the two summary vectors (i.e. final hidden states) is taken and proposed as the semantic similarity of the two sentences. For a substantiation of the decision to use the ℓ1-norm instead of the ℓ2-norm, see the publication of Mueller and Thyagarajan (2016). This model is referred to as the Manhattan LSTM (MaLSTM) model, and this abbreviation will be used throughout the thesis. An outline of the model taken from Mueller and Thyagarajan (2016) is shown in Figure 5.

Figure 5: An outline of the MaLSTM model (Mueller & Thyagarajan, 2016)

Mueller and Thyagarajan (2016) show that, after training on a sufficiently large amount of data, the MaLSTM model performs better than state-of-the-art models on evaluating semantic similarity between sentences. They show that intricate semantic meanings can be encoded in fixed-length vectors, incorporating the semantic value of the words as well as the syntactic structure that contributes to the meaning of the sentence. This could be specifically promising if the model is applied to resolving references: references can appear as semantic overlap between words, but also as characteristic syntactic structures such as anaphora. Besides its remarkable performance within the domain of sentence similarity evaluation and intuitive applicability to the domain of resolving references, practical considerations also make the MaLSTM an adequate model to use. Firstly, the simple siamese architecture of the model makes it easily adaptable to fit new domains (i.e. semantic frames). Secondly, a full implementation is made publicly available online for academic purposes. The implementation comes with pre-trained weights, which makes using the model computationally feasible.
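The similarity head of the MaLSTM is simple to state: given the two final hidden states h_left and h_right, the predicted similarity is exp(−‖h_left − h_right‖₁), which lies in (0, 1]. A minimal sketch of just this head (the LSTM encoders themselves are omitted):

```python
import numpy as np

def malstm_similarity(h_left, h_right):
    """exp(-||h_left - h_right||_1): 1 for identical summaries, toward 0 as they diverge."""
    return float(np.exp(-np.sum(np.abs(h_left - h_right))))

h = np.array([0.2, -0.5, 0.1])
assert malstm_similarity(h, h) == 1.0          # identical hidden states score 1
assert 0.0 < malstm_similarity(h, -h) < 1.0    # any difference lowers the score
```

The ℓ1 distance between the summary vectors is thus mapped monotonically onto a bounded similarity score, which is what makes the output usable as a similarity label.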

1.8 Approach

In summary, this thesis investigates whether the task of resolving references in goal-oriented dialogue, as defined in Section 1.4, can be approached with


deep learning architectures making use of semantic vector representations of words as input. In Section 1.6, it is argued that resolving references can be seen as a semantic similarity evaluation because the frame that is referenced and the referencing utterance often share semantic information. Because of this argumentation and for all of the reasons described in Section 1.7, the MaLSTM model (Mueller & Thyagarajan, 2016) is used as a starting point for approaching the task of resolving references. The Maluuba Frames corpus, which stores dialogues annotated with frame histories and references, is used as labelled training data. Considering the fact that vector representations carry semantic as well as syntactic information on words, they form a promising basis for deep learning to pick up on higher levels of abstraction that signify patterns in the data.

Although the final goal would be to train a fully scalable reference resolving deep learning architecture, this thesis presents a first step in that direction with architectures that are more scalable than rule-based systems, but still rely on domain-specific information to some degree. Two experiments, making use of the MaLSTM architecture in different ways, are presented.

Initially, a template-based model is introduced in Experiment 1: a full frame history with a maximum length of 9 and a user utterance are fit into a template vector, which serves as input for a deep neural network (in figures denoted with NN). The frames are encoded making use of Word2Vec and the utterance is encoded using one pre-trained (i.e. for predicting sentence similarity) LSTM from the MaLSTM model (Mueller & Thyagarajan, 2016). The task is interpreted for this first experiment as a multi-class classification: which frame in the frame history is referenced by a user turn? The accuracy of the model is evaluated against a random baseline. In addition, referring back to the argumentation on the resolvability of different types of references with distributional semantics in Section 1.6, it is tested whether the extra feature of active frame encoding improves performance.

It is important to point out that this experiment simplifies the task by setting a maximum for the number of frames that can be in memory. Also, the template is crafted for this corpus specifically and, as a consequence, for a specific domain (i.e. vacation packages). Despite this fact, little feature engineering is needed, since the template is based on vector representations of words, which makes it more scalable than traditional rule-based systems. The original model and the template-based model of Experiment 1 are shown schematically in Figure 6. The model on the left corresponds with the original MaLSTM model of Mueller and Thyagarajan (2016) shown before in Figure 5.

After thoroughly describing the model and evaluating the results of Experiment 1 in Chapter 3, a second architecture is presented in Chapter 4 (Experiment 2). Chapter 2 elaborates on some general methodological assets used in both experiments. Chapter 5 reports the conclusions of this thesis and Chapter 6, finally, discusses shortcomings of this thesis and hints at future work.


Figure 6: An overview of the original MaLSTM (Mueller & Thyagarajan, 2016) and the architecture of the model presented in Experiment 1

2 General methodology

2.1 Dataset

The Maluuba Frames dataset consists of 1369 dialogues and is annotated in an attribute-value structure. Not all attributes are used for this thesis and some will thus be ignored for reasons of simplicity. An overview of the relevant attribute-value substructure is given in Figure 7 and will now be described more thoroughly.

All dialogues are labelled with a unique id. The main content of the dialogue is a list of turns, which are all annotated with an author. The author alternates between the user and the wizard. Every turn has a text attribute which contains the utterance of the turn. Every turn also contains a set of special labels: frame history, active frame and acts. The frame history attribute consists of the frames created up until that turn. The active frame label is set to the number of the frame that is currently being discussed. The acts label consists of dialogue acts and references to frames that can be associated with these acts. The frames within the frame history are annotated as well. The relevant attributes are: frame number and info. The frame number is the number of the frame within the frame history and thus denotes the order of the frames.


Figure 7: An overview of the relevant attribute-value substructure of an annotated dialogue

The info attribute stores a summary of the frame. In other words, the info attribute contains a set of constraints which describe the user’s goal at a certain point in the dialogue. These constraints are in the form of slot type-slot value pairs.

The acts themselves consist of name, arguments and reference. The name attribute is the name of the dialogue act (e.g. inform), which will not be used for this thesis. The arguments attribute consists of slot type-slot value pairs that go with the dialogue acts, but those will be ignored too. The only aspect that is used from acts are the references.

References contain an annotations attribute and a frame number attribute. This frame number is important: it stores which frame is referenced by a certain turn and thus forms the basis of the desired output value for the proposed models. Since there can be several dialogue acts per utterance, there can also be several references. It is also possible for an utterance not to contain any references. Those two cases, none or multiple references, are not included in the dataset used for this thesis. The dataset eventually includes only user turns that reference exactly one frame of the frame history.
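The selection just described (user turns with exactly one reference) can be sketched as a filter over the annotated structure. The key names below are simplified stand-ins for the actual corpus JSON, which nests these attributes differently:

```python
def turns_with_single_reference(dialogues):
    """Keep only user turns whose dialogue acts carry exactly one frame reference."""
    selected = []
    for dialogue in dialogues:
        for turn in dialogue["turns"]:
            if turn["author"] != "user":
                continue  # wizard turns are never reference-labelled here
            refs = [ref["frame"] for act in turn["acts"] for ref in act["refs"]]
            if len(refs) == 1:
                selected.append((turn["text"], turn["frame_history"], refs[0]))
    return selected
```

Each selected item pairs an utterance with its frame history and the referenced frame number, i.e. one supervised training example.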

The dataset comprises 19986 turns, of which 10407 are user turns and 2909 are user turns with exactly one reference to one frame. In Figure 8, the distribution of the number of references per dialogue is displayed.


Figure 8: Distribution of the number of references per dialogue

2.2 Preprocessing

2.2.1 Frame to vector

For both the models of Experiment 1 and 2, it is necessary to encode frames into fixed-length vectors to use as input. As mentioned before, this will be done mainly by making use of the pre-trained distributional semantics model Word2Vec. The process is shown schematically in Figure 9.

Figure 9: An overview of the frame to vector procedure

Since every frame is defined by a set of slot-value pairs and there is a finite number of unique slot types, a template vector of a fixed length can be made. All the unique slot types are listed in Appendix A. Those slot types have been translated from an abbreviated form into normal form in order to be able to encode them with Word2Vec (e.g. ‘dst city’ is translated into ‘destination city’). There are 61 unique slot types. Slot types are word sequences of varying length, with a maximum length of 3. Values can consist of more than 3 words, but most consist of 1 or 2 words. A template vector is prepared as follows: for every unique slot type the space of 3 word vectors is reserved, and again the space of 3 word vectors for the value that is associated with that slot type. The vector representations of words are of length 300. This length is chosen because the MaLSTM model is trained with word vectors of the same size (Mueller & Thyagarajan, 2016). The total length of the vector is thus 300 × 61 × (3 + 3) = 109800. If a slot type is not contained in the set of constraints that define the frame, its space is left empty (filled with zeros). Also, slot types or slot values that are made up of fewer than 3 words are padded with zeros, so that all frame vectors are built up in the exact same way.
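A minimal sketch of this frame-to-vector template is given below. The `embed` function is a deterministic stand-in for the pre-trained Word2Vec lookup, and the slot list and dimensions are reduced for illustration (the thesis uses 61 slot types and 300-dimensional word vectors).

```python
# Hedged sketch of the frame-to-vector template of Section 2.2.1.
# `embed`, EMB_DIM and SLOT_TYPES are illustrative assumptions.

EMB_DIM = 4          # thesis: 300
MAX_WORDS = 3        # words reserved per slot type and per slot value
SLOT_TYPES = ["destination city", "origin city", "budget"]  # thesis: 61 types

def embed(word):
    # deterministic stand-in for a pre-trained word vector lookup
    return [float((hash(word) >> s) % 7) for s in range(EMB_DIM)]

def encode_words(text, max_words):
    vec = []
    words = text.split()[:max_words]
    for w in words:
        vec.extend(embed(w))
    # zero-pad sequences shorter than max_words
    vec.extend([0.0] * (max_words - len(words)) * EMB_DIM)
    return vec

def frame_to_vector(frame):
    """frame: dict mapping slot type -> slot value (strings)."""
    vec = []
    for slot in SLOT_TYPES:  # a fixed slot order gives a fixed-length template
        if slot in frame:
            vec.extend(encode_words(slot, MAX_WORDS))
            vec.extend(encode_words(frame[slot], MAX_WORDS))
        else:
            vec.extend([0.0] * 2 * MAX_WORDS * EMB_DIM)  # absent slot: zeros
    return vec

v = frame_to_vector({"destination city": "Santiago", "budget": "nineteen hundred"})
# total length = EMB_DIM * number_of_slots * (MAX_WORDS + MAX_WORDS)
assert len(v) == EMB_DIM * len(SLOT_TYPES) * 2 * MAX_WORDS
```

With the thesis' actual sizes the same formula gives the stated total: 300 × 61 × (3 + 3) = 109800.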

2.2.2 Text preprocessing

Since frames and utterances are both largely text-based, it is important to preprocess the text in an effective way to supply the models with meaningful input. The text has to be preprocessed to maximise the chance that the basic units, tokens, are included in the Word2Vec model. If tokens are not included in the Word2Vec model after preprocessing, they are skipped. Utterances and the textual content of frames are first stripped of punctuation using the String module of Python5. After that, all textual content is tokenized using nltk’s TweetTokenizer6, resulting in a list of tokens.

Numbers that are written in digit notation appear in the dataset and need to be translated into word notation (e.g. ‘19’ has to be translated into ‘nineteen’). This translation is essential to matching referencing utterances to frames, because numbers often appear as values such as dates and prices and can thus be crucial to the semantic similarity between the two. The publicly available Python module Inflect7, which includes a ‘number to words’ method, is used to perform this translation.
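A minimal sketch of this preprocessing pipeline is shown below. The thesis uses nltk’s TweetTokenizer and the inflect package; here a plain whitespace tokenizer and a tiny stand-in number map are used so the sketch is self-contained.

```python
import string

# Hedged sketch of the preprocessing of Section 2.2.2. NUMBER_WORDS is a
# stand-in for inflect's number_to_words; the tokenizer is simplified.

NUMBER_WORDS = {"19": "nineteen", "2": "two"}

def preprocess(utterance):
    # strip punctuation (the thesis uses Python's string module for this)
    stripped = utterance.translate(str.maketrans("", "", string.punctuation))
    tokens = stripped.split()  # thesis: nltk's TweetTokenizer
    # digits -> word notation so they can be looked up in Word2Vec
    return [NUMBER_WORDS.get(t, t) for t in tokens]

print(preprocess("The price is 19 dollars, right?"))
# -> ['The', 'price', 'is', 'nineteen', 'dollars', 'right']
```

Tokens that are still missing from the Word2Vec vocabulary after this step are simply skipped.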

3 Experiment 1: template-based model making use of the MaLSTM model

3.1 Model description and motivation

As described in the Introduction, the first experiment of this thesis approaches the task of resolving references to frames as a multi-class classification task. The model receives a frame history and a user turn as input and has to predict which frame of that history is referenced. The positions of frames in the history are thus interpreted as classes. This experiment simplifies the task by setting the maximum length for frame histories to 9 frames. 87% of all dialogues have a frame history of 9 frames or less. This share is considered large enough to test whether this simpler model can resolve references better than chance. A template vector is created to fit a frame history and an encoded utterance. This template is used as input for a deep neural network, which is trained to predict class probabilities. The utterance is encoded using one LSTM from the MaLSTM model (Mueller & Thyagarajan, 2016). The LSTM is pre-trained for the task of predicting similarity between sentences as described in Section 1.7. An utterance is provided to the LSTM word by word, resulting in a final hidden state vector. Subsequently, this fixed-length vector is fit into the template. An overview of this template-based process is outlined in Figure 10.

5 https://docs.python.org/2/library/string.html
6 http://www.nltk.org/api/nltk.tokenize.html
7 https://pypi.org/project/inflect/

Figure 10: Overview Experiment 1: template-based model making use of the MaLSTM model

In Section 1.6 it is argued that active encoding and recency encoding could be added as extra features to increase the discoverability of certain types of references. In this experiment the addition of recency encoding would be superfluous, because the frames of a history are fit into the template vector respecting the order of the frames; the resulting template vector thus indirectly carries this information already. Active encoding, on the other hand, is tested as an extra feature.

The aim of this experiment is to test, with a comprehensible architecture, whether deep learning can pick up on the right patterns in the data to predict which frame is referenced by a user turn. Although the template means the model is not automatically scalable to new domains, it is a first step away from rule-based systems towards data-driven systems requiring no feature engineering at all. Also, this experiment implicitly investigates whether the hidden state vectors produced by the MaLSTM carry enough semantic information as well as information on the syntactic structure of references. The performance of this model is expected to transcend the benchmark of random performance (see Section 3.3 for an elaboration on this baseline). The addition of active frame encoding is expected to increase the performance.

3.2 Creating a frame history template

As said, the pre-trained LSTM from the MaLSTM model of Mueller and Thyagarajan (2016) is used to process the sentences, and the frame-to-vector method from Section 2.2.1 is used to prepare the frames for the model. The frame vectors are concatenated and padded with zeros to create a fixed-length frame history vector. One frame vector is of length 109800 and the frame history vector is thus of length 9 × 109800 = 988200. If a frame history is, for instance, of length 7, the first 7 × 109800 = 768600 units of the vector are used for encoding the history and the remaining 988200 − 768600 = 219600 units are filled with zeros. The hidden state vector ∈ R^50 that the LSTM produces is concatenated to the frame history vector. The total length of the template vector that serves as input to the model is then 988200 + 50 = 988250. If active frame encoding is used as an extra feature, a one-hot encoded vector is used to encode which frame is active. If this feature is included, the total vector is extended by 9, resulting in a 988259-unit-long vector.
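The assembly of this template can be sketched as follows, with toy dimensions (the thesis uses frame vectors of length 109800, at most 9 frames and a hidden state of length 50):

```python
# Hedged sketch of the template assembly of Section 3.2, with toy sizes.

FRAME_LEN = 6    # thesis: 109800
MAX_FRAMES = 9
HIDDEN_LEN = 2   # thesis: 50

def build_template(frame_vectors, hidden_state, active_index=None):
    frames = frame_vectors[:MAX_FRAMES]  # at most 9 frames fit the template
    history = []
    for fv in frames:
        history.extend(fv)
    # zero-pad the history up to MAX_FRAMES frames
    history.extend([0.0] * (MAX_FRAMES - len(frames)) * FRAME_LEN)
    template = history + list(hidden_state)
    if active_index is not None:  # optional one-hot active-frame feature
        template += [1.0 if i == active_index else 0.0 for i in range(MAX_FRAMES)]
    return template

frames = [[1.0] * FRAME_LEN for _ in range(7)]        # a history of 7 frames
t = build_template(frames, [0.5] * HIDDEN_LEN, active_index=3)
assert len(t) == MAX_FRAMES * FRAME_LEN + HIDDEN_LEN + MAX_FRAMES
```

With the thesis' sizes the same construction yields the stated 988250 (or 988259 with active encoding) input units.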

3.3 Network architecture, training protocol and evaluation

As indicated before, only user turns that refer to exactly one semantic frame are included in the dataset. As a result, the classes are mutually exclusive and the labels are singular. To arrive at outcomes that match this type of classification, the neural network must have the same number of output nodes as there are classes. Instead of real-valued output, a probability distribution over the classes is to be obtained, in order to be able to make class predictions from the output. This can be achieved using the softmax function: the softmax function squashes all real-valued output nodes into the range [0,1] and ensures that, after applying it, all values add up to 1.

A neural network learns an objective by minimising a differentiable loss function. Categorical cross entropy H is used as the loss function for this single-label, multi-class classification task. Categorical cross entropy evaluates the difference between the true distribution over the categories and the predicted one (Chollet, 2017). That difference is 0 if the distributions are exactly the same and the predictions, as a result, perfect. The equation for this loss function is given below in Equation (1). yᵢ denotes the ground-truth label for class i and ŷᵢ the predicted probability for class i.

H(y, ŷ) = −Σᵢ yᵢ log(ŷᵢ)    (1)
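A minimal sketch of the softmax output and the categorical cross entropy loss of Equation (1), in plain Python for illustration:

```python
import math

# Hedged sketch of softmax and categorical cross entropy (Equation 1).

def softmax(logits):
    exps = [math.exp(x - max(logits)) for x in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_true, y_pred, eps=1e-12):
    # y_true is a one-hot ground-truth vector, y_pred a class distribution
    return -sum(t * math.log(p + eps) for t, p in zip(y_true, y_pred))

probs = softmax([2.0, 1.0, 0.1])
assert abs(sum(probs) - 1.0) < 1e-9               # softmax yields a distribution
assert abs(cross_entropy([1, 0, 0], [1.0, 0.0, 0.0])) < 1e-9  # perfect -> ~0
```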

The deep neural network consists of 2 hidden layers. The input layer is made up of 988250 or 988259 units (see Section 3.2), depending on whether the extra feature of active encoding is included in the input. The first hidden layer consists of 100 nodes and the second of 50. Due to computational complexity (one epoch of 1684 instances on a 2 GHz Intel Core i7 CPU takes approximately 70 seconds) and research time limitations, this architecture has been chosen as the deep architecture to work with; the choice is thus somewhat arbitrary. All layers are fully connected, which means that the nodes have connections to all the outcomes (activations) of the nodes in the previous layer. The activation function used for the hidden layers is the relu function, because it is non-linear, prevents gradients from vanishing and is widely used in deep learning architectures (Chollet, 2017). The network is trained using the unpublished RMSprop method (Hinton, Srivastava, & Swersky, 2013). Details about the configuration of the LSTM can be found in the paper of Mueller and Thyagarajan (2016).

The performance of the predictions of the model is evaluated using accuracy as a measure. Accuracy is simply defined as the fraction of predictions that lead to correct classifications. Before measuring accuracy, actual class predictions have to be made from the predicted probability distribution. This is done by choosing the class with the highest predicted class probability p:

choose i if p(fᵢ | utterance) = maxₖ p(fₖ | utterance)    (2)

The training procedure used is 10-fold cross validation. Before performing 10-fold cross validation, the data is split into a training (80%) and test (20%) set. For 10-fold cross validation, the training set is split up into 10 equally large sets. In every cycle, one of these folds is used as the validation set and the other 9 folds to train on. Since parameters are optimised while repeating this procedure on the (shuffled) training data and maximising the validation performance, knowledge about this validation data ‘leaks’ into the model. To avoid the risk of overfitting, performance is eventually evaluated by means of the test set that has been kept aside from the beginning of the training procedure and can thus serve to test performance on ‘unseen’ data. The 10 models that are trained during the 10-fold cross validation procedure are used to make predictions on the test set. The hypotheses for this experiment (stated in Section 3.1) are expressed in terms of performance relative to a random baseline. The random baseline is calculated by making predictions following the same testing procedure: 10 times, a random prediction of the same size as the test set is made. For every instance in the test set, a random distribution that adds up to 1 is ‘predicted’. The same decision function (2) is applied for making actual random class predictions.
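The decision rule of Equation (2) and the random baseline can be sketched as follows (the normalised random vector stands in for a ‘predicted’ distribution):

```python
import random

# Hedged sketch of the argmax decision rule (Equation 2) and the random
# baseline: draw a random 'distribution' and apply the same rule.

def predict_class(probabilities):
    # Equation (2): choose the class with the highest predicted probability
    return max(range(len(probabilities)), key=lambda i: probabilities[i])

def random_prediction(n_classes, rng):
    raw = [rng.random() for _ in range(n_classes)]
    s = sum(raw)
    return [r / s for r in raw]  # normalise so the values add up to 1

rng = random.Random(0)
dist = random_prediction(9, rng)
assert abs(sum(dist) - 1.0) < 1e-9
assert predict_class([0.1, 0.7, 0.2]) == 1
```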

3.4 Results and discussion

In Table 1, the performance of the 10 models resulting from 10-fold cross validation on the test set is displayed in the form of means and standard deviations. The random baseline is displayed together with the template-based model under two input variations: frame history input with the addition of active encoding as an extra feature and without. The main observation of this experiment is that the template-based model performs better than the random baseline, regardless of the addition of the extra, hand-crafted feature.

                          Accuracy       Cross entropy
Random performance        0.11 ± 0.00    15.15 ± 0.29
baseline                  c.i.: 0.000    c.i.: 0.183
Template-based model      0.54 ± 0.04    2.10 ± 0.05
(no extra features)       c.i.: 0.025    c.i.: 0.032
Template-based model      0.56 ± 0.04    2.09 ± 0.11
(with active encoding)    c.i.: 0.025    c.i.: 0.070

Table 1: Performance of the template-based model in comparison with the random baseline: mean, standard deviation and confidence interval (c.i.) of accuracy and cross entropy loss

To make judgements about the difference in performance between the template-based model that receives active encoding as additional input and the one that does not, we can look at confidence intervals (c.i.) for the means (shown underneath the means) of the accuracy and the cross entropy. The null hypothesis of normality is not rejected (P-values > 0.05) and it is thus assumed that the distributions of the two groups are normal for both performance measures. The confidence interval for a mean is calculated as displayed in Equation (3). N denotes the number of samples, which in this case is 10.

confidence interval = 2 × (standard deviation / √N)    (3)

The 95% confidence interval for both mean accuracies of the template-based model variations is 0.025, since they have the same standard deviation and sample size. Adding two times this interval to the lower accuracy mean, it can be seen that the higher accuracy mean falls within the confidence interval of the lower mean (0.59 > 0.56). In other words, the confidence intervals overlap. It follows that the means do not differ significantly and no statements can be made about a difference in performance between the template-based model that receives active encoding as an extra feature and the one that does not. The same calculation can be applied to the cross entropy means of the two groups and yields the same outcome: there is no significant difference between the cross entropy mean using no extra features and the cross entropy mean using active encoding as an extra feature.
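The calculation of Equation (3) and the overlap check used above can be reproduced in a few lines (N = 10 folds, standard deviation 0.04 as in Table 1):

```python
import math

# Hedged sketch of the confidence-interval calculation of Equation (3).

def confidence_interval(std, n):
    return 2 * std / math.sqrt(n)

ci = confidence_interval(0.04, 10)
# As in the text: adding two intervals to the lower mean (0.54) exceeds the
# higher mean (0.56), so the intervals overlap and no significant
# difference can be claimed.
assert round(ci, 3) == 0.025
assert 0.54 + 2 * ci > 0.56
```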

Figure 11 shows the (train and validation) loss and accuracy over 20 epochs for the template-based model for the two input variations. The lines denote the mean and the lighter areas the boundaries of the range over 10 folds.

Figure 11: Loss and accuracy during training over 20 epochs

The loss graphs show that the model overfits the data: the validation loss, after a small drop, continues to rise overall, while the training loss keeps on decreasing. This supports the suspicion that the model might be too complex for the data. In other words, ‘deep’ learning might not be necessary for this task given this template-based input; a shallower model might be sufficient. Nevertheless, more architectures and parameter configurations should be tested to support this idea, but this is beyond the scope of this thesis. The insight that such a simple architecture trained with template-based input is enough to exhibit better-than-chance performance suffices to proceed to the more intricate architecture presented in the following chapter. Apparently, the final hidden state vectors produced by the pre-trained LSTM provide meaningful features in terms of resolving references.

4 Experiment 2: asymmetrical adaptation of the MaLSTM model

4.1 Motivation

The model of Experiment 1, the template-based model, has been shown to perform better than chance on the task of predicting which frame a referencing user turn refers to. On the one hand, referring back to the previous elaborations on the desired scalability of reference resolving models, the template-based model can be seen as a starting point, because the input is mainly based on word vectors, which minimises feature engineering. On the other hand, there are two main reasons why this model is still quite limited in terms of scalability. The first reason is that the task has been simplified significantly by setting the maximum number of frames that can be present in the frame history to 9. The second reason is that the template structure of the input means that the word vectors are not that different from hand-crafted features. Put differently, since there is a limited number of slot types, and because these slot types are consistently positioned in a frame vector template, the slot types can be seen as features which are determined by the annotators of the Maluuba Frames corpus (Asri et al., 2017) because they appear often in this domain, but are not inherent to the problem space which this model is supposed to scale to (i.e. all goal-oriented dialogue).

Experiment 2 serves as a next step in this investigation towards fully data-driven reference resolving models. In addition, a slightly different point of view on the matter is taken. Where Experiment 1 focuses on performing the main task of this thesis in a practical sense, this experiment focuses, apart from performing the task, more on what has previously been described as interpreting the task of resolving references as a semantic similarity evaluation (Section 1.6). It has also been described that the MaLSTM model was originally designed to do exactly this for pairs of sentences. For those reasons, an architecture that is more similar to the original MaLSTM model is now presented to resolve references annotated in the Maluuba Frames corpus.

This experiment implements an asymmetrical adaptation of the MaLSTM model. Figure 12 displays the asymmetrical adaptation next to the original MaLSTM model, which again corresponds to the one displayed in Figure 5. In the case of Mueller and Thyagarajan (2016), the siamese recurrent architecture is used to compare symmetric domains: both LSTMs encode sentences. For this reason, the weights of the two LSTMs in the MaLSTM are tied. Tied weights mean that the two LSTMs are in fact the exact same model with the same weights and weight updates. In this thesis, the domains that are compared semantically are asymmetrical: a user turn (i.e. an utterance) is encoded on the one hand and a semantic frame, a set of slot-value pairs, on the other. To make the model fit asymmetrical domains, the weights should not be tied (Mueller & Thyagarajan, 2016).

Besides being asymmetrical, the domains are also of a different nature. Utterances are sequential in nature, while semantic frames are not: utterances are a series of words, frames are a set of constraints with no specific order. To heed this inherent lack of sequentiality in frames, this experiment implements a deep neural network instead of an LSTM to encode frames. The fact that the set of constraints is finite and of a computationally reasonable size means that a template for the input of the deep neural network can be constructed as described in Section 2.2.1.

This asymmetrical adaptation of the MaLSTM model measures the similarity between the final hidden state of the LSTM and the output of a deep neural network to decide whether a frame is referenced by an utterance.

Figure 12: Overview of the original MaLSTM model (Mueller & Thyagarajan, 2016) and the asymmetrical adaptation of Experiment 2

The task in this experiment is posed as a binary classification: does a user turn refer to a certain frame or not? This means that all frame-utterance pairs are evaluated without the context of the full history of frames. The semantic similarity is to be judged on the basis of only an utterance and a frame, not with the knowledge that other frames may be more or less similar than the one in question. It is expected that this makes it challenging to learn an appropriate decision function.

The objective is for the model to learn to choose the frame from a history of frames that has the highest similarity with an utterance as the referenced frame. Since such a decision function, making use of an argmax (arguments of the maxima) operator, is not differentiable, it cannot be used during training. The decision function can be applied after training, when making predictions on test data. Although it might be complex to learn the right objective, the task is approached in this thesis under the assumption that the dialogue system is memory-enhanced: frame-utterance pairs are only evaluated for potential references if a frame indeed exists in the frame history associated with that utterance.

It is hypothesised that frame-utterance similarity scores predicted by this asymmetrical adaptation of the MaLSTM model, with a decision function applied after training, will be sufficient to make predictions about references that outperform random performance (see Section 4.5). For reasons described in Section 1.6, it may be the case that more references of type 1 and 2 are classified correctly than of type 3. Type 3 may be classified more accurately with the addition of the recency encoding and active frame encoding features. The overall performance is thus expected to increase with the addition of these features.

4.2 Model description

As described in Section 4.1, the model of Experiment 2 is trained to predict a similarity between a frame and an utterance. For every user turn that references a frame, all frames of the frame history are added to the dataset. This results in data instances that are in fact frame-utterance pairs. The frame that is referenced by the utterance is labelled with a 1, while the other frames from the frame history are labelled with a 0. This handling of the data is shown in Figure 13. After the dataset of 2909 user turns that contain exactly one reference is transformed into a frame-utterance pair dataset, it consists of 16235 instances.
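This pair construction can be sketched in a few lines; the string frames below are placeholders for the encoded frame vectors of Section 2.2.1:

```python
# Hedged sketch of turning one referencing user turn into labelled
# frame-utterance pairs: the referenced frame gets label 1, the rest 0.

def make_pairs(utterance, frame_history, referenced_index):
    return [(frame, utterance, 1 if i == referenced_index else 0)
            for i, frame in enumerate(frame_history)]

history = ["frame_0", "frame_1", "frame_2"]
pairs = make_pairs("i'd rather go with the second package", history, 1)
assert [label for _, _, label in pairs] == [0, 1, 0]
```

Applied to all 2909 referencing user turns, this expansion yields the 16235 frame-utterance instances.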

Figure 13: Overview of how the data is handled in Experiment 2

Each frame-utterance pair is split before it is given as input to the model. The frame, after being processed into a vector (∈ R^109800) as described in Section 2.2.1, is passed through the deep neural network, resulting in an output vector y ∈ R^50. The utterance, once split into words and processed into word vectors (∈ R^300), is fed into the pre-trained LSTM. After the whole sequence of word vectors has been processed by the LSTM, the final hidden state h ∈ R^50 is taken and measured against the deep neural network output vector y in a similarity function. This process is shown in Figure 14.


Figure 14: Overview of Experiment 2: the asymmetrical adaptation of the MaLSTM model

4.3 Extra, hand-crafted features

Besides encoding the frames, extra features can be added to the frame vectors: recency encoding and active frame encoding. Recency encoding is a one-hot encoding of the number of the frame. The longest frame history in the dataset consists of 35 frames. If this feature is used, the feature vector is concatenated to the original frame vector, extending it by 35. In theory, a frame history can be infinitely long and this one-hot encoded feature might thus not be able to handle future, unseen instances. In practice, frame histories of such length are not expected to appear often and this limit is thus not expected to be too restrictive. Active frame encoding is achieved with a binary feature: if the frame is active, the feature is set to 1, and 0 otherwise. If active frame encoding is used, it is added to the original frame vector, extending it by 1. In total, 4 variations of input are tested: the frame vector with no extra features (∈ R^109800), the frame vector with active encoding (∈ R^109801), the frame vector with recency encoding (∈ R^109835) and the frame vector with both of the aforementioned features (∈ R^109836).
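The two feature extensions can be sketched as follows, with a toy base frame vector standing in for the 109800-unit frame encoding:

```python
# Hedged sketch of the extra, hand-crafted features of Section 4.3.

MAX_HISTORY = 35  # longest frame history in the dataset

def add_features(frame_vec, frame_number=None, is_active=None):
    vec = list(frame_vec)
    if frame_number is not None:  # recency: one-hot of the frame number
        vec += [1.0 if i == frame_number else 0.0 for i in range(MAX_HISTORY)]
    if is_active is not None:     # active frame: a single binary feature
        vec += [1.0 if is_active else 0.0]
    return vec

base = [0.0] * 4  # stand-in for the 109800-long frame vector
assert len(add_features(base)) == 4
assert len(add_features(base, frame_number=2)) == 4 + 35
assert len(add_features(base, frame_number=2, is_active=True)) == 4 + 35 + 1
```

With the real frame vector these lengths become 109800, 109835 and 109836, matching the input variations above.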


4.4 From training to making predictions

The similarity score ∈ [0,1] is calculated using the following simple function proposed by Mueller and Thyagarajan (2016):

f(y, h) = exp(−‖y − h‖₁)    (4)

The norm that is taken is the ℓ₁ norm (Manhattan distance):

ℓ₁(y, h) = ‖y − h‖₁ = Σᵢ₌₁ⁿ |yᵢ − hᵢ|    (5)

For training, the Mean Squared Error (MSE) between the similarity scores and the labels is taken as a loss function:

MSE = (1/n) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²    (6)

MSE is an absolute-error loss function, which is suitable to use since the similarity and the label are both in the same range [0,1]. Because this model is trained to predict similarity scores and does not yet make a binary classification prediction, a decision function is applied afterwards to predict whether a frame is referenced by an utterance or not. The decision function simply classifies the frame-utterance pair (fᵢ⁽ⁿ⁾, uₙ) with the highest similarity score within a group as 1 and the others as 0. fᵢ⁽ⁿ⁾ denotes the i-th frame within the frame history associated with utterance uₙ, and uₙ denotes the n-th utterance in the dataset of all user turns that are annotated with a reference to exactly one frame.

classify(fᵢ⁽ⁿ⁾, uₙ) as 1 if (fᵢ⁽ⁿ⁾, uₙ) = argmaxₖ similarity(fₖ⁽ⁿ⁾, uₙ), and as 0 otherwise    (7)

As said before, the decision function is non-differentiable and can thus not be used for training (Alpaydin, 2014). For that reason, the decision function will be applied after making similarity predictions on unseen data in the testing phase. This difference between training phase and testing phase is illustrated in Figure 15.
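The similarity function of Equations (4) and (5) and the post-training decision rule of Equation (7) can be sketched together, with toy 3-dimensional vectors standing in for the 50-dimensional outputs:

```python
import math

# Hedged sketch of the similarity score (Equations 4-5) and the argmax
# decision rule (Equation 7) applied after training.

def similarity(y, h):
    # exp(-||y - h||_1): identical vectors give 1, distant vectors approach 0
    return math.exp(-sum(abs(a - b) for a, b in zip(y, h)))

def resolve_reference(frame_outputs, hidden_state):
    # label the frame with the highest similarity as the referenced one
    scores = [similarity(y, hidden_state) for y in frame_outputs]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1 if i == best else 0 for i in range(len(scores))]

h = [0.2, -0.1, 0.4]                                 # LSTM final hidden state
frames = [[0.9, 0.3, -0.5], [0.2, -0.1, 0.4], [-0.3, 0.0, 0.1]]
assert similarity(h, h) == 1.0
assert resolve_reference(frames, h) == [0, 1, 0]
```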


Figure 15: Schematic visualisation of difference between training and predicting

4.5 Evaluation

As indicated before, the decision function maps the similarity scores to binary labels. Those binary predicted labels can be evaluated in the testing phase. Accuracy, precision and recall are used to measure the performance of the trained model on unseen data. Considering that per utterance a whole frame history, consisting mainly of negatively labelled frames, is put into the dataset, the dataset is highly skewed. The effect of this characteristic is that the accuracy measure can be deceiving: if all data points are predicted as negative by the model, the accuracy will be relatively high, but the performance can still be poor. Precision and recall overcome this issue by measuring, respectively, how many frames that are classified as referenced by a user turn actually are, and how many true references are correctly identified. The formulas for these evaluation measures are shown in Equations (8) and (9). TP stands for True Positives: correctly identified references between a frame and an utterance. FP stands for False Positives: frame-utterance pairs that are incorrectly classified as containing a reference. TN stands for True Negatives: frame-utterance pairs that are correctly classified as non-referencing. FN stands for False Negatives: frame-utterance pairs that are classified as non-referencing while a reference between them actually exists.

Precision = TP / (TP + FP)    (8)

Recall = TP / (TP + FN)    (9)
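Equations (8) and (9) computed on binary predictions and labels can be sketched as:

```python
# Hedged sketch of precision and recall (Equations 8 and 9) on binary
# frame-utterance predictions.

def precision_recall(predictions, labels):
    tp = sum(p == 1 and l == 1 for p, l in zip(predictions, labels))
    fp = sum(p == 1 and l == 0 for p, l in zip(predictions, labels))
    fn = sum(p == 0 and l == 1 for p, l in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# skewed toy data: mostly negative frame-utterance pairs, as in the dataset
labels      = [0, 0, 1, 0, 0, 1, 0, 0]
predictions = [0, 0, 1, 1, 0, 0, 0, 0]
p, r = precision_recall(predictions, labels)
assert p == 0.5 and r == 0.5
```

Note that in this toy example the accuracy would be 6/8 = 0.75 despite the mediocre precision and recall, illustrating why accuracy alone is deceiving on skewed data.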

As hypothesised before, the model is expected to exceed random performance. A random baseline for this experiment is created as follows: all frame-utterance pairs are assigned a random value in the range [0,1], serving as a random similarity prediction. The decision function (7) is applied to these random predictions. This seems equivalent to randomly picking one frame of a frame history as the referenced frame, but in that case no random MSE baseline could be calculated for evaluation.

4.6 Network configuration and learning protocol

The deep neural network is a densely connected network consisting of 2 hidden layers: the first layer consists of 100 nodes and the second of 70. The output layer consists of 50 nodes. Since the values of the final hidden state of the LSTM are in the range [-1,1] and are measured against the outcome of the deep neural network, the outcome of the deep neural network should be in the same range. For that reason, the tanh activation function, whose output ranges between the same values, is used for the output layer. The other layers again have a relu activation function. The network is trained using Stochastic Gradient Descent as the optimisation method and MSE as the loss function. Details about the LSTM configuration can be found in the paper of Mueller and Thyagarajan (2016).

The same training protocol is used as in Experiment 1. The only difference is that the data is not shuffled, to keep the frame histories intact.
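The layer layout of the frame-encoding network can be sketched as below, with toy layer sizes and random weights (the thesis uses layers of 100, 70 and 50 nodes); the point illustrated is that the tanh output layer keeps the result in [-1, 1], the same range as the LSTM hidden state:

```python
import math
import random

# Hedged sketch of the frame-encoding network of Section 4.6: two relu
# hidden layers and a tanh output layer. Weights are random and sizes are
# toy values, purely for illustration.

rng = random.Random(0)

def dense(x, n_out, activation):
    # one fully connected layer with freshly drawn random weights
    w = [[rng.uniform(-0.5, 0.5) for _ in x] for _ in range(n_out)]
    z = [sum(wi * xi for wi, xi in zip(row, x)) for row in w]
    return [activation(v) for v in z]

relu = lambda v: max(0.0, v)
out = dense(dense(dense([0.3, -0.2, 0.8, 0.1], 5, relu), 4, relu), 3, math.tanh)
assert len(out) == 3
assert all(-1.0 <= v <= 1.0 for v in out)  # tanh keeps the output in [-1, 1]
```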

4.7 Results and discussion

Table 2 displays the performance of the asymmetrical adaptation of the MaLSTM on the task of resolving references in terms of accuracy, precision, recall and MSE. None of the means presented in Table 2 are shown not to come from a normal distribution (again tested with a significance level of 0.05); normality is thus assumed for all means presented in the table. The confidence intervals (c.i.) of the means are inspected in the same manner as in the results of Experiment 1 (Section 3.4) and shown below the means and standard deviations.

The main result is that this model, no matter what input variation is used, performs significantly better than the random baseline. The accuracy, precision and recall are 96%, 92% and 89% on average using both extra features, recency encoding and active encoding, as extra input.

Overlap in confidence intervals (95%) between means in one column is highlighted with the same colour. The confidence intervals of the means of the input variation with no extra features and the input variation with only recency encoding overlap in all measures except for MSE (highlighted with blue). Thus, no significant difference between the performance of these two input variations exists. The same applies to the difference in performance between the input variation with active encoding and the one with all features: two out of four performance measure means have overlapping confidence intervals (highlighted with purple).


                        Accuracy       Precision      Recall         MSE
Random performance      0.75 ± 0.01    0.24 ± 0.02    0.24 ± 0.02    0.33 ± 0.01
baseline                c.i.: 0.006    c.i.: 0.006    c.i.: 0.013    c.i.: 0.006
Without extra features  0.80 ± 0.01    0.47 ± 0.02    0.46 ± 0.01    0.16 ± 0.00
                        c.i.: 0.006    c.i.: 0.006    c.i.: 0.006    c.i.: 0.006
With active encoding    0.96 ± 0.00    0.91 ± 0.01    0.89 ± 0.01    0.11 ± 0.00
                        c.i.: 0.006    c.i.: 0.006    c.i.: 0.006    c.i.: 0.000
With recency encoding   0.80 ± 0.01    0.48 ± 0.03    0.47 ± 0.03    0.15 ± 0.00
                        c.i.: 0.006    c.i.: 0.006    c.i.: 0.018    c.i.: 0.000
With recency encoding   0.96 ± 0.00    0.92 ± 0.00    0.89 ± 0.00    0.11 ± 0.00
and active encoding     c.i.: 0.000    c.i.: 0.000    c.i.: 0.000    c.i.: 0.000

Table 2: Performance of the asymmetrical MaLSTM adaptation in comparison with the random baseline: mean with confidence interval (c.i.) and standard deviation of accuracy, precision, recall and MSE

It can thus be observed that the addition of the recency encoding feature does not improve the overall performance, while the addition of the active encoding feature does. Especially the precision and recall are improved significantly by adding active encoding as an extra feature.

Figure 16 shows the development of the loss over training epochs. All input variations are plotted with the mean (line) and range (lighter filled areas) of the training and validation loss of the 10-fold cross validation procedure. The plots show that the model switches from underfitting to no longer underfitting somewhere in the middle of the total number of epochs. This point is characterised by the elbow shape of the lines. It is common that when a model stops underfitting, it starts overfitting, but here that is not the case. The training loss keeps on decreasing, but the validation loss does not increase from the elbow point onwards.


Figure 16: Training and validation loss of the model of Experiment 2 over 100 epochs for all input variations

5 Conclusion

In summary, this thesis has investigated whether references to semantic frames annotated in the Maluuba Frames corpus can be resolved making use of deep learning architectures and distributional semantics. This is relevant to explore because it could lead to scalable, data-driven models. Besides that, it is argued that the task can be interpreted as a semantic similarity evaluation. For that reason the MaLSTM model proposed by Mueller and Thyagarajan (2016), which is designed to evaluate sentence similarity, is used to approach this task in two experiments. As a recap, an overview of the original MaLSTM model and the models of Experiment 1 and Experiment 2 is outlined in Figure 17, from left to right in that order.


Figure 17: Overview original MaLSTM model (Mueller & Thyagarajan, 2016) and the models of Experiment 1 and 2

The first experiment presents a template-based model. In this experiment, the task is interpreted as a multi-class classification task: the model has to predict which of the (at most) 9 frames in history is referenced by a certain user turn. The user turn and frame history are fit into a template vector. The performance of the model exceeds random performance by approximately 45%. It has not been demonstrated that active encoding boosts the performance significantly. In addition, plots of the training procedure show that the model overfits the data, which supports the idea that a shallower architecture might be sufficient to approach the task as presented in this experiment.
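The template construction summarised above can be sketched as zero-padding the frame history to a fixed number of slots, with a one-hot target over those slots; the frame-vector size and the zero-padding scheme here are illustrative assumptions:

```python
import numpy as np

MAX_FRAMES = 9   # maximum number of frames in history
FRAME_DIM = 4    # illustrative frame-vector size (an assumption)

def build_template(frame_vectors, referenced_index):
    """Fit a variable-length frame history into a fixed template of
    MAX_FRAMES slots (unused slots zero-padded), with a one-hot
    multi-class target marking the referenced slot."""
    template = np.zeros((MAX_FRAMES, FRAME_DIM))
    template[:len(frame_vectors)] = frame_vectors
    target = np.zeros(MAX_FRAMES)
    target[referenced_index] = 1.0
    return template.ravel(), target

# Two frames in history; the user turn references the second (index 1).
x, y = build_template([[1, 0, 0, 1], [0, 1, 1, 0]], referenced_index=1)
```

A standard softmax classifier over the 9 slots can then be trained on `(x, y)` pairs, which is what makes this formulation a multi-class classification task.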

The scalability of this model is a first step in the direction of fully scalable models, but the fact that the template is purpose-built and thus domain-specific prevents the model from being fully scalable. It is noteworthy that, although the task has been simplified, the performance of the architecture is considerably better than chance. This makes it plausible that the hidden state vectors produced by the MaLSTM (Mueller & Thyagarajan, 2016) carry valuable semantic as well as syntactic information for this task.

The second experiment describes the implementation of an asymmetrical adaptation of the MaLSTM model. This model does not simplify the task and is thus more scalable than the model of Experiment 1. The model predicts similarities between frames and utterances to decide whether a frame is referenced by a user turn or not. The task in this experiment is interpreted as a binary classification task. A similarity function within the architecture causes the deep neural network to counteract the LSTM in producing vectors. In other words: when a frame is referenced, the output vector should be more similar to the hidden state vector representing the user turn than when it is not. The performance of this model remarkably exceeds the random baseline by 49% on accuracy, by 45%


on precision and by 43% on recall for the best input variation. The best input variation is to add both active encoding and recency encoding as extra features to the input. The addition of recency encoding does not have a significant effect on the performance, however. In practical applications, it is thus recommended not to use recency encoding.
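The similarity function at the heart of this comparison can be illustrated with the exponential negative Manhattan distance used by the original MaLSTM (Mueller & Thyagarajan, 2016), g(h_a, h_b) = exp(-||h_a - h_b||_1); the vector values below are illustrative, not taken from the experiments:

```python
import numpy as np

def malstm_similarity(h_a, h_b):
    """Similarity in (0, 1]: the exponential of the negative Manhattan
    (L1) distance between two vectors, as in the original MaLSTM.
    Identical vectors score 1; very dissimilar ones approach 0."""
    return float(np.exp(-np.sum(np.abs(np.asarray(h_a) - np.asarray(h_b)))))

# A frame vector close to the utterance vector scores higher than a distant one.
utterance = np.array([0.2, -0.5, 0.1])
referenced = np.array([0.25, -0.45, 0.1])   # similar: score near 1
unrelated = np.array([-0.9, 0.8, -0.6])     # dissimilar: score near 0
```

Because this score is differentiable, the network on the frame side can be trained to push its output vector towards the utterance's hidden state vector when the frame is referenced, and away from it when it is not.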

All in all, it has been shown that approaching the task of resolving references to semantic frames in dialogue with distributional semantics and deep learning architectures is a promising undertaking. First steps towards fully scalable, data-driven models are proposed, with models whose input requires minimal feature engineering. It has also been demonstrated that such a task can be interpreted as a semantic similarity evaluation, making practical use of the MaLSTM proposed by Mueller and Thyagarajan (2016).

6 Limitations and future research

Although the results are quite promising, it has to be noted once more that the performance of the model is tied to the corpus used. In other words, the model learns to predict references as annotated in this corpus, and generalisation to new dialogue has not been shown. This applies specifically to the results of Experiment 2. Apparently, the task as presented in this thesis is relatively easy to solve for this model, but it cannot be determined with certainty whether that is because the model is very fit for the task, or because the task is unambiguous in many cases. It could, for example, be the case that many references are of type 1 (i.e. with overlap in word use between a frame and an utterance) and therefore easy to resolve.

For that reason, it could be interesting to analyse the results further. Qualitative analysis could be performed: which types of references are resolved more accurately than others? This could give additional insight into which patterns the deep neural networks pick up on and which specific properties of the networks lead to the quantitative results displayed in the Results sections of Chapters 3 and 4. In addition, it would give more insight into which syntactic and semantic features of the utterances are captured in the final hidden state vectors produced by the LSTM taken from the MaLSTM model.

Additionally, as noted in both experiments, the deep neural network configurations are chosen through trial and error and are semi-trivial in the sense that not all options are systematically evaluated, due to the computational complexity of the algorithms. This applies to architectures, parameters and optimisation methods alike. With more time available, this could be explored further, especially for Experiment 1. In addition, the use of a GPU is recommended for such a project, to be less obstructed by the computational complexity of deep neural networks.

Aside from the above-mentioned options for extending this thesis, it could be interesting to conduct further research into resolving references using corpora that are not domain-specific. Semantic frames of a more diverse nature could potentially be more suitable for semantic similarity evaluation, since they would be more distinct from each other. Semantic frames from divergent domains are also closer to the notion of 'topic' in non-goal-oriented dialogue.

Lastly, the next step towards fully scalable, data-driven reference resolving techniques for dialogue systems, in line with the findings of this thesis, would be to implement a second LSTM in the architecture of Experiment 2 instead of a normal deep neural network. Although the semantic frames are not sequential in nature, LSTMs can deal with input of varying length, and thus with frames that are not fit into a template but simply fed into the model word by word, making use of only semantics.


References

Alpaydin, E. (2014). Introduction to machine learning. MIT press.

Asri, L. E., Schulz, H., Sharma, S., Zumer, J., Harris, J., Fine, E., . . . Suleman, K. (2017). Frames: A corpus for adding memory to goal-oriented dialogue systems. arXiv preprint arXiv:1704.00057 .

Bordes, A., Boureau, Y.-L., & Weston, J. (2016). Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683 .

Chai, J. Y., & Jin, R. (2004). Discourse structure for context question answering. In Proceedings of the workshop on pragmatics of question answering at hlt-naacl 2004.

Chollet, F. (2017). Deep learning with Python. Manning Publications Co.

Filisko, E., & Seneff, S. (2003). A context resolution server for the Galaxy conversational systems. In Eighth European Conference on Speech Communication and Technology.

Grosz, B. J., & Sidner, C. L. (1986). Attention, intentions, and the structure of discourse. Computational linguistics, 12 (3), 175–204.

Henderson, M. (2015). Machine learning for dialog state tracking: A review. In Proc. of the first international workshop on machine learning in spoken language processing.

Hinton, G., Srivastava, N., & Swersky, K. (2013). Lecture 6a: Overview of mini-batch gradient descent. Coursera Course.

Hirschberg, J., & Manning, C. D. (2015). Advances in natural language processing. Science, 349 (6245), 261–266.

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9 (8), 1735–1780.

Jurafsky, D., & Martin, J. H. (2017). Speech and language processing (3rd edition draft).

Le, Q., & Mikolov, T. (2014). Distributed representations of sentences and documents. In International conference on machine learning (pp. 1188– 1196).

LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. nature, 521 (7553), 436.

Lowe, R., Pow, N., Serban, I., & Pineau, J. (2015). The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems. arXiv preprint arXiv:1506.08909 .

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 .

Mikolov, T., Le, Q. V., & Sutskever, I. (2013). Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168 .

Mueller, J., & Thyagarajan, A. (2016). Siamese recurrent architectures for learning sentence similarity. In AAAI (pp. 2786–2792).

Schulz, H., Zumer, J., Asri, L. E., & Sharma, S. (2017). A frame tracking model for memory-enhanced dialogue systems. arXiv preprint arXiv:1706.01690 .

Serban, I. V., Lowe, R., Henderson, P., Charlin, L., & Pineau, J. (2015). A survey of available corpora for building data-driven dialogue systems. arXiv preprint arXiv:1512.05742 .

Swerts, M., Litman, D., & Hirschberg, J. (2000). Corrections in spoken dialogue systems. In Sixth international conference on spoken language processing.


Weizenbaum, J. (1966). Eliza—a computer program for the study of natural language communication between man and machine. Communications of the ACM , 9 (1), 36–45.

Yu, L., Hermann, K. M., Blunsom, P., & Pulman, S. (2014). Deep learning for answer sentence selection. arXiv preprint arXiv:1412.1632 .
