
MSc Artificial Intelligence

Master Thesis

Task-Oriented Dialog Agents Using

Memory-Networks and Ensemble Learning

by

Ricardo Fabián Guevara Meléndez

11390786

August 23, 2018

36 credits January - June 2018

Supervisor:

Maarten Stol

Assessor:

Shaojie Jiang


Abstract

Task-oriented dialog agents are increasingly relevant systems that engage with a user through voice or text input in natural language to fulfill domain-specific tasks. Recently, neural approaches have been gaining popularity over traditional rule-based systems thanks to promising results (Williams et al., 2017) on standard tasks (Bordes et al., 2016). While most modern approaches follow an architecture based on a pipeline of components (Young et al., 2013; Bocklisch et al., 2017; Williams et al., 2017), there is still much room for design in these architectures, and many promising models that can be incorporated into the pipeline. One such pipeline architecture is Hybrid Code Networks (HCN) (Williams et al., 2017), which uses example human-bot conversations to train an LSTM (Hochreiter and Schmidhuber, 1997) to track the conversation state and predict the next action. Inspired by the promising results of Bordes et al. (2016), this thesis builds an architecture similar to HCN but improves the action policy, first by using a Memory Network instead of an LSTM, and then by using an ensemble that combines both the Memory Network and the LSTM from HCN into a single action policy. By combining the benefits of the HCN architecture with the Memory Network and the ensemble policy, the model achieves perfect scores on bAbI task 5, an accuracy improvement of almost 4% over Bordes et al. (2016) and almost 8% over the LSTM action policy of Williams et al. (2017), while the ensemble policy by itself proved responsible for more than 1% improvement over the HCN architecture on bAbI task 6. The final results demonstrate not only the advantages of a Memory Network over an LSTM in some scenarios, but more importantly, that both policies complement each other and benefit from working together, even when trained on the same data.


Acknowledgements

This work would not have been possible without the help of my assessor, Shaojie Jiang. His expertise, expressed in the form of feedback as well as in setting the right priorities, dramatically increased the quality of the descriptions and results presented here.

Of equal importance was the role played by BrainCreators B.V., which provided the environment and day-to-day support to make this possible. Special thanks to my company supervisor Maarten Stol, who allowed and encouraged the creativity that enabled me to work on my research interests in Dialogue Agents.

Finally, I want to thank all the dear friends who constantly provided their support during the last two years, nearby and from afar. And especially my mother and Simón, without whom this long-awaited dream would not have been possible.


Contents

1 Introduction

2 Related Work
  2.1 Dialog State Tracking
  2.2 Neural Based Approaches
    2.2.1 From User Input to Features
    2.2.2 Action Policies

3 Theoretical Foundation
  3.1 Natural Language Understanding
  3.2 Dialog State Tracker
  3.3 Action Policy
    3.3.1 LSTM
    3.3.2 Memory Networks
    3.3.3 Policy Ensemble
  3.4 Natural Language Generator

4 Memory Networks and Ensemble Learning as Action Policy
  4.1 Input Features for the Action Policies
  4.2 Memory Networks as Action Policy
  4.3 Building an Action Policy Ensemble
  4.4 Designing and Extracting Intents and Actions
    4.4.1 Dataset Description
    4.4.2 How to get User Intents and Actions from the bAbI Tasks

5 Experimental Setup
  5.1 Implementation Details
    5.1.1 NLU
    5.1.2 Memory Network
    5.1.3 LSTM
    5.1.4 Stacking Ensemble
  5.2 Test Conditions
    5.2.1 NLU in Isolation
    5.2.2 Policy Performance

6 Results
  6.1 NLU Performance in Isolation
  6.2 Policy Performance
    6.2.1 Task 5
    6.2.2 Task 5 OOV
    6.2.3 Task 6

7 Conclusions
  7.1 Summary
  7.2 Future Work

A Task 6 bot templates


Chapter 1

Introduction

Dialog agents have been regarded both as an ultimate proof of machine intelligence and as a very hard task since before the term Artificial Intelligence (AI) was even coined, going back to Turing (1950). While there is still a long way to go before general-domain dialog agents can truly deceive a human, a simpler version can be achieved with varying degrees of success using current technology: task-oriented dialog agents. While many authors (Chen et al., 2017; Vlad Serban et al., 2015) make no further classification than task-oriented or non-task-oriented chatbots, Jurafsky and Martin (2018) use the term chatbot exclusively for open-domain or chitchat bots, while using the term 'frame-based dialog agents' for bots that fill slots by asking the user for information until they have sufficient input to perform a task and provide an answer. This fits the description of the models explored in this work, which will henceforth be referred to as 'task-oriented dialog agents'. These domain-constrained models are especially relevant for industries eager for automation (e.g. customer support, question answering or transaction processing). In these scenarios the agent only needs to deal with a constrained domain of possible actions, deciding on one after each user query in a sequence. This work is further constrained to text input only (as opposed to spoken dialog), where choosing an action means selecting a text template answer. This is appropriate for practical scenarios, since many such bots are featured in online chat windows on websites or leverage existing messaging platforms such as Facebook's Messenger or Telegram.

Until recently, task-oriented dialog agents traditionally worked as completely deterministic rule-based systems. This approach is not only expensive to engineer (and especially to maintain as the number of rules grows large), but also limited in the amount of context the agent can use to provide sensible answers: the number of rules grows dramatically as more context is considered for the next decision. In fact, many such agents just detect keywords and select an answer based on them (Bordes et al., 2016), in a Q&A fashion with no further context.

Today's availability of data and computing power allows for machine learning approaches that do more than detect keywords and act upon them. At the same time, framing this as the problem of choosing the right action at each step encourages the use of the wide variety of classifiers available. Young et al. (2013) model this problem as separate tasks handled by a pipeline of modules (see figure 2.1), including a natural language understanding module to handle the inputs from the user, a representation of the conversation state, a policy to decide the next action based on this state, and a natural language generator module to output an answer corresponding to the action. Several authors (Williams et al., 2017; Bocklisch et al., 2017) propose to use an LSTM (Hochreiter and Schmidhuber, 1997) as the action policy, learning the right action from example conversations. The Hybrid Code Networks (HCN) architecture (Williams et al., 2017) successfully integrates a neural approach to the action policy with domain-specific rules to guide the learning process. On the other hand, Bordes et al. (2016) proposed the use of a Memory Network (Weston et al., 2014) action policy, since this recent model family has achieved promising results in tasks related to sequence processing, even outperforming LSTM and other RNN architectures at language modeling (Sukhbaatar et al., 2015). This Memory Network action policy obtains the highest results among all the baselines and tasks in their experiments.

The goal of this thesis is to start from the HCN architecture and then improve it in three different ways, specifically:

• To propose Memory Networks as an action policy, due to their appropriate structure (further explained in section 3.3.2). Although this was already proposed by Bordes et al. (2016), they used a Memory Network to select an utterance from a set of several thousand at each turn in the dialog. This thesis, however, makes better use of the model by implementing the template actions from HCN instead, which dramatically reduces the number of classes from several thousand to less than a hundred. This achieves perfect scores on the toy dataset, improving the results obtained by Bordes et al. (2016) by 3.9% in action accuracy and by 50.6% in percentage of perfect dialogs.


• To build an ensemble action policy combining the LSTM from Williams et al. (2017) plus the Memory Network. This led to a consistent improvement in all test scenarios, and to new state-of-the-art results on the hard dataset (bAbI task 6), with more than 1% turn accuracy increase in some scenarios.

• To incorporate and show the benefits of using a Natural Language Understanding module to deal with the user inputs and compute features. While the input features used by Williams et al. are straightforward and effective, they put an extra burden on the action policy, since it has to learn from high-dimensional noisy inputs. An NLU, on the other hand, is a model specialized in dealing with natural language input. It produces a simple label that captures the meaning of the text, effectively freeing the action classifier from having to deal with all the different ways a user can express the same idea. This produces slight but consistent improvements in the toy dataset. As an added benefit, the NLU can also perform Named Entity Recognition (NER), a necessary task for slot filling (e.g. requesting an area and a date to book a hotel). Williams et al. rely on simple regular expressions to look for known names in the user text, but these cannot generalize to unseen entity values, which are a common occurrence. The NLU, on the contrary, leverages more sophisticated methods such as Conditional Random Fields (CRF), which generalize better to unseen entities.

This document is organized in the following chapters: §Related Work reviews other existing approaches for task-oriented dialog agents. §Theoretical Foundation explains the theory behind the models explored. §Memory Networks and Ensemble Learning as Action Policy explains the original contributions of this work in detail, the general design and the training data formatting. §Experimental Setup summarizes the research questions the models aim to answer, as well as the score metrics, the different test settings and specific implementation details of the models explored. The §Results chapter presents the scores and interesting observations obtained in the experiments. Finally, the §Conclusions chapter explains what can be inferred from those results.


Chapter 2

Related Work

Dialog agents are computer models aimed at natural language interaction with a user. According to their purpose, these systems are classified into two classes (Su et al., 2016):

1. open domain or chitchat bots: dialog systems that engage in a natural conversation with the user without a predefined goal. This scenario is commonly tackled with sequence-to-sequence models that build an entire phrase word by word, conditioned on a phrase from the user. Vinyals and Le (2015) is a famous example of this approach, achieving engaging and often interesting conversations with the user. More sophisticated approaches may use a whole ensemble of models to decide what to say; Serban et al. (2017) is a remarkable example of this technique. Chitchat bots, however, can hardly keep track of any meaningful context in the conversation and are therefore inadequate for performing tasks for the user, since a task usually results in a sequence of exchanges or dialog turns, making it highly context sensitive.

2. task-oriented chatbots: these dialog agents are meant to deal with a small domain, specializing in just a small set of possible requests from the user, and so their architectures are optimized to keep track of the dialog context. Normally they follow a more straightforward approach and just select the next phrase to say based on this context. This phrase can be a template with free slots that are filled in according to the dialog state. Young et al. (2013) review a series of approaches for this sort of statistical agent, stating the problem as a Partially Observable Markov Decision Process (POMDP), i.e. the next action is determined by an estimated current state. All these approaches can be seen as a pipeline of components. Figure 2.1 shows a general diagram for these agents (Chen et al., 2017; Young et al., 2013). Here the user input is processed by a natural language interpreter module (assuming the user inputs audio, which need not be the case). Its output is used to estimate a new conversation state, on which a policy acts to determine an action that is finally translated into language.


2.1 Dialog State Tracking

Task-oriented dialog agents are highly context sensitive: at each turn, their answers need to consider the previous turns in the dialog to make the right choice. In the pipeline architecture from Young et al. (2013), the module in charge of keeping all this dialog context is the Dialog State Tracker. In their work, the conversation state is a hidden discrete variable whose value is estimated at each turn in the dialog. On the other hand, Henderson et al. (2014b) use Recurrent Neural Networks (RNNs) to keep track of the state and decide the next action, so the dialog state tracker and the action policy overlap.

In the Rasa architecture (Bocklisch et al., 2017), there is an explicit dialog state tracking object containing information such as the current slot values, the past actions taken by the agent and the previous utterances from the user. At every point, the action policy has access to the information contained in this object to decide the next action.

The HCN architecture (Williams et al., 2017) tracks the slot values explicitly, just like Rasa, but dialog state tracking relies mostly on the memory capacity of its LSTM action policy, just like Henderson et al. (2014b). This is also the case in this thesis, since it is based on this architecture.

2.2 Neural Based Approaches

The Hybrid Code Networks architecture of Williams et al. (2017) is a hybrid machine learning/rule-based model that excels at the bAbI tasks, achieving state-of-the-art results. This raises the question of whether a fully statistical approach could fulfill these tasks and, if so, what must be changed. The experiments ahead will show that this is indeed possible, and that such an approach can excel at the bAbI tasks.

2.2.1 From User Input to Features

With machine learning methods comes a second question: which user input features to compute. Several ways to process the natural language input from the user into informative features are therefore tried in this thesis. Figure 2.1 implies the use of semantic features, often in the form of semantic classes or dialog acts along with their confidence levels. Lee (2013) also uses semantic features to estimate the dialog state with positive results, but is not conclusive on whether the success is due to the features, the model or both. On the other hand, Henderson et al. (2014b) explore the use of word-based features (i.e. disregarding the semantic classes from the NLU component and working with the user input words instead). By using n-grams detected by the Automatic Speech Recognition (ASR) module as features for an RNN, they achieve better results than by using the semantic classes computed from those words. They conclude that this is because word-based features keep the most information from the user input, allowing the model to learn any useful features from it, while using semantic classes inevitably takes information away. However, their results come from the Second Dialog State Tracking Challenge (DSTC2) (Henderson et al., 2014a), which is a highly noisy dataset where keeping all that extra information might be beneficial. This need not be the case for less noisy datasets. Along this line, HCN also uses word-based features, consisting mainly of Bag of Words (BoW) and word embeddings (Mikolov et al., 2013), achieving the state of the art on bAbI tasks 5 and 6.
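As a minimal sketch of the Bag of Words features mentioned above, assuming a fixed, pre-built vocabulary mapping words to indices (the vocabulary and utterance here are illustrative, not taken from the datasets):

```python
def bow_features(utterance, vocab):
    """Count occurrences of each known vocabulary word in the utterance."""
    vec = [0] * len(vocab)
    for word in utterance.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1  # out-of-vocabulary words are ignored
    return vec

vocab = {"book": 0, "a": 1, "table": 2, "thai": 3}
vec = bow_features("Book a Thai table please", vocab)  # → [1, 1, 1, 1]
```

Note how 'please', absent from the toy vocabulary, contributes nothing to the vector; word embeddings complement BoW precisely by providing features for such words.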

User input usually comes as text, mostly due to the use of chatbots on commercial websites and messaging platforms such as Messenger, Telegram or Slack. This by itself removes much of the noise in the input and makes semantic features a more attractive option, as evidenced by the number of available NLU services such as Microsoft's Luis, Amazon's Alexa, IBM's Watson or the recently created open source Rasa NLU (Bocklisch et al., 2017), of which Braun et al. (2017) provide an extensive comparison. They conclude that Luis performs better overall, but Rasa is not far behind. This thesis uses Rasa given the advantages of an open source solution, such as knowing the underlying models and allowing any customization needed (e.g. it uses a fully customizable pipeline of models to produce its outputs). There is also a Rasa Core module providing an entire framework based on HCN, claiming to be robust under scarcity of data. Their architecture is roughly the same as Young et al. (2013), as can be observed in figure 2.2.

2.2.2 Action Policies

Most machine learning approaches rely on some sort of RNN, such as Henderson et al. (2014b) or the LSTM used by both Williams et al. (2017) and Bocklisch et al. (2017). Bordes et al. (2016) provide several baselines including Information Retrieval (IR) techniques, but among non rule-based systems, their best (by a wide margin) and most interesting result was obtained by using a Memory Network action policy, like the one from Sukhbaatar et al. (2015). However, they do not provide results with semantic features, and the model predicts from among the 2407 utterances ever observed in the bAbI task 6 train, development or test data (4212 for task 5). This is a huge number, especially considering that most bot utterances are very similar except for the entities they mention (e.g. 'here it is resto tokyo expensive thai 4stars address' and 'here it is resto seoul expensive thai 7stars address').


Figure 2.2: Rasa task oriented chatbot architecture (Bocklisch et al., 2017)

This can be simplified by using templates with slots that can later be filled by a module that keeps track of the entities (such as the NLU). This is in fact what HCN does, resulting in just 58 possible action templates for task 6 and 16 for task 5 (without even using an NLU, but simple keyword matching on user inputs to detect slot values). This gap leads to the main focus of this work: the Memory Network action policy proved promising according to the results from Bordes et al. (2016), but it could be severely hampered by the lack of action templates in the policy. At the same time, HCN does use templates, but not a Memory Network that could potentially improve its performance. For instance, HCN achieves perfect scores on task 5, but only by using hard-coded rules, which makes it tempting to explore whether a fully machine learning based method could also achieve a perfect score, since hard-coded rules are expensive to develop and maintain. Moreover, a model could combine the results of HCN's LSTM policy and the Memory Network, using an ensemble just like Serban et al. (2017), an idea also explored by Henderson et al. (2014a). Finally, neither of those two models uses an NLU; both rely only on word-based features, which are easy to compute and perform well in noisy dialogs (Williams et al., 2017) but do not handle out-of-vocabulary (OOV) terms well.


Chapter 3

Theoretical Foundation

The proposed modified HCN architecture requires several components. Each of them is explained below.

3.1 Natural Language Understanding

An NLU module takes natural language input from the user and produces a dialog act out of it. Additionally, it can also provide Named Entity Recognition (NER) which is required for slot filling, a crucial part of the dialog agent architectures explored in this work (Bordes et al., 2016). Both aspects are explained below:

• Intent classification: Natural language deals with a huge search space, even under the limited domain of a task-oriented dialog agent. While the user can express an intention in many grammatical ways, there can only be a few such intentions in a constrained domain. Classifying this intention means taking a natural language sentence and assigning a class to it, as if determining what the user wants the agent to do (Bocklisch et al., 2017). This is therefore a classification task, and the Rasa framework (used in the experiments ahead) handles it by computing features representing the natural language input (e.g. word embeddings and Part of Speech (POS) tags). These features are then fed to a classifier, such as a Support Vector Machine (SVM).

• Entity recognition: A task-oriented dialog agent normally requires input values from the user in order to perform a task. For instance, it could require a date before booking a hotel room for the user. This implies that the bot recognizes the existence of the 'date' entity, and when the user provides a value for it, the agent needs to detect it. This task is referred to as slot filling (Jurafsky and Martin, 2018), and it can also be regarded as a classification task, but one that classifies word sequences instead of single elements. Detecting these slot values or entities is crucial, and techniques can be as simple as finding known phrases with regular expressions. Rasa NLU uses a Conditional Random Field (CRF), which can generalize to detect unknown entity values. To do so, it classifies each word using Beginning (B), Inside (I) and Outside (O) labels, as explained by Jurafsky and Martin (2018). Table 3.1 shows example IOB labels to train such a classifier. A B label indicates the first word of a slot value. If the slot value is composed of more than one word (as in south asian, for the example cuisine slot), subsequent words are labelled with an I label. Words that are not part of any slot value are labelled with an O. The intent label is present to make explicit that the same sentences used to train the entity recognizer are usually used to train the intent classifier as well.

Sentence:  please  find  a  south      asian      restaurant  in  London
Label:     O       O     O  B-cuisine  I-cuisine  O           O   B-location
Intent:    search restaurant

Table 3.1: Example labelled sentence for intent classification and slot filling using IOB labels
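Under the IOB scheme above, slot values can be recovered from the per-word labels with a single pass; the following is a minimal sketch (the function name and example are illustrative, not part of Rasa's API):

```python
def extract_slots(tokens, labels):
    """Group B-/I- labelled words into slot values; O words close any open slot."""
    slots, name, words = {}, None, []
    for tok, lab in zip(tokens, labels):
        if lab.startswith("B-"):          # first word of a new slot value
            if name:
                slots[name] = " ".join(words)
            name, words = lab[2:], [tok]
        elif lab.startswith("I-") and name:
            words.append(tok)             # continuation of the current slot
        else:
            if name:
                slots[name] = " ".join(words)
            name, words = None, []
    if name:
        slots[name] = " ".join(words)
    return slots

tokens = "please find a south asian restaurant in London".split()
labels = ["O", "O", "O", "B-cuisine", "I-cuisine", "O", "O", "B-location"]
slots = extract_slots(tokens, labels)
# → {'cuisine': 'south asian', 'location': 'London'}
```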

There are many modules available to perform this, like Facebook's Wit AI (https://wit.ai/) or IBM's Watson, ready to be used even as a cloud service. Further details on how this approach is used in this work are explained in chapter 4.


3.2 Dialog State Tracker

The architecture used in this thesis mostly follows Hybrid Code Networks: it relies on the memory of the action policy plus a separate component that explicitly keeps the slot values. In some of the experiments, the action policy is implemented as a Memory Network, which keeps a feature vector for each previous turn. This means the policy relies on itself to check the dialog context that enables the right choices. In later experiments, the policy is implemented by an ensemble of both a Memory Network and an LSTM, hence relying on the memory capacity of those models for dialog state tracking. In all these settings, the entities are always kept by a separate module, which feeds the action policies at each turn and also completes the values of the action templates (e.g. so that an action template such as 'there are no more restaurants which serve <cuisine> food' can be completed as 'there are no more restaurants which serve thai food').
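The slot-completion step described above amounts to simple placeholder substitution; a toy sketch (the real module additionally tracks which entity values were last mentioned):

```python
def fill_template(template, slot_values):
    """Replace each <slot> placeholder with its tracked entity value."""
    for name, value in slot_values.items():
        template = template.replace(f"<{name}>", value)
    return template

answer = fill_template(
    "there are no more restaurants which serve <cuisine> food",
    {"cuisine": "thai"},
)  # → 'there are no more restaurants which serve thai food'
```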

3.3 Action Policy

This section provides a brief summary of the action policies considered in this work, namely the LSTM and Memory Networks. Their specific use for task-oriented dialog agents is further explained in chapter 4.

3.3.1 LSTM

An LSTM is a well-known recurrent neural model for dealing with sequential data. In a dialog, each turn is represented by a feature vector or embedding, which is fed to the network. At each step, the LSTM action policy outputs a vector representing a distribution over actions. The action with the maximum value in this output vector is regarded as the selected action.

The features representing a dialog turn must contain sufficient information for the action policy to make the right decision. In the HCN architecture, the features representing a turn include information such as which slots are filled, the last action taken by the bot, and other features representing the words in the user utterance. The Rasa (Bocklisch et al., 2017) framework uses its LSTM action policy in a similar way, with the interesting difference that each token in the sequence contains features representing not just the last turn, but a fixed number of previous turns. Although the authors do not explain the reason for this design decision, this way of defining tokens could compensate for the memory loss suffered by RNNs: since the most recent turns are usually the most relevant in the dialog, keeping more than just the last one in the current turn's input could alleviate the memory loss.
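The Rasa-style token definition described above can be sketched as follows: each step's input concatenates the feature vectors of the last k turns, zero-padded at the start of the dialog (the window size k and the feature layout are assumptions made for illustration):

```python
import numpy as np

def windowed_features(turn_feats, k):
    """turn_feats: (T, d) array, one feature vector per dialog turn.
    Returns a (T, k*d) array where row t stacks turns t-k+1 .. t."""
    T, d = turn_feats.shape
    padded = np.vstack([np.zeros((k - 1, d)), turn_feats])  # pad early turns
    return np.stack([padded[t:t + k].ravel() for t in range(T)])

feats = np.arange(8, dtype=float).reshape(4, 2)  # 4 turns, 2 features each
windowed = windowed_features(feats, k=2)         # shape (4, 4)
```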

3.3.2 Memory Networks

The use of Memory Networks for dialog action prediction is a crucial aspect of this work. For this reason, and because they are a relatively recent model family, this section explains them in detail.

When dealing with sequential data, Recurrent Neural Networks (RNNs) (Goller and Kuchler, 1996) are possibly the most common approach, being the de facto option for sequential problems such as language modeling, POS tagging or machine translation. These well-studied models process the sequence token by token, usually making a decision at each step. Memory Networks (Weston et al., 2014) are a recent alternative that aims to tackle the tendency of RNNs to forget long-range dependencies (Hochreiter and Schmidhuber, 1997). It is important to realize that, like RNNs, memory networks are a family of models rather than a single model. This work explores the specific approach to Memory Networks taken by Sukhbaatar et al. (2015), since it is easier to train and better adapted to the dialog domain than the original work on the subject by Weston et al. (2014), and is therefore the one tried by Bordes et al. (2016).

Instead of keeping a single vector memory in charge of representing every token in the sequence so far, a Memory Network computes a representation or embedding for each token and considers them all at once when making a decision. This gives all tokens an equal chance to contribute to the answer. This is not the case with RNNs, where the most recently processed tokens have a higher chance of influencing the output, because all tokens share a single memory that keeps less information about the old tokens every time a new token is processed. Figure 3.1 shows a diagram of the Memory Network architecture adopted in this work, proposed by Sukhbaatar et al. (2015).

The input of the model is the current utterance from the user (q) and the conversation history, a sequence of feature vectors x_1, ..., x_M, each one representing the respective dialog turn. Here, M is called the memory size, a hyperparameter of the model: every time an utterance is added to the history, the oldest is discarded, so that no more than M memories are ever considered. In Sukhbaatar et al. (2015), the feature vector for q is assumed to be computed in the same way as those of the utterances in the history, so that the current q can be added to the history for the next turn's prediction without further operations. This is not strictly necessary, but it is also the way it is done in this work. An embedding is computed for q and for each utterance in the


Figure 3.1: Diagram of a Memory Network. (Sukhbaatar et al., 2015)

history by means of the embedding matrices B (d × h) and A (of the same dimensions as B):

u = qB

m_i = x_i A

Each of the M memory embeddings is compared with u by a dot product, and a softmax is then applied to produce a probability distribution that scores each of the candidate memories. A final fixed-size memory representation is obtained by computing another M embeddings of the history, c_1, ..., c_M, using a matrix C of the same dimensions as A and B.

The final embedding o is the average of the c_i, weighted by the attention weights p:

p_i = exp(m_i · u) / Σ_i' exp(m_i' · u)

c_i = x_i C

o = Σ_i p_i c_i

From here, all that remains is to map this to the output space, whose size is the number of actions V. Sukhbaatar et al. propose to take the sum o + u and map it to the V space simply by using a matrix W (h × V). They also report better results when applying an additional embedding to q with a matrix H (d × d), so this is the alternative chosen in this work.
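The single-hop forward pass described above can be sketched in a few lines. The dimensions and random inputs below are illustrative only, and for simplicity the extra matrix H is applied here to the query embedding u (so it has shape h × h), a slight rearrangement of the qH variant described in the text:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def memnet_forward(q, history, A, B, C, H, W):
    """One hop of an end-to-end Memory Network (Sukhbaatar et al., 2015).
    q: (d,) current-turn features; history: (M, d), one row per past turn."""
    u = q @ B                 # query embedding, shape (h,)
    m = history @ A           # input memory embeddings, shape (M, h)
    p = softmax(m @ u)        # attention weights over the M memories
    c = history @ C           # output memory embeddings, shape (M, h)
    o = p @ c                 # weighted memory summary, shape (h,)
    return (o + u @ H) @ W    # logits over the V template actions

rng = np.random.default_rng(0)
d, h, M, V = 6, 4, 3, 5
q, history = rng.normal(size=d), rng.normal(size=(M, d))
A, B, C = (rng.normal(size=(d, h)) for _ in range(3))
H, W = rng.normal(size=(h, h)), rng.normal(size=(h, V))
logits = memnet_forward(q, history, A, B, C, H, W)
action = int(np.argmax(logits))  # the selected action template
```

Multi-hop versions reuse this same step, feeding the pre-W output back in as the next query embedding.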

The model performs well up to this point, but the authors report performance gains on several tasks from applying a recursive step called a memory hop. This is akin to the model reconsidering and refining its answer. Instead of applying the final transformation with matrix W to the output o + u (or o + qH), a hop uses this term as the input for another round through the entire model, repeating this recursive step a fixed number of times and applying the final W transformation only at the last step. This multiplies the number of parameters of the model, raising the question of which ones to share or constrain. The authors report increased performance in many tasks, such as language modeling, with up to 6 hops. Regarding the number of parameters, this work adopts the so-called 'adjacent' approach from Sukhbaatar et al. (2015), in which all matrices are shared across hops except A and C: there are as many A matrices as hops, and in hop j, C_j = A_{j+1}. This is the same approach taken by Bordes et al. (2016). Figure 3.2 shows a diagram of the recursive operations in a Memory Network.

The history is the same across hops (and the number of hops is fixed for the model, not dependent on the current turn). In this diagram, the matrix H is omitted, and no constraint is explicitly enforced on any matrix. Note how the matrix W is applied only to the output of the final hop; only then is the final answer â computed.

3.3.3 Policy Ensemble

Using several action policies to predict an answer and leveraging all their predictions to compute a better one is what makes an ensemble, and this has been proven beneficial in many domains, including dialog agents


Figure 3.2: Diagram of a recursive Memory Network with 3 hops. (Sukhbaatar et al., 2015)

(Henderson et al., 2014a; Serban et al., 2017). Given a set of M models, each one computes a prediction in the form of a categorical distribution over the available actions. Let p_i be the prediction of model i for any given turn in the dialog. The immediate question is how to leverage the knowledge in each of the M such distributions to compute a better final answer. Henderson et al. (2014a) propose two methods, namely score averaging and stacking, which are explained below along with the simple highest-confidence approach.

Highest Confidence

The most straightforward solution is to let the model with the highest confidence decide:

action = argmax( max(p_1, ..., p_M) )

where the inner max is taken element-wise across the M distributions, and argmax selects the action with the highest resulting score.

In the case where one model is right more often than the rest or if some subset of models complement each other by being correctly confident where the others are not, this approach should improve overall prediction accuracy.

Average Prediction

An equally simple but often smarter approach is to take the average across all predictions and use that as the final categorical distribution:

action = argmax_a (1/M) Σ_{i=1}^{M} p_i(a)

As argued by Henderson et al. (2014a), as long as each model's predictions are correct more than half the time and their errors are not correlated, average prediction is expected to improve performance. Using different model architectures encourages decorrelation, and using different training data is a simple way to help decorrelate as well.

Stacking

Instead of hand-crafting a rule to combine the predictions, yet another model can be used to learn from them. This idea comes from Wolpert (1992), who calls it stacked generalization.

action = f(p_1, ..., p_M)

where f is a learned function.

This approach has the advantage that it can potentially learn whatever features are required to get the right answer given the independent predictions of each underlying policy, therefore achieving better results than averaging, as is the case in Henderson et al. (2014a). However, it requires an extra dataset to train this separate model.
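The three strategies can be summarized in a few lines of plain Python (a sketch with hypothetical function names; each p_i is a policy's softmax output over actions, and the stacking classifier f is assumed to be trained separately):

```python
def highest_confidence(predictions):
    """Pick the action from the single most confident policy."""
    best = max(predictions, key=max)   # distribution with the highest peak
    return best.index(max(best))

def average_prediction(predictions):
    """Average the M distributions and pick the argmax."""
    M = len(predictions)
    n_actions = len(predictions[0])
    avg = [sum(p[a] for p in predictions) / M for a in range(n_actions)]
    return avg.index(max(avg))

def stacking(predictions, f):
    """Stacked generalization: f is a separately trained classifier
    mapping the concatenated predictions to an action index."""
    features = [x for p in predictions for x in p]
    return f(features)
```

Note that highest confidence and averaging need no training at all; only stacking requires the extra held-out data discussed above.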


3.4 Natural Language Generator

In open domain chitchat bots, natural language generation is a complex task, requiring a model of the utterance posterior probability given (at least) the dialog state (Vlad Serban et al., 2015), a task often solved through generative language models like Sequence to Sequence (Vinyals and Le, 2015). On the contrary, task oriented dialog agents do not need to produce creative answers, only to fulfill a given goal. Therefore, generating an answer often takes no more than the policy deciding on a message template with unfilled slots (e.g. 'Hotel La belle ville is in <city>') and then filling those slots with the values tracked in the Dialog State Tracker module before outputting the answer. This approach is used by many authors such as Williams et al. (2017); Bocklisch et al. (2017); Vlad Serban et al. (2015) and it is the one used in this work as well.
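As a concrete illustration, slot filling over such a template can be as simple as the following sketch (the `<slot>` syntax mirrors the bAbI templates; the function name and the behavior of leaving unknown slots untouched are assumptions of this sketch):

```python
import re

def fill_template(template, slots):
    """Replace each <slot> in the template with its tracked value.
    Slots with no tracked value are left as-is for the caller to handle."""
    return re.sub(r"<(\w+)>",
                  lambda m: str(slots.get(m.group(1), m.group(0))),
                  template)
```

For example, `fill_template("Hotel La belle ville is in <city>", {"city": "Paris"})` yields `"Hotel La belle ville is in Paris"`.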


Chapter 4

Memory Networks and Ensemble Learning as Action Policy

This chapter goes deeper into the core contributions of this thesis. Section 4.1 lists and explains the features compared in the experiments. Section 4.2 explains how a Memory Network can be used as action policy for a task-oriented dialog agent, highlighting its differences with an LSTM. Section 4.3 continues with how to build an ensemble of action policies with the Memory Network and the LSTM. Finally, section 4.4 covers the data used and how it was formatted to be compatible with the architectures used in this thesis.

4.1 Input Features for the Action Policies

One important contribution of this work is the design of task specific user intents and the use of an NLU to compute this feature and add it to the HCN architecture. This feature is directly compared to the default features from HCN. While the HCN architecture uses word-based features, user intent is a common feature provided by most popular NLU cloud services (Braun et al., 2017). Rasa NLU (Bocklisch et al., 2017) is the one chosen for the experiments performed in this work, because it is completely open source, highly customizable and effective, as can be seen in chapter 6. An advantage of user intents is that they can be shared across domains, increasing the availability of training data. Another potential advantage is that they are less noisy than word-based features, making it easier for an action policy to learn from them. This work explores the value of using an NLU to obtain the user intent feature in comparison to the HCN word-based features, as well as the NLU entity recognition in comparison with HCN regular expression based pattern matching. The following list explains the features used in the experiments ahead and how to compute them. Most of the list consists of features taken from the HCN architecture, while the user intent and the way entity flags are computed are direct contributions of this thesis.

• User intent: this is implemented as a vector whose size equals the total number of intents defined. In this work, this feature is computed by Rasa NLU by using labelled training data to feed an SVM as intent classifier. The exact set of intents to use is domain dependent and it is a design decision. Section 4.4.2 explains the intents used on each task explored in this thesis.

• Entity flags: these features consist of binary flags indicating the presence of an entity in an utterance. They could either be provided by Rasa NLU (using a CRF) or by using regular expressions to detect known phrases in the utterances, which is the approach used by HCN. Both options are compared in the experiments.

• Bag of Words: a binary vector of vocabulary size, plus one extra bit to accommodate unknown words at test time. Each entry corresponds to a word in the vocabulary and is set to 1 if the user utterance includes that word. This is easy to compute, with the major drawback of losing word order and disregarding word similarity (e.g. a test input could be very similar to a train input except for a word replaced by a synonym; the BoW approach would disregard this similarity). Another major drawback is that it performs poorly with unknown and low frequency words.

• Word Embeddings: these features fix most flaws of BoW features, since each word is now represented by a vector holding semantic information. There are several implementations of word embeddings; HCN uses those from Mikolov et al. (2013), specifically the Google News 300 dimensional embeddings trained with a 3M word vocabulary. The sentence embedding is simply calculated by averaging the word embeddings. This approach still loses word order, but the impact this has in a task oriented domain is negligible, as opposed to an open domain full of nuances and sentiments. The word embeddings along with the BoW features are commonly referred to as 'word based features' in this work.

• Turn: an integer representing the index of a turn in the dialog. This feature is especially important for a Memory Network action policy, since unlike an RNN it does not otherwise consider the order of its memories. Sukhbaatar et al. (2015) even propose to learn a time-embedding matrix, with one row for each turn index that can be considered; this time embedding is added to the input features and learned during training. In this work, a plain integer is preferred: it is sufficient to represent the order information, the benefit of time embeddings is not justified by the authors, and a time embedding matrix puts a limit on the number of turns that can be considered in a dialog, as it needs to add rows accordingly. Another design decision concerning the turn feature is whether to encode it as a binary number or an integer. The integer encoding introduces a bias when the action policy learns with back-propagation, since higher numbers obtain a training boost. But since in a dialog, higher numbered turns (i.e. more recent turns) usually have more influence on the dialog context (this is studied in section 6.2.3), this bias can be beneficial.

• Bot previous action: a one-hot vector indicating the previous action of the dialog agent. This action could be the ground truth last bot action or the actual last predicted action, with significant consequences on performance that are explored in the experiments. For more details about these two possibilities, see section 5.2.2.

• Context flags: The HCN architecture includes special context bits in the input, which are meant to provide context about the dialog. These features are highly dependent on the restaurant booking domain of bAbI task 5 and task 6. The complete list of such flags is:

1. presence of each entity in the dialog state. Do note this is different from the features that indicate presence of each entity in the current utterance, thus HCN uses two sets of features dedicated to entities. This is the only context feature that HCN uses for bAbI task 5, while task 6 uses all the context features on this list.

2. whether the database has been queried yet. It is not used in this work since it contains almost the same information as the ‘results presented’ context feature explained below, except for marginal scenarios where the database has no results for the given query.

3. whether there are any restaurants to offer given the current filters. Williams et al. (2017) implements this as two binary flags to represent the two possibilities as 1 0 and 0 1.

4. whether any results have been presented so far.

5. whether all results have been exhausted for the current query (this one uses two bit flags as well).

6. whether the cuisine type is unknown (which happens often in the bAbI dialogs).

7. whether the query yields any results in the training set. In this work, this feature was disregarded since it proved to add little to no value during hyper-parameter optimization, arguably because of the differences between the train and test bots.

This thesis compares the HCN word based features with the NLU provided user intent feature. The rest of the features are common to both settings.
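To make the feature layout concrete, the following sketch assembles one turn's input vector from the features described above (plain Python; all names, sizes and the concatenation order are illustrative assumptions, not the exact implementation):

```python
def one_hot(index, size):
    # 1-of-size encoding; out-of-range index yields the all-zeros vector
    v = [0.0] * size
    if 0 <= index < size:
        v[index] = 1.0
    return v

def bag_of_words(utterance, vocab):
    """Binary BoW with one extra slot for out-of-vocabulary words."""
    v = [0.0] * (len(vocab) + 1)
    for w in utterance.lower().split():
        v[vocab[w] if w in vocab else len(vocab)] = 1.0
    return v

def turn_features(utterance, vocab, intent_id, n_intents,
                  entity_flags, turn, prev_action_id, n_actions):
    """Concatenate the per-turn features into one fixed-size vector."""
    return (one_hot(intent_id, n_intents)        # user intent
            + [float(f) for f in entity_flags]   # entities in current turn
            + [float(turn)]                      # turn index as an integer
            + one_hot(prev_action_id, n_actions) # bot previous action
            + bag_of_words(utterance, vocab))    # word based features
```

In the NLU setting the BoW block would be dropped in favor of the intent vector, and in the HCN setting the reverse; the sketch simply shows both side by side.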

4.2 Memory Networks as Action Policy

The main goal of the action policy is to decide the next action. Recurrent Neural Networks are a popular choice because their memory capacity enables them to fulfill both the state tracking and action policy tasks. Memory Networks also possess memory capacity, and in some tasks, such as Q&A or language modeling, they have achieved state-of-the-art results, even outperforming LSTM based approaches (Sukhbaatar et al., 2015).

Inspired by Bordes et al. (2016), this work proposes the use of a Memory Network as action policy. The approach presented here differs however in two crucial aspects.

1. Action templates: Bordes et al. (2016) considers every possible bot utterance in their domain as an action to classify on, which results in thousands of actions for their experiments. The model presented here considers action templates instead, just like Williams et al. (2017). An action template is a bot utterance with some slots to be filled in after the prediction is performed. This results in just tens of possible actions for the same experiments used by Bordes et al. (2016). Therefore, the Memory Network has an easier task at hand, which allows it to perform even better.


2. NLU input processing: while other authors make use of word-based features such as Bag of Words or word embeddings (Williams et al., 2017; Bordes et al., 2016), this thesis experiments with using a Natural Language Understanding module to process the raw user text input and produce a semantic feature, namely a one-hot vector indicating the user intention. This intent vector is easier for the action policy to deal with, albeit losing some information. The later experiments test this effect both in artificial toy scenarios and in real human-bot conversations under high noise conditions.

Other than the above, the usage of a Memory Network for this task is the same as in Bordes et al. (2016). That is, every conversation turn is converted into a fixed size vector of features. At every turn the network keeps track of the history of such vectors, up to a fixed length. This history is composed of the x_i sentences in figure 3.1. The user utterance at the current turn corresponds to the question q. Consider for instance that at a given turn, q is a vector of features representing the utterance 'I want a restaurant please'. If the user has not provided any slot values yet, this will be visible in the history. The network can conclude from this that the best next action is to ask for a slot value, for instance 'what kind of cuisine would you like?'. Otherwise, if enough information is present in the history, the policy can conclude that the best action is to search for a restaurant and offer it to the user.

The history of previous turns plus the current turn vector of features is the input of the Memory Network at each turn. Its output is then an action from the set of action templates. These templates are strings of text with open slots to be filled by another module that specializes in entity recognition and does not care about action prediction (such as Rasa NLU NER or the regular expression based pattern matching technique from HCN).
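The bookkeeping for this fixed-length history can be sketched as follows (an illustrative snippet; the maximum history length corresponds to the 'mem size' hyper parameter discussed in chapter 5, and the class name is hypothetical):

```python
from collections import deque

class DialogMemory:
    """Keeps the most recent turn feature vectors; oldest are dropped first."""
    def __init__(self, mem_size):
        self.turns = deque(maxlen=mem_size)

    def add_turn(self, features):
        # append one turn's feature vector; deque evicts the oldest
        # entry automatically once mem_size is reached
        self.turns.append(features)

    def history(self):
        # the memories fed to the network at the current turn
        return list(self.turns)
```

At each turn, `history()` provides the memory bank and the current turn's features play the role of the question q.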

4.3 Building an Action Policy Ensemble

A Memory Network is just one potential action policy, just like an LSTM. Once there are several models, assuming they have different knowledge (that is, the predictions perform sufficiently well and their errors are not fully correlated), their knowledge could be combined to make a final prediction. This approach has been used successfully for open domain dialog agents (Serban et al., 2017) and also for dialog state tracking (Henderson et al., 2014a). This thesis proposes to use it for task-oriented dialog agents. It explores the three ensemble approaches explained in section 3.3.3, using the Memory Network as proposed in section 4.2 and the LSTM as used by Williams et al. (2017) as action policies. For any of the three ensemble approaches, the input is always the output prediction of each policy, in the form of a vector with the softmax over available classes. For the stacking ensemble, a Multi Layer Perceptron (MLP) is used as classifier, which is in line with Henderson et al. (2014a) and is a popular choice for simple classification tasks such as this one.

4.4 Designing and Extracting Intents and Actions

This section explains important contributions from this work regarding the way it used the bAbI datasets. The exact set of actions predicted by the action policies, as well as the user intents classified by the NLU require important design decisions regarding the dataset used. Before explaining these decisions, the bAbI tasks are properly introduced to better understand the data format and main characteristics.

4.4.1 Dataset Description

The bAbI tasks

The bAbI tasks (Bordes et al., 2016) are a well known dataset aimed to train, test and compare task oriented dialog agents, produced by Facebook AI Research, with many benchmarks available. It is composed of a series of 6 tasks with increasing levels of complexity, making it a popular choice used by authors such as Williams et al. (2017). Each task tests a specific aspect of dialog and builds on the previous tasks in an incremental way. For instance, the first task is meant to test the ability of a system to learn to issue API calls when required, the second task focuses on the ability to update the API calls if the user changes her mind, and so on. Task number 5 tests full dialogs, effectively subsuming the previous 4 tasks; therefore, both HCN and this work focus on it and on task 6, a similar but much harder task explained ahead.

Each task consists of a training, validation and test set. Each set is composed of dialogs, which are lists of user-bot utterance pairs. They are all in the domain of restaurant booking, where the user has a specific need in mind (e.g. thai food in the center of town). The bot must obtain these query filters (also called slot values or entities) in order to perform the query on a restaurant database and offer the results. For task 5, the slot values that the user can provide to filter the database are type of cuisine, location, number of people and price range. For an example of the raw dialog format, see appendix C, figure C.2.

Task                      T5    T6
Average num. utterances   55    54
Average user utterances   13    6
Average bot utterances    18    8
Vocabulary size           3747  1229
Different bot utterances  4212  2406
Train set dialogs         1000  1618
Dev. set dialogs          1000  500
Test set dialogs          1000  1117

Table 4.1: Statistics for bAbI tasks 5 and 6

There is an extra Task 5 test set with out of vocabulary (OOV) entity values, that is, a set of dialogs where the user mentions entity values that were never seen in the training set. To produce this OOV test file, the authors split all cuisine types and locations in half. Training, development and test set use only restaurants from one of those halves, while the test OOV file uses values only from the other.

Given the simplicity of this dataset, it is desirable to see how a model would perform in a more realistic environment. To this end, many authors turn to bAbI task 6 (Williams et al., 2017; Bordes et al., 2016), which is an adaptation of the Second Dialog State Tracking Challenge (Henderson et al., 2014a) to the bAbI tasks format. This task also deals with the restaurant booking domain.

Table 4.1 summarizes relevant statistics about bAbI tasks 5 and 6.

The Second Dialog State Tracking Challenge

This challenge was originally created with the purpose of testing how well a given model can predict the state of a conversation at each turn. To this end, 3 simple bots with different levels of complexity were created to have phone conversations with human users pretending to have an actual need for a restaurant.

The DSTC2 dialogs have many sources of noise, including the actual noise on the phone line; in fact, conversations were recorded under different noise levels on purpose. For some extreme cases, even the human transcribers could not decide what the message was. Added to that, the bots handling the calls are rather simple, relying on basic probabilistic models to track the conversation state and hand-crafted action policies (only the bot from the test set used reinforcement learning in its action policy, making the distributions governing the test dialogs different from those of the training and validation sets). Very often, the bot misunderstands the user message, but the conversation carries the mistake along nonetheless, posing an important challenge for any model trying to predict the bot side.

The conversations were thoroughly annotated with a wide range of meta-data such as transcripts, user provided quality assessments, semantic annotations on both the bot and human utterances, as well as the Automatic Speech Recognition hypotheses on the words uttered by the user. The actual challenge gathered 31 contestants (from 9 research groups) whose models had to learn to estimate the conversation state at each turn, defined as the current value of each slot, the currently requested slots (such as phone number or address of the restaurant) and the currently desired search method (which roughly corresponds to the user intent as defined in section 4.1). There are other Dialog State Tracking Challenges, like the original Dialog State Tracking Challenge (Williams et al., 2013), which focuses on the bus timetable domain with easier dialogs (e.g. users cannot change their mind), or the third challenge, which uses the same training data as the second but provides a test set on the tourist information domain, to explore domain adaptability. There are yet other challenges, up to the sixth, exploring more and more complex aspects, including a switch of language at test time. The Second DSTC is used in this work since it is the most appropriate to the problem at hand, it is freely available, it has already been converted to the bAbI format and it has been used by other authors, including the ones this thesis focuses on the most (Bordes et al., 2016; Williams et al., 2017).

The dialogs are formatted into json files, having one file for the human side and another for the bot. An example fragment showing the file format is available in appendix C, figure C.1.

4.4.2 How to get User Intents and Actions from the bAbI Tasks

This section describes the design decisions made in this thesis to choose which intents to define for each task and how to get the training data for the NLU to detect them, as well as how to define the bot action templates for the action policy to classify on.


Intent           Templates  Count  Example
greet            3          1000   good morning
inform           8          4998   i am looking for a cheap restaurant
deny             4          2404   no i don't like that
affirm           4          1000   it's perfect
request phone    3          755    do you have its phone number
request address  3          779    do you have its address
thankyou         3          1000   thanks
bye              2          1000   no thanks
silence          1          5404   <SILENCE>

Table 4.2: Intent statistics for bAbI task 5 observed in training data (development and test data have the same intents, and the number proportions are similar). The silence special utterance is used for turns in which the bot says something that requires no input from the user, such as in line 8 of figure C.2

Act                      Count  Template
greet                    1000   hello what can i help you with today
on it                    1000   I'm on it
ask location             512    where should it be
ask number of people     480    how many people would be in your party
ask price                494    which price range are looking for
announce search          2000   ok let me look into some options for you
api call                 2000   api call <cuisine> <location> <number> <price>
request updates          2018   sure is there anything else to update
suggest restaurant       2404   what do you think of this option: <restaurant name>
announce keep searching  1404   sure let me find an other option for you
reserve                  1000   great let me do the reservation
give phone               755    here it is <phone>
ask anything else        1000   is there anything i can help you with
bye                      1000   you're welcome
ask cuisine              494    any preference on a type of cuisine
give address             779    here it is <address>

Table 4.3: Bot dialog act statistics for bAbI task 5. Counts come from training set, templates remain the same across train, dev or test sets

Task 5 and Task 5 OOV

In these dialogs the user tries to get a restaurant by providing up to four slot values (namely cuisine, number of people, price and location), potentially requesting either the phone or address once an option is offered. The user can also change her mind during the conversation and the bot needs to accommodate this. The utterances from the user are also very simple, since these conversations were artificially produced. Thus, both the bot and human utterances do not have much diversity. In fact, rule based systems can effectively excel at these tasks, but an important goal in this work is to succeed at them without relying on anything other than machine learning.

By templatizing every user utterance, just 31 possible utterance templates are obtained (e.g. 3 ways to greet, namely 'good morning', 'hello' and 'hi'). This is valuable knowledge for the NLU, as those 31 templates can be further grouped into just 9 intents. Table 4.2 shows the intent classes defined for this dataset as part of the specific model design used in this thesis. Do note this intent classification is not enforced or suggested in any way by the authors of the dataset; but given the lack of grammatical diversity, and the context in which the utterances are used, any author having to classify those utterances by intent could hardly arrive at a different classification.

The bot phrases are also limited and are used very consistently in each context. They can all be represented with 16 templates, as can be seen in table 4.3.

Task 6

This task is the result of taking the DSTC2 dialogs and putting them in the bAbI format. The authors explain further that the training and development set split is not the same as in DSTC2, but the test set is preserved. Furthermore, the utterance diversity both on the human and bot side is significantly greater than that of task 5, requiring more effort and creativity to define user intents and bot actions.


Figure 4.1: Total occurrences of each bot act in bAbI t6 test set

Figure 4.2: Frequencies of each bot act in bAbI t6, both for test and training set

Bot action templates

Since there were different bots in each set, templates seen in train, development and test data are not entirely the same, but they can be easily grouped together based on their meaning. By doing that, 56 templates were identified. This is an easier set for the action policy to work with, instead of all the 4212 possible bot utterances used by Bordes et al. (2016). Some of these templates are very similar to others, like 'I'm sorry but there is no restaurant serving <cuisine>' and 'I am sorry but there is no <cuisine> restaurant that matches your request'. Such subtle changes might be due only to the fact that a different bot produced each, which would constitute a latent variable that is not explicitly accounted for in the models explored in this work or by other authors such as Williams et al. (2017). The complete list of templates is available in appendix A. Figure 4.1 shows the total occurrences of each bot action template in the bAbI task 6 test set.

Since the bot generating the utterances of the test set is different from those of the train set, it is interesting to see how the distributions vary across datasets. Figure 4.2 shows both frequencies side by side. Both distributions are reasonably similar, but the training set is more skewed towards the 3 dominating acts, while others appear only marginally or not at all. Note for instance the huge gap for the repeat act, which almost never appears in the train set but does much more often in the test set. This act is very hard to predict, not only because of this gap but mostly because it depends heavily on the output of the ASR in the bots generating the data, and this information is not available in bAbI task 6. This is one more reason why task 6 is so much harder than task 5.

To have a better idea of how different the training and test distributions are for bAbI task 6, one can use task 5 as a baseline. Figure 4.3 shows this frequency comparison for task 5.

For bAbI task 5, the training distribution is extremely similar to that of the test set. A common measure of how similar two distributions are is the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951), which is 0 for two identical distributions and grows as they diverge. Task 5 reports a KL(train, test) of 0.0004, while that of task 6 is 0.3658, a difference of several orders of magnitude.
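For reference, the divergence can be computed directly from the two act frequency distributions; the following is a minimal sketch (the smoothing constant is an assumption of this sketch, needed because acts that never occur in one set, such as repeat in the train set, would otherwise make the divergence infinite):

```python
import math

def kl_divergence(p, q, eps=1e-9):
    """KL(p || q) for two categorical distributions given as lists of
    probabilities; eps smooths entries that are exactly zero."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))
```

Note that KL is asymmetric: KL(train, test) as reported above is not in general equal to KL(test, train).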

User acts

Since the utterances were produced by actual humans, the diversity is too great to handle with templates as done on the bot side, and templates from the training set would not be of much use on the test set. Fortunately, the DSTC2 dataset is provided with semantic annotations. Every utterance in that dataset has semantic labels provided by Amazon Mechanical Turk, from 13 possible options. Each such utterance can have more than one label. For instance, the utterance 'ah yes im looking for persian food and i dont care about the price' is annotated with labels 'affirm', 'inform(food=persian)' and 'inform(price=dontcare)'. Since the NLU architecture is designed to assign a single intent to each utterance, every possible combination observed in the DSTC2 training set was analyzed and mapped to a single intent. But this by itself still does not answer the question of which intents to define. To answer that, all the possible bot template answers were considered, so that every user intent defined can be acted upon by the bot. Every other possible user utterance is assigned to the 'unknown' intent. Table 4.4 lists the intents used for task 6.

Figure 4.3: Frequencies of each bot act in bAbI t5, both for test and training set

Intent            Description
affirm            affirmation
bye               dialog end cue
dontcare          indicate the answer to the question just asked is irrelevant
inform            provide slot values for a query
negate            answer with a negation
reqalts           indicate the offered option is not desired and ask for a new one
request address   request the address of the offered restaurant
request food      request the food type of the offered restaurant
request location  request the location of the offered restaurant
request phone     request the phone of the offered restaurant
request postcode  request the postcode of the offered restaurant
request price     request the price range of the offered restaurant
silence           used in bAbI format for turns that require no input from the user (e.g. the bot suggesting a restaurant right after an api call is considered a new turn with no input from the user)
unknown           any utterance not matched by the rules

Table 4.4: User intents extracted from bAbI task 6, by using word matching rules on all user utterances from the training set. The rules were determined according to what can be served from the bot side

To produce training data for the NER component of the NLU, regular expressions were used to detect the known values for each entity type in the user utterances. Do note that Williams et al. (2017) uses these regular expressions as their entire NER component, but in this thesis, regular expressions are only used to gather the NLU training data; the trained NLU should be able to deal better with unknown entity values. The complete list of regular expressions used is available in appendix B.
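This bootstrapping step can be sketched as follows (the entity values below are illustrative placeholders, not the actual lists; the real expressions are given in appendix B):

```python
import re

# known values per entity type (illustrative; the real lists are in appendix B)
ENTITY_VALUES = {
    "cuisine": ["thai", "indian", "italian"],
    "location": ["north", "south", "centre"],
    "price": ["cheap", "moderate", "expensive"],
}

def tag_entities(utterance):
    """Return (entity_type, value, start, end) for every known value found,
    which can then be converted into NER training annotations."""
    found = []
    for etype, values in ENTITY_VALUES.items():
        pattern = r"\b(" + "|".join(map(re.escape, values)) + r")\b"
        for m in re.finditer(pattern, utterance.lower()):
            found.append((etype, m.group(1), m.start(), m.end()))
    return found
```

The character offsets are what an NER trainer typically consumes, so each match can be emitted directly as a labelled span.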


Chapter 5

Experimental Setup

This section establishes the research questions that this work intends to answer. For each such question, this section provides specifics about the test environment, such as metrics, conditions and implementation details.

5.1 Implementation Details

This section describes technical specifications of the models developed as part of this work. For a description on the feature sizes see table 5.1.

5.1.1 NLU

Rasa NLU works as a customizable pipeline of models. Each model computes features and takes as input the user utterance as well as any features computed by models earlier in the pipeline. The default Rasa NLU pipeline (as of version 0.11.3) was used. This pipeline uses GloVe embeddings (Pennington et al., 2014), n-grams, Part of Speech (POS) tags and synonyms as its main features. It then feeds them to the sklearn SVM implementation, using a linear kernel to classify intents and grid search to optimize the C parameter, and to sklearn-crfsuite¹ for entity tagging.

5.1.2 Memory Network

The model is an original implementation in tensorflow 1.6.0 of the 'adjacent' architecture as described in Sukhbaatar et al. (2015). Dropout was used in the last layer as regularization, since it proved to increase performance slightly. Early stopping with patience parameter 5 was used; that is, training stops after 5 epochs with no increase in validation set performance, with accuracy as the chosen metric of performance.

The development set was used for hyper parameter tuning. The input dimension depends on the type of features used and whether the bot previous utterance is added to the input or not. Table 5.1 lists the contribution of each parameter to the input dimension. Hyper parameter values are listed in table 5.2.

¹ https://sklearn-crfsuite.readthedocs.io/en/latest/

Feature                   Setting              Task  Size
Entities in current turn  Both                 Both  4 (task 5), 3 (task 6)
Turn                      Both                 Both  1
Intent                    NLU                  Both  16 (task 5), 14 (task 6)
Previous action           Both (offline test)  Both  16 (task 5), 56 (task 6)
Bag of Words              HCN                  Both  85 (task 5), 523 (task 6)
Word embeddings           HCN                  Both  300
Context                   HCN                  Both  1 (task 5), 9 (task 6)

Table 5.1: Dimensions each type of feature adds to the input data both for LSTM and Memory Network policies. Column ‘Setting’ indicates if the feature is used only by the HCN architecture, only by the alternative NLU based setting proposed in this thesis or by both, while column ‘Task’ indicates if the feature is used in task 6, 5 or both. Do note HCN reports 14 context features while this table reports 9. The difference is due to the fact that HCN counts the entities in current turn in those 14 and the fact they use 2 features to denote if the database has been queried, while that feature is not used in this work as explained in section 4.1


Hyper parameter | Value
hops | 2
embedding size | 100
batch size | 32
mem size | 9
epochs | < 35 *
gradient norm clip | 1
keep probability | 0.8
weight initialization | N(0, 1)
optimizer | Adam, lr=1e-4, ε=1e-8
error function | cross-entropy

Table 5.2: Hyper parameters for the Memory Network policy for both bAbI tasks. Values were optimized according to accuracy on the development set, using early stopping (patience 5) to determine the number of epochs.
* Depends on setting due to early stopping; always below 35.
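The forward pass implied by these hyper parameters can be sketched in plain numpy. This is a minimal illustration of the ‘adjacent’ weight tying scheme (hop k uses E[k] as the input embedding and E[k+1] as the output embedding, so A^{k+1} = C^k); temporal encoding, dropout and the final prediction layer of the real model are omitted, and all names and dimensions are illustrative:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def memnet_forward(query, memories, E):
    """Memory network forward pass with 'adjacent' weight tying.
    query: (d_in,) features of the current turn;
    memories: (mem_size, d_in) features of the stored past turns;
    E: list of hops+1 embedding matrices of shape (d, d_in)."""
    hops = len(E) - 1
    u = E[0] @ query                     # embed the current input
    for k in range(hops):
        m = memories @ E[k].T            # input memory representations
        c = memories @ E[k + 1].T        # output memory representations
        p = softmax(m @ u)               # attention over memory slots
        u = u + c.T @ p                  # add the weighted output memories
    return u                             # would feed the action prediction

# Toy dimensions echoing table 5.2: 2 hops, memory size 9.
rng = np.random.default_rng(0)
d_in, d, mem_size = 12, 10, 9
E = [rng.normal(size=(d, d_in)) for _ in range(3)]   # 2 hops -> 3 matrices
u = memnet_forward(rng.normal(size=d_in), rng.normal(size=(mem_size, d_in)), E)
assert u.shape == (d,)
```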

Hyper parameter | Value
hidden size | 128
gradient norm clip | 1
batch size | 1 turn
optimizer | AdaDelta, lr=0.1
epochs | < 35 *
error function | cross-entropy
weight initialization | Xavier with Uniform

Table 5.3: Hyper parameters for the LSTM policy for both bAbI tasks. Values were optimized according to accuracy on the development set, using early stopping (patience 5) to determine the number of epochs.
* Depends on setting due to early stopping; always below 35.

5.1.3

LSTM

The model is an original tensorflow 1.6.0 implementation, based on the implementation details provided by the HCN authors (Williams et al., 2017), except for the action mask, since this work focuses on fully neural action policies with no domain-specific rules. The input is first projected to the ‘hidden size’ space. Gradients are computed every turn, with full unrolling back to the very first turn of the conversation (that is, the batch size is 1 dialog turn). This is feasible because no dialog has more than 50 turns. Input data is the same as in table 5.1; hyper parameters are listed in table 5.3.
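The per-turn full unrolling can be sketched as follows. A minimal tanh recurrent cell stands in for the actual LSTM, and all names and dimensions are illustrative:

```python
import numpy as np

class TinyRNNCell:
    """A minimal tanh recurrent cell standing in for the LSTM (illustrative)."""
    def __init__(self, in_dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.Wx = rng.normal(scale=0.1, size=(hidden, in_dim))
        self.Wh = rng.normal(scale=0.1, size=(hidden, hidden))
        self.hidden = hidden

    def step(self, x, h):
        return np.tanh(self.Wx @ x + self.Wh @ h)

def predict_hidden(cell, turns):
    """Re-run the recurrence from the very first turn (full unrolling,
    batch size = 1 dialog turn); called again after every new turn."""
    h = np.zeros(cell.hidden)
    for x in turns:                 # all turn features observed so far
        h = cell.step(x, h)
    return h                        # would be projected to a softmax over actions

cell = TinyRNNCell(in_dim=3, hidden=4)
turn_features = [np.ones(3), np.zeros(3), np.ones(3)]
h = predict_hidden(cell, turn_features)
assert h.shape == (4,)
```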

5.1.4

Stacking Ensemble

The ensemble takes the outputs of both the Memory Network and LSTM policies as input and outputs a final categorical distribution over actions. Of the possible ensembles described in section 3.3.3, stacking is the only one that requires further implementation details. The input is simply the concatenation of the outputs of both policies. Since the policies are expected to agree very often, learning to find the argmax of both distributions is not difficult for the stacking ensemble. Just like Henderson et al. (2014a), it is implemented as a Multilayer Perceptron, using Keras 2.1.5 with tensorflow backend. It consists of 4 dense layers with ReLU activation, except the last one, which uses softmax. The loss function is categorical cross-entropy and the optimizer is RMSProp. Since the model should be trained on data different from that used to train each of its constituent policies, the exact topology and hyper parameters were determined by training on the same training data and tuning on the development data. Once the topology was fixed, the model was trained from scratch using the development data.
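The forward pass of such a stacking MLP can be sketched in plain numpy. The layer widths below are illustrative assumptions (the thesis fixes only the number of layers and activations), as are all names; the actual model is the Keras implementation described above:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def stacking_forward(p_memnet, p_lstm, layers):
    """Stacking MLP forward pass: the input is the concatenation of both
    policies' action distributions; every dense layer uses ReLU except
    the last one, which uses softmax."""
    h = np.concatenate([p_memnet, p_lstm])
    for W, b in layers[:-1]:
        h = relu(W @ h + b)
    W, b = layers[-1]
    return softmax(W @ h + b)       # final distribution over actions

# Toy setup: 16 actions (task 5), 4 dense layers with assumed widths.
rng = np.random.default_rng(0)
n_actions = 16
sizes = [2 * n_actions, 64, 64, 32, n_actions]
layers = [(rng.normal(scale=0.1, size=(o, i)), np.zeros(o))
          for i, o in zip(sizes[:-1], sizes[1:])]
out = stacking_forward(rng.random(n_actions), rng.random(n_actions), layers)
assert out.shape == (n_actions,) and abs(out.sum() - 1.0) < 1e-9
```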

5.2

Test Conditions

5.2.1

NLU in Isolation

This is one of the three main areas of interest of this thesis. Since this module is the entry point of the dialog agent architecture, it is convenient to check its performance in isolation first, to lay a reliable foundation for subsequent experiments. The following sections deal with end-to-end tests of the dialog agent, which imply a comparison of the NLU against other types of features computed from the user input.


NLU Training Data

Task 5

Remember this task consists of simulated conversations with very limited grammatical diversity. It is therefore possible to identify the intent of every observed sentence with simple regular expressions. One such rule was defined to classify each utterance into one of the intents presented in table 4.2. These rules work equally well on the train, development and test sets.
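A sketch of such regex-based intent rules follows; the patterns and intent names below are invented illustrations, not the actual rules matching table 4.2:

```python
import re

# Hypothetical intent rules in the spirit of the task 5 regexes: each rule
# maps a pattern over the (lowercased) utterance to one intent label.
INTENT_RULES = [
    ("greet", re.compile(r"^(hello|hi|good morning)")),
    ("inform_cuisine", re.compile(r"\b(italian|french|indian) (food|restaurant)")),
    ("request_address", re.compile(r"\baddress\b")),
]

def classify_intent(utterance):
    """Return the first intent whose pattern matches, else 'unknown'."""
    for intent, pattern in INTENT_RULES:
        if pattern.search(utterance.lower()):
            return intent
    return "unknown"

assert classify_intent("hello there") == "greet"
assert classify_intent("may i have the address") == "request_address"
assert classify_intent("gibberish input") == "unknown"
```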

For entity recognition, Rasa NLU requires example sentences where the entities are marked (i.e. indicating character start and end). Regular expressions were used to detect known entity values in the train set. Task 5 OOV, however, includes values that were never seen in the train set, so detecting those relies completely on the NLU's generalization capacity.

36797 training examples were obtained from the task 5 train and development sets.

Task 6

The greater variety of task 6 makes it unfeasible to apply the same approach as in task 5. Instead, intents were obtained from the DSTC2 semantic annotations, as explained in section 4.4.2, yielding examples for each of the intents in table 4.4.

For entity recognition, the approach was identical to the one used for task 5, since there is no fundamental difference in this regard.

14186 training examples were obtained from the DSTC2 train set.

NLU Performance

Task 5 is too simple for an NLU, since the grammatical structure of the sentences is well known. Therefore, for this task, this component will only be tested end to end. For task 6, the DSTC2 semantic annotations from the test set were used as the ground truth. The NLU is tested with the usual classification metrics of precision, recall and F1 score.

5.2.2

Policy Performance

This is the original question motivating this work. Being a relatively recent model, the literature is yet to discover most of the Memory Network's potential. This thesis intends to see how well a Memory Network performs on the common benchmark of the bAbI tasks, by using an architecture similar to HCN, putting it in direct comparison against an LSTM policy. Another interesting question then arises: can a Memory Network and an LSTM trained on the same data benefit from each other's knowledge? To this end, the ensemble is also tested, as yet another policy.

The Memory Network is tested on both task 5 and task 6 to check whether there is any significant difference in performance under different levels of noise and complexity. The LSTM policy is tested in the same settings to allow for comparison.

The following subsections explain the metrics used to test this performance and the specific conditions of the tests.

Evaluation Metrics

This is a typical classification setting. There are many common metrics for such problems, such as precision, recall, accuracy or F1 score. This work uses turn accuracy and percentage of perfect dialogs, allowing direct comparison with Bordes et al. (2016) and Williams et al. (2017). Turn accuracy is broadly used in much of dialog agent research, including dialog state tracking (Young et al., 2013; Henderson et al., 2014b).

Turn accuracy refers to the fraction of correctly predicted actions from every turn in every dialog, expressed as a percentage. Perfect dialog accuracy refers to the fraction of perfectly predicted dialogs (i.e. a dialog where every turn was correctly predicted), also expressed as a percentage.
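Both metrics can be computed in a few lines of Python; the dialogs below are invented toy examples:

```python
def dialog_metrics(dialogs):
    """dialogs: list of dialogs, each a list of (predicted, true) action pairs.
    Returns (turn accuracy %, perfect dialog %)."""
    turns = [p == t for d in dialogs for (p, t) in d]
    perfect = [all(p == t for (p, t) in d) for d in dialogs]
    return (100.0 * sum(turns) / len(turns),
            100.0 * sum(perfect) / len(perfect))

# Two toy dialogs: the first is fully correct, the second gets 1 of 2
# turns wrong, so turn accuracy is 3/4 and perfect dialogs are 1/2.
dialogs = [[("greet", "greet"), ("ask_area", "ask_area")],
           [("greet", "greet"), ("bye", "ask_food")]]
print(dialog_metrics(dialogs))  # → (75.0, 50.0)
```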

Other Conditions

Online vs Offline Test

The previous bot action is a very important feature, since only this and the context features provide the policy with the dialog state information needed to determine the next action. This feature is used by the HCN architecture and by the models explored here. There are three possible test settings regarding this feature:

1. Offline testing: the ground truth prediction, as present in the test set, is added as a feature for the next turn. This is the approach used to compare against Williams et al. (2017) and Bordes et al. (2016), although it is unrealistic since such ground truth cannot be available in an actual live test scenario.


2. Online testing: actual predictions are used as the features, so errors accumulate and degrade performance.

3. No previous action features: a compromise between the previous two approaches. Since having the previous turn information is troublesome in realistic scenarios, it is interesting to know how models would perform without this information at all.
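The three settings can be sketched with a single evaluation loop; `policy` is a hypothetical stand-in taking the turn features and the previous action, and all names are illustrative:

```python
def run_policy(policy, dialog, mode="offline"):
    """Evaluate a policy under the three previous-action settings.
    `policy(features, prev_action)` returns a predicted action;
    `dialog` is a list of (features, true_action) pairs."""
    preds, prev = [], None
    for features, true_action in dialog:
        pred = policy(features, None if mode == "none" else prev)
        preds.append(pred)
        # Offline feeds the ground truth forward; online feeds the prediction,
        # so online errors accumulate across turns.
        prev = true_action if mode == "offline" else pred
    return preds

# Toy policy that simply repeats the previous action (or greets first).
echo = lambda feats, prev: prev or "greet"
dialog = [(None, "greet"), (None, "ask_cuisine"), (None, "bye")]
assert run_policy(echo, dialog, "offline") == ["greet", "greet", "ask_cuisine"]
assert run_policy(echo, dialog, "online") == ["greet", "greet", "greet"]
assert run_policy(echo, dialog, "none") == ["greet", "greet", "greet"]
```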

Literal match vs act match

In both tasks 5 and 6, the policies predict an action in the form of a message template. A fully formed message is produced after the template slots are filled with the entity values tracked either by the NLU or by the pattern matching rules. This yields two possible test settings:

1. Literal match: the predicted action is filled with the tracked slot values to produce a fully formed message. This message is compared against the ground truth message character by character, and only perfect matches count. This effectively tests both the correctness of the prediction and the entity tracking.

2. Act match: since entity tracking can fail (especially for task 6), it is interesting to separate errors produced by entity tracking from those produced by a bad prediction from the policy. Act match comparison compares only the predicted action (i.e. the bot act) and not the fully formed message.
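The two comparison modes can be sketched as follows; the template syntax and slot names are illustrative assumptions, not the actual bAbI message formats:

```python
def literal_match(pred_template, slots, true_message):
    """Fill the predicted template with the tracked slot values and compare
    the fully formed message character by character: this penalizes both
    wrong predictions and wrong entity tracking."""
    return pred_template.format(**slots) == true_message

def act_match(pred_template, true_template):
    """Compare only the predicted act (the template), ignoring slot values."""
    return pred_template == true_template

slots = {"cuisine": "italian"}
assert act_match("api_call {cuisine}", "api_call {cuisine}")
assert literal_match("api_call {cuisine}", slots, "api_call italian")
# Same correct act, but bad entity tracking fails the literal comparison:
assert not literal_match("api_call {cuisine}", {"cuisine": "french"},
                         "api_call italian")
```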

Comparison between HCN and NLU Features

To see if the user intent feature computed by the NLU is more informative than the word based features used by HCN, the user natural language input will be processed in both ways and the experiments will be performed with both possible feature sets. For a fair comparison, the ‘NLU features’ will comprise not only the user intent, but also the dialog context and entity flag features defined in section 4.1 (with one exception for task 5, where context flags are not used in any case, in line with Williams et al. (2017)). The exact features and their sizes for both the NLU and the HCN feature settings are given in table 5.1.
