
Source Awareness Memory End-to-End for Task-oriented Dialogue Learning

Bachelor thesis Credits: 18 EC

Tyler Cools

11004851

Supervisor

Jiahuan Pei

BSc Artificial Intelligence
University of Amsterdam, Faculty of Science
Science Park 904, 1098 XH Amsterdam


Abstract

Task-oriented Dialogue Systems are widely used nowadays. They aim to make human life easier by answering the questions users ask them. This thesis uses data of restaurant bookings. Performance improvements of Task-oriented Dialogue Systems are reached by using End-to-End networks. Whereas the traditional method uses handcrafted rules and is domain specific, End-to-End networks learn from previous data and can thus be scaled to multiple domains. Despite the many advantages this type of network offers, the system still struggles with retrieving the right information from the Knowledge Base. This problem is addressed by applying Source Awareness to the existing Memory End-to-End network. This technique splits up the dialogue and is an important extension because it enables the model to store data more efficiently, which ensures better use of the memory's attention. The experimental results show that the model outperforms the baseline model on almost every task. The mistakes the system makes are analysed to give a clear overview of the model's strengths and weaknesses.


Acronyms

DST Dialogue State Tracker
DSTC2 Dialog State Tracking Challenge 2
GMemN2N Gated Memory End-to-End
KB Knowledge Base
MemN2N Memory End-to-End
N2N End-to-End
NLG Natural Language Generation
NLU Natural Language Understanding
NUC Next-Utterance-Classification
OOV Out-Of-Vocabulary
PL Policy Learner
QA Question Answering
RNN Recurrent Neural Network
SA Source Awareness
SAMemN2N Source Awareness Memory End-to-End
TDS Task-oriented Dialogue System


Contents

1 Introduction
  1.1 Motivation
  1.2 Research question
  1.3 Contributions and main findings
  1.4 Overview
2 Problem definition
3 Related work
  3.1 Pipeline method
  3.2 End-to-End Networks
4 Background
  4.1 Memory End-to-End network
  4.2 Source Awareness
5 Methods
6 Experiments
  6.1 Datasets
  6.2 Evaluation
7 Results
8 Discussion
9 Conclusion
10 Future work
11 Appendix


1 Introduction

1.1 Motivation

In recent years, making reservations or appointments by instructing a computer program has become commonplace. This is done by Task-oriented Dialogue Systems (TDSs). A dialogue is a conversation between two or more agents, which in a TDS are a computer and a human. A TDS is a computer program that communicates with a human in a natural way and is widely used in virtual assistants such as Apple's Siri or Google's Assistant (Arora et al. [1]).

The traditional method to construct a TDS is the pipeline method, which requires extensive human effort because its rules are handcrafted. In the last few years, neural networks have emerged as a solution to this problem. End-to-End (N2N) networks use only one module and learn from given training data to find patterns in the dialogues.

Weston et al. [17] concluded that, although N2N networks use previous data to produce an output, their memory is typically too small and not compartmentalized enough to accurately remember facts from the past. This makes neural networks hard to use in Question Answering (QA), in which a lot of inference is required. Zaremba and Sutskever [21] showed that a Recurrent Neural Network (RNN) has difficulties outputting the same sequence as the input it has just read.

Another problem arises when using N2N networks to compose a TDS. Because the pipeline method works extremely well on domain-specific tasks, the N2N network not only needs to achieve high accuracy over all domains, but also to beat the traditional model on the domain-specific data. This results in an increase in the amount of data, which makes it harder for a neural network to process it all.

In order to resolve the problems discussed above, Sukhbaatar et al. [16] demonstrated a promising method which uses a novel RNN based on the Memory Network implemented by Weston et al. [17]. Whereas that network model was not easy to train using backpropagation and required supervision at each layer of the network, the MemN2N network can be trained from input-output pairs and is applicable to tasks where supervision is not available (Sukhbaatar et al. [16]).

For accessing the data more efficiently, Stienstra [15] uses Source Awareness (SA). This technique splits up the dialogue into three different parts, which ensures a more efficient way of storing the information. By using this technique, information can be retrieved more efficiently and the results achieve higher accuracy.

In this thesis, data of restaurant bookings will be used, in which users ask the TDS to book a table at a restaurant for a specific location, price range and number of persons. The model is tested with two Knowledge Bases (KBs), containing only known entities and both known and unknown entities, respectively.


1.2 Research question

In this thesis, Source Awareness (SA) will be implemented in the Memory End-to-End (MemN2N) network to ensure that information is retrieved more efficiently. The research question (RQ) for this thesis is: How will the addition of Source Awareness affect the results of the current Memory End-to-End network? To answer this question, three sub-questions (SQs) are formulated:

• SQ1 How does Source Awareness affect the performance of a TDS?

• SQ2 Is the model a good fit or is it over-fitting or under-fitting?

• SQ3 When and why does the model outperform the regular MemN2N network and when does the model fail?

1.3 Contributions and main findings

The contributions of this thesis mainly include:

1. A split Memory End-to-End network, which can capture Source Awareness for a better performance. This network is called the Source Awareness Memory End-to-End (SAMemN2N) network.

2. A case study on error types.

The main findings are:

1. Source Awareness is a useful technique for a TDS because it can store a dialogue in a more efficient way, in which the results, user and system utterances are stored in different parts of the model.

2. The strength of the model (according to the cases where the MemN2N model is outperformed) lies in the tasks where the regular KB is used. The model shows significant improvements on Tasks 3 and 4, displaying options and providing extra information, respectively.

3. The drawbacks of the proposed model lie in the cases in which unknown entities occur. This is implied by the results of the tasks using the KB containing entities that were not seen by the network.

1.4 Overview

The rest of this thesis is structured as follows. Section 2 describes the problem formulation. Section 3 gives some of the previous work related to End-to-End networks. Section 4 formally defines how Memory End-to-End networks work and explains how Source Awareness works. Section 5 describes the methods used in this thesis and in section 6 the setup of the experiments is explained. Section 7 gives the results, which are discussed in section 8. Section 9 concludes and in section 10 future research is proposed.


2 Problem definition

This section describes the formal definition of a TDS and specifies how dialogues are composed.

Dialogues in a TDS are conversations between a human and a computer. The speakers take turns, also called utterances, where a turn is defined as a unit of text without an interruption from the other speaker. A turn can therefore contain multiple sentences (Stienstra [15]).

To collect information from outside the dialogue, an external KB is used, in which, for the restaurant booking domain, the cuisine type and contact information of restaurants are stored. During a dialogue, the system can access this KB to look for the restaurant matching the demands of the user.

A TDS uses the KB together with the dialogue to generate a response $R_t$, as shown in equation 1. In this equation, $D_t$ stands for the dialogue at time $t$ in the conversation (Stienstra [15]).

$$TDS_{\Theta}(D_t, \mathrm{KB}) \rightarrow R_t \quad (1)$$

The KB is accessed when the system produces an utterance starting with the words 'api call', which triggers a lookup of the requested information in the KB.
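To make the mapping in equation 1 and the 'api call' trigger concrete, below is a minimal, hypothetical sketch in Python. The record fields, the toy KB and the function name are illustrative assumptions and not part of the actual system.

```python
from typing import Dict, List

# Toy KB with a single restaurant record; the field names are assumptions for illustration.
KB = [
    {"name": "resto_paris_cheap_french", "cuisine": "french",
     "location": "paris", "price": "cheap"},
]

def tds_respond(dialogue_history: List[str], kb: List[Dict[str, str]]) -> str:
    """Toy stand-in for the learned mapping TDS_Theta(D_t, KB) -> R_t."""
    api_call = next((t for t in reversed(dialogue_history)
                     if t.startswith("api call")), None)
    if api_call is not None:
        # An utterance of the form "api call <cuisine> <location> ..." triggers a KB lookup.
        _, _, cuisine, location, *_ = api_call.split()
        matches = [r for r in kb if r["cuisine"] == cuisine and r["location"] == location]
        return matches[0]["name"] if matches else "no restaurant found"
    return "hello, what can I help you with today?"

print(tds_respond(["hi", "api call french paris two cheap"], KB))
```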

Looking into the dataset, a dialogue typically consists of five different types of utterances, which leads to the five different tasks proposed by Sakai et al. [13]. In Figure 1, the tasks are visualised using a sample conversation.

• Task 1 issuing API calls. For the system to find the best answer for the user, the system asks questions to fill in all the required fields for generating the API call.

• Task 2 updating API calls. Users can decide that they have different demands and can adjust their request up to four times. The system then updates the API call.

• Task 3 displaying options. When the first two tasks are executed, the system queries the KB to find restaurants that fit the user.

• Task 4 providing extra information. When the user agrees to a restaurant, the system queries the knowledge base to give the additional facts about the restaurant.

• Task 5 conducting full dialogues. This task is a full dialogue and thus combines all of the tasks 1-4.


Figure 1: Five different tasks explained in a sample dialogue (Stienstra [15]).

In section 1.2, the main RQ is presented together with three SQs. Below, it is described how this thesis aims to answer those three sub-questions in order to answer the main RQ.

To answer SQ1, the values of $\hat{a}$ and $a$ are used, being the predicted and true label, respectively. These symbols are listed in Table 1 and their equations are explained in section 6.

To provide an answer to SQ2, the loss value is used. Section 5 explains the meaning of this value and discusses its importance.

The answer to SQ3 is given by inspecting the mistakes made by the system (see Table 3).


Symbol          Meaning
$TDS_\Theta$    Formal definition of a Task-oriented Dialogue System
$u_t$           User utterance at time $t$
$s_t$           System utterance at time $t$
$r_t$           Result at time $t$
$U$             Utterances of the conversation spoken by the user, $u_1, \ldots, u_{t-1}$
$S$             Utterances of the conversation spoken by the system, $s_1, \ldots, s_{t-1}$
$R$             Result history of the conversation, $r_1, \ldots, r_{t-1}$
$D_t$           Dialogue at time $t$
$R_t$           Response at time $t$
$\Phi$          Maps an utterance to the bag-of-words dimension
$\mu$           Embedded bag-of-words vector of the user utterances
$\eta$          Embedded bag-of-words vector of the system utterances
$A$             Embedding matrix
$p$             Probability vector
$m$             Embedded bag-of-words vector
$\hat{a}$       Predicted label
$a$             True label
$q$             User's last utterance (query)
$u$             Embedded version of query $q$
$o$             Output vector
$k$             The current hop the system is in
KB              Knowledge Base
$\bar{d}$       Mean difference
$T$             T-statistic
$S_d$           Standard deviation
$n$             Number of samples
$L$             Loss

Table 1: Glossary of symbols used throughout the thesis.

3 Related work

3.1 Pipeline method

The traditional method of building a TDS is the pipeline approach. This approach consists of four components (Chen et al. [3]):

1. Natural Language Understanding (NLU)
2. Dialogue State Tracker (DST)
3. Policy Learner (PL)
4. Natural Language Generation (NLG)


The four components run separately and are interdependent (Stienstra [15]). This means that all the components need to be pre-trained, and if one module changes, the whole model needs to be retrained. Also, the NLU requires a lot of human effort because the system relies on handcrafted rules (Chen et al. [3]). Another drawback of the traditional approach is the use of slot filling in NLU. In slot filling, different slots are chosen to fill during a conversation (Bordes et al. [2]). For example, in the choice of a restaurant, slots can be the price, the number of people and the city you want to dine in. In the sentence "I want to book a table at an Italian restaurant for four persons in New York", the slot-value pairs are {City: New York; Cuisine: Italian; Number of people: 4}. Slot filling works very well on domain-specific tasks, but is difficult to scale to other domains, since all slots for all possible domains would have to be encoded to achieve this (Stienstra [15]). Recent work from Liu and Lane [7] shows an attention network that improves the performance of a regular RNN on slot filling, proving that a network with an attention mechanism improves on the slot filling method.
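As a minimal illustration of the slot-value representation described above, the following hypothetical keyword-based slot filler is a sketch only; a real pipeline NLU relies on handcrafted rules or a trained tagger, and the slot names and word lists here are assumptions.

```python
import re

# Illustrative word lists; a real NLU module would not hardcode these.
CUISINES = {"italian", "french", "spanish", "british"}
CITIES = {"new york", "london", "paris", "rome", "madrid", "bombay"}

def fill_slots(utterance: str) -> dict:
    """Extract {slot: value} pairs from one user utterance with simple keyword matching."""
    text = utterance.lower()
    slots = {
        "cuisine": next((c for c in CUISINES if c in text), None),
        "city": next((c for c in CITIES if c in text), None),
    }
    count = re.search(r"for (\w+) (?:persons|people)", text)
    slots["number_of_people"] = count.group(1) if count else None
    return slots

print(fill_slots("I want to book a table at an Italian restaurant for four persons in New York"))
# {'cuisine': 'italian', 'city': 'new york', 'number_of_people': 'four'}
```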

In the DST, a representation of the dialogue so far is composed. Whereas this method uses the previous dialogue state and the output of the NLU module, Young et al. [20] use an approach where dialogues behave as a Markov process in which each state is modelled by a probability function.

The PL generates an action based on the output of the DST. Whereas in simple cases a rule-based system is used to create a mapping from the dialogue state to an action (Yan et al. [19]), Williams et al. [18] use an RNN to provide that mapping (Stienstra [15]).

In the final module, Natural Language Generation, the answer is generated using the output of the Policy Learner.

3.2 End-to-End Networks

End-to-End (N2N) models use a neural network which learns from previous conversations. These networks can therefore be used on multiple domains, which results in a network that is no longer focused on specific domains (Bordes et al. [2]). The focus on multiple domains sounds promising, but this approach still has its flaws. By making the model applicable to multiple domains, the model requires a large amount of data and, more importantly, should be capable of processing all this data. Furthermore, because the traditional method works extremely well on domain-specific tasks, for N2N networks to be useful, they must be able to perform as well as the traditional methods.

N2N dialogue systems are generally categorised as generative or retrieval-based models.

Generative methods. In these methods, a response is generated word for word given the dialogue history (Stienstra [15]). These approaches are mainly non-task-oriented. However, Eric et al. [4] do use this method in a task-oriented setting in which they use a KB.


Retrieval methods. These methods generate a response by selecting the answer with the maximum probability out of a list of candidates, which makes it a classification problem. This method of evaluation is proposed by Lowe et al. [9], is called Next-Utterance-Classification (NUC), and is used in this thesis.

Because N2N networks look promising due to their ability to scale to multiple domains, many recent models exploit this type of network to improve the performance of a TDS.

The model proposed in this thesis is also based on N2N networks. It builds on the MemN2N network by Sakai et al. [13], which is a retrieval-based N2N network that contains an attention function for reading the memory. Further details about this network are given in section 4.1.

4 Background

4.1 Memory End-to-End network

Sakai et al. [13] designed the MemN2N network, which uses an explicit memory to which, during the conversation, the utterances of the user and the computer are appended. Thereafter, an attention function is used for reading the memory. This is achieved with multiple layers, also called hops, in which the output generated by a layer is taken as input in the next layer. Moreover, the network can use the previously stored historical dialogues and the short-term context to reason about the required response.

MemN2N networks perform extremely well on factual reasoning and deduction, but on multi-fact question answering, positional reasoning or dialogue-related tasks the network faces some difficulties (Liu and Perez [8]).

Input of memory representation. When going through the conversation, at every time step $t$, the previous utterance of the user $u_t$ and the response of the system $s_t$ are appended to the memory. The goal at time $t$ is thus to choose $s_t$ (Sakai et al. [13]). Every utterance $u_1, \ldots, u_{t-1}$ and $s_1, \ldots, s_{t-1}$ is converted to a memory vector using the embedding matrix $A$, where $\Phi$ maps the utterance to a bag-of-words dimension. The result is shown in equation 2. In the original Memory Network model presented by Weston et al. [17], the utterances do not contain information on which utterance in the conversation they are. This is added in $\Phi$ by extending the vocabulary to encode the index $i$ into the bag of words (Sukhbaatar et al. [16]).

$$m = (A\Phi(u_1), A\Phi(s_1), \ldots, A\Phi(u_{t-1}), A\Phi(s_{t-1})) \quad (2)$$

The last utterance of the user is called the query $q$ and is embedded using the same embedding matrix $A$ to obtain the state $u$.


In equation 3, the match between each memory entry and the query is computed by taking their inner product, followed by a softmax, which returns a probability vector $p$ (Sukhbaatar et al. [16]). This step is an important change because it produces an attention over the memory and ensures reasoning over previous utterances.

$$p_i = \operatorname{Softmax}(u^T m_i) \quad (3)$$

Output of memory representation. The output vector of the model is computed using the embedded bag-of-words vectors $m$ (equation 2) and their probabilities $p$ (equation 3) by summing each input vector weighted by its probability:

$$o = \sum_i p_i m_i \quad (4)$$

Generating the final prediction. In equation 5, the sum of the output vector $o$ and the input embedding $u$ is computed, in which $k$ is the current hop the system is in (Sukhbaatar et al. [16]).

$$u^{k+1} = o^k + u^k \quad (5)$$

The result of equation 5 is used to produce the predicted label (Raunak [12]):

$$\hat{a} = \operatorname{argmax}(\operatorname{softmax}(W u^{k+1})) \quad (6)$$

As can be derived from equations 5 and 6, the output of a hop is used as input to the following hop.
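The following is a minimal sketch of one forward pass through equations 3 to 6, with a randomly generated memory and query standing in for the embedded utterances; the hop count, dimensions and output matrix $W$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # embedding dimension (illustrative)
m = rng.normal(size=(5, d))             # 5 embedded memory entries (eq. 2, assumed given)
u_state = rng.normal(size=d)            # embedded query u (assumed given)

def softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z - z.max())
    return e / e.sum()

def memory_hop(u_k: np.ndarray, memory: np.ndarray) -> np.ndarray:
    """One hop: attention (eq. 3), weighted sum (eq. 4), state update (eq. 5)."""
    p = softmax(memory @ u_k)           # p_i = Softmax(u^T m_i)
    o = p @ memory                      # o = sum_i p_i m_i
    return o + u_k                      # u^{k+1} = o^k + u^k

for _ in range(3):                      # three hops, as in Figure 2(b)
    u_state = memory_hop(u_state, m)

# Final prediction (eq. 6): scores over a list of candidate responses via matrix W.
n_candidates = 4
W = rng.normal(size=(d, n_candidates))  # assumed output weight matrix
a_hat = int(np.argmax(softmax(u_state @ W)))
print("predicted candidate index:", a_hat)
```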

Figure 2 gives a more detailed visualisation of the composition of the network, with Figure 2a being a single-layered network and Figure 2b a multi-layered one. Figure 2b shows that the output of the previous layer is passed to the next layer, which happens using equation 5.


Figure 2: A more detailed visualisation of the composition of the Memory Network, consisting of one layer in (a) and of three layers in (b) [16].

The entire model is trained using stochastic gradient descent (SGD), minimizing a standard cross-entropy loss between $\hat{a}$ and the true label $a$ (Sakai et al. [13]).

4.2 Source Awareness

When looking for entities in the dialogue history, different types of entities are necessary. For instance, when making a reservation for a restaurant, relevant entity types are the cuisine type, location, contact details and the name of the restaurant. This information is stored in different parts of the dialogue. In the recently published thesis by Stienstra [15], Source Awareness is applied. This technique separates the dialogue into three parts: user-history, system-history and result-history. In this way, differences between parts of the dialogue are exploited. This provides a better focus on the different parts of the dialogue and can thus extract information more precisely. Whereas the cuisine type and location are stored in the user-history, the contact details and name of the restaurant can be found in the result-history.

Using this technique thus requires less memory, given the possibility to look up the information in smaller parts of the dialogue. Figure 3 shows an example of the three different dialogue parts.


Figure 3: Splitting the dialogue into parts. The dialogue is split up into a user, system and result history, which consist of $u_1, \ldots, u_{t-1}$, $s_1, \ldots, s_{t-1}$ and $r_1, \ldots, r_{t-1}$, respectively [15].

The language that is used in the three different parts differs strongly (Stienstra [15]). The vocabulary of the result history is very small and its structure is roughly the same, containing dense information. In the user history, however, there is a wide variety in the use of words and words are often redundant. With Source Awareness, the importance of utterances can be better determined. The existing MemN2N implementation by Raunak [12] is used in this thesis and extended with Source Awareness.

5 Methods

This section describes the methods that are used in this thesis. The methods enhance the regular MemN2N network with Source Awareness, which results in the Source Awareness Memory End-to-End (SAMemN2N) network.

When taking a more detailed look at the composition of a dialogue in general, several things catch the eye. Typically, a dialogue consists of responses from the user, responses from the dialogue system itself and the results of KB queries. This means that some pieces of information only ever occur in specific parts of the dialogue. When the system is, for instance, looking for the type of cuisine, this is exclusively found in the responses of the user, whereas the contact details of the restaurant can be found in the query results.


The Source Awareness technique uses this information to split up the dialogue history into three different parts (a minimal splitting sketch is given after the list):

• User-history (U): all the utterances from the user, $u_1, \ldots, u_{t-1}$.

• System-history (S): all the utterances from the system, $s_1, \ldots, s_{t-1}$.

• Result-history (R): all the output from the knowledge base queries, $r_1, \ldots, r_{t-1}$.
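Below is the splitting sketch referred to above; it assumes every stored turn is tagged with its source ('user', 'system' or 'result'), which is an illustrative representation rather than the exact one used in the implementation.

```python
from typing import Dict, List, Tuple

def split_history(history: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Split a dialogue history into user-, system- and result-history (U, S, R)."""
    parts: Dict[str, List[str]] = {"user": [], "system": [], "result": []}
    for source, utterance in history:
        parts[source].append(utterance)
    return parts

history = [
    ("user", "hi"),
    ("system", "hello what can I help you with today?"),
    ("user", "can you book a table with italian cuisine in rome?"),
    ("system", "api call italian rome two cheap"),
    ("result", "resto rome cheap italian 3stars R phone 0123456789"),
]
parts = split_history(history)
print(len(parts["user"]), len(parts["system"]), len(parts["result"]))  # 2 2 1
```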

Section 4.1 described the formal definition of the MemN2N and showed how an attention function is implemented in the network. To extend this network with Source Awareness, some adjustments need to be made. Whereas in equation 2 the embedded bag-of-words vector contains the utterances of both the user and the system, this model splits that equation into two, with equation 7 containing the bag-of-words vector of the user utterances and equation 8 containing the bag-of-words vector of the system utterances.

$$\mu = A\Phi(U) \quad (7)$$

$$\eta = A\Phi(S) \quad (8)$$

These two equations are both used to compute the probability vector:

$$p_i = \operatorname{Softmax}(u^T \mu_i \eta_i) \quad (9)$$

Finally, equation 4 is adapted to handle multiple bag-of-words vectors:

$$o = \sum_i p_i \mu_i \eta_i \quad (10)$$
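A minimal sketch of equations 7 to 10 is given below. It assumes bag-of-words matrices for $U$ and $S$ are already available and interprets the product $\mu_i \eta_i$ in equations 9 and 10 as an element-wise product; that interpretation, like the sizes and names, is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
vocab_size, d = 50, 8                       # illustrative sizes
A = rng.normal(size=(vocab_size, d))        # shared embedding matrix A

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Assumed given: bag-of-words matrices Phi(U) and Phi(S), one row per stored
# utterance of the user and system history, plus the embedded query u.
n_mem = 6
Phi_U = rng.integers(0, 2, size=(n_mem, vocab_size))
Phi_S = rng.integers(0, 2, size=(n_mem, vocab_size))
u = rng.normal(size=d)

mu = Phi_U @ A                              # eq. 7: mu  = A Phi(U)
eta = Phi_S @ A                             # eq. 8: eta = A Phi(S)

# eq. 9: attention over the paired user/system memories.
p = softmax((mu * eta) @ u)

# eq. 10: output vector as the probability-weighted sum of mu_i * eta_i.
o = p @ (mu * eta)
print(o.shape)                              # (8,)
```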

Loss function. In neural networks, the loss is an interpretation of how well the model is doing on the training and validation set. Unlike accuracy, which is expressed as a percentage, the loss is measured as a summation of the errors made for each example in the training or validation set (Peng et al. [11]). In this thesis, the loss is defined as the softmax cross-entropy between the predicted answer and the true label, calculated using equation 11: the vector of predicted scores is first normalized using a softmax, after which the cross-entropy with the one-hot encoded true label $a$ is computed.

$$L = -\sum_i a_i \log\left(\operatorname{softmax}(W u^{k+1})_i\right) \quad (11)$$


The cross-entropy tends toward zero as the neural network improves but never reaches zero, which makes it a good measure for interpreting the loss of the network (Nielsen [10]).
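A small numeric sketch of this softmax cross-entropy is shown below; the candidate scores and label index are made up for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(logits: np.ndarray, true_index: int) -> float:
    """Softmax cross-entropy between predicted scores and a one-hot encoded label."""
    probs = softmax(logits)
    return float(-np.log(probs[true_index]))

logits = np.array([2.0, 0.5, -1.0])          # made-up candidate scores
print(cross_entropy(logits, true_index=0))   # small loss: correct candidate scores highest
print(cross_entropy(logits, true_index=2))   # larger loss: correct candidate scores lowest
```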

6 Experiments

This section describes the design of the experiments, gives information about the datasets and explains how the results are computed.

6.1 Datasets

All the frameworks are tested with data of restaurant bookings. Two types of data are used: the bAbI dialogues and the Dialog State Tracking Challenge 2 (DSTC2) dataset.

bAbI dialogues. This data consists of dialogues simulated by a computer. The dialogues of this dataset are generated with the framework introduced by Bordes et al. [2]. A big challenge in dialogues is handling entities. Because the data contains a lot of different entities, it is inevitable that at some point the system sees an entity it has never seen before. Therefore, the KBs are split in half. The first KB is used to create the standard train, validation and test sets. The second KB is used to generate test dialogues, termed Out-Of-Vocabulary (OOV) test sets (Bordes et al. [2]). The entities in this set are thus unseen in any training dialogue and are expected to be harder to handle.

Dialog State Tracking Challenge 2. Whereas the bAbI dialogues are simulated by a computer, the DSTC2 dataset consists of real user-system data (Henderson et al. [6]). This data also contains restaurant booking data and has three fields: type of cuisine, location and price range, which have 91, five and three choices, respectively. Because this dataset is real user-system data, it contains more noise.

6.2 Evaluation

Accuracy. The performance of the model is measured as turn-level accuracy, which is defined as the number of correct responses out of all responses (Stienstra [15]). Because the MemN2N is a retrieval-based method, the response is produced by selecting from a list of candidates instead of generating an output word by word (Lowe et al. [9]).
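A tiny sketch of turn-level accuracy for a retrieval model is given below; the predicted and reference responses are made-up examples.

```python
from typing import List

def turn_accuracy(predicted: List[str], reference: List[str]) -> float:
    """Turn-level accuracy: fraction of turns where the selected candidate matches the reference."""
    correct = sum(p == r for p, r in zip(predicted, reference))
    return correct / len(reference)

predicted = ["i'm on it", "api call italian rome two cheap", "which price range are you looking for?"]
reference = ["i'm on it", "api call italian rome two cheap", "ok let me look into some options for you"]
print(turn_accuracy(predicted, reference))   # 0.666...
```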


Paired t-test. Once the results are obtained, a check must be conducted to see if they are significant. A paired t-test is used to evaluate the mean difference between matched data points, in this case the results of the MemN2N and SAMemN2N networks (Hedberg and Ayers [5]). First, the difference between the two data points $x_i$ and $y_i$ is calculated: $d_i = x_i - y_i$. Then their mean difference $\bar{d}$ is calculated. Next, the standard error of the mean difference is computed, where $S_d$ is the standard deviation and $n$ is the number of samples:

$$SE(\bar{d}) = \frac{S_d}{\sqrt{n}} \quad (12)$$

Lastly, the t-statistic is calculated:

$$T = \frac{\bar{d}}{SE(\bar{d})} \quad (13)$$

This value can be looked up in the t-distribution to obtain the p-value, which indicates the significance of the result.
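The following is a minimal sketch of equations 12 and 13; the accuracy pairs are placeholders, not the values reported in Table 2.

```python
import numpy as np

def paired_t_statistic(x: np.ndarray, y: np.ndarray) -> float:
    """Paired t-test statistic following equations 12 and 13."""
    d = x - y                                 # per-task differences d_i = x_i - y_i
    d_bar = d.mean()                          # mean difference
    se = d.std(ddof=1) / np.sqrt(len(d))      # SE(d_bar) = S_d / sqrt(n)
    return d_bar / se                         # T = d_bar / SE(d_bar)

# Placeholder per-task accuracies for the two models (NOT the values from Table 2).
sa = np.array([0.90, 0.85, 0.99, 0.70, 0.66])
mem = np.array([0.80, 0.86, 0.75, 0.60, 0.41])
t = paired_t_statistic(sa, mem)
print(round(t, 3))   # look up in a t-distribution with n-1 degrees of freedom for the p-value
```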

7 Results

This section describes the results of the experiments and shows the types of mistakes the system makes.

Accuracy. Table 2 shows the average turn accuracy on the five bAbI dialogue tasks and the DSTC2 (Task 6). Tasks 1 to 5 are tested with both the OOV and the regular KB. Because Task 6 consists of real conversations between human and system, no OOV KB exists. The model outperforms the regular MemN2N network on almost every task. A more elaborate analysis is given later in this section.

             MemN2N    SAMemN2N
Task 1        99.90     99.90 ± 0.060
Task 2       100.00    100.00 ± 0.014
Task 3        74.90     99.69 ± 0.288
Task 4        59.50     93.09 ± 1.220
Task 5        96.10     87.60 ± 0.451
Task 1 OOV    78.86     79.55 ± 2.621
Task 2 OOV    74.67     78.87 ± 0.010
Task 3 OOV    75.20     72.33 ± 0.918
Task 4 OOV    56.98     56.98 ± 0.000
Task 5 OOV    64.29     60.41 ± 0.462
Task 6        40.60     65.78 ± 0.501

Table 2: Results of the MemN2N and SAMemN2N networks. Both networks are tested on Tasks 1 to 5 with the regular KB and the OOV KB (mean and standard deviation for SAMemN2N).


Loss. To gain more insight into the performance of the model, the validation and training losses are examined. Figures 4 and 5 plot the training and validation loss of the tasks without and with the OOV KB, respectively. All tasks in both figures show a decrease in loss as the number of iterations increases. Also, the training and validation losses have roughly the same value. The difference between the two figures is in the noise of the curves. Whereas in Figure 4 the values show very little noise, Figure 5 shows more noise in its curves. The difference in noise is explained in section 8.

The only figure that stands out is Figure 4f in which the training loss is significantly larger than the validation loss. This will be discussed in section 8. The losses of the regular MemN2N network are plotted to compare the results. These losses are presented in Figure 6.

Figure 4: Training and validation losses on different tasks using the SAMemN2N network without the OOV KB. Panels (a)-(f) correspond to Tasks 1-6.


Figure 5: Training and validation losses on different tasks using the SAMemN2N network with the OOV KB. Panels (a)-(e) correspond to Tasks 1-5.

Figure 6: Training and validation losses of the regular MemN2N network on different tasks. Panels (a)-(f) correspond to Tasks 1-6.


Comparing Figures 4 and 5 with Figure 6, the losses in the first two figures show more stable curves and their values are much lower. Considering the value of the losses, this shows a significant improvement of the Source Awareness model over the regular network.

Significance. With the results in Table 2 and the equations in section 6.2, the t-statistic can be calculated to find its corresponding p-value. This resulted in a p-value of 0.00015.

Mistake inspection. As mentioned earlier, the model improves on almost every task. On Task 5, however, the model decreases in accuracy relative to the regular MemN2N. To provide more insight into the outcome of the model, all the cases in which the system predicted a wrong answer were divided into the following three mistake categories, each further divided by severity. In total, this leads to seven types of mistakes, which are listed in Table 3 together with their occurrences.

1. Wrong follow-up. This can occur when, for instance, the system asks for the price range of the user, but this was mentioned earlier in the conversation.

Predicted answer: which price range are you looking for?
Real answer: ok let me look into some options for you

2. API call mistakes. When the system makes an API call, it sometimes makes mistakes. This type of mistake is divided into three sub-types, distinguished by the number of errors in a single API call.

• API call with one mistake
Predicted answer: api call spanish madrid six cheap
Real answer: api call spanish paris six cheap

• API call with two mistakes
Predicted answer: api call spanish rome two cheap
Real answer: api call spanish london six cheap

• API call with more than two mistakes
Predicted answer: api call italian rome eight cheap
Real answer: api call british bombay six moderate

3. System answer mistakes. After the system makes an API call, it returns an answer, and it sometimes returns a wrong one. This mistake is divided into three sub-types, distinguished by the number of errors made in a single system answer.

• Answer with one mistake
Predicted answer: what do you think of this option: resto bombay cheap spanish 7stars
Real answer: what do you think of this option: resto bombay cheap spanish 6stars

• Answer with two mistakes
Predicted answer: what do you think of this option: resto madrid moderate italian 7stars
Real answer: what do you think of this option: resto bombay moderate italian 2stars

• Answer with more than two mistakes
Predicted answer: what do you think of this option: resto rome expensive french 1stars
Real answer: what do you think of this option: resto bombay expensive british 3stars

As can be seen in Table 3, the type of mistake which occurs the most is the 'follow-up' mistake. This mistake occurs when the system asks a question to retrieve information which is already in the conversation. Another mistake which occurs often is when the system proposes a restaurant but makes only one error. The worst case is when the system makes an API call with more than two mistakes; in this case, the system has almost every detail of the API call wrong. Although few mistakes are made on the first four tasks, it is interesting to see where they are made. More details on dialogues and their mistakes are given in the appendix in section 11.

Tasks      Follow up     1-API      2-API        >2-API       1-system    2-system    >2-system
T1         3 (0.45)      0 (0)      0 (0)        0 (0)        4 (0.55)    0 (0)       0 (0)
T2         0 (0.20)      0 (0.30)   0 (0)        0 (0)        1 (0.50)    0 (0)       0 (0)
T3         8 (0.21)      0 (0)      0 (0)        0 (0)        25 (0.64)   0 (0)       6 (0.15)
T4         1 (0)         0 (0)      0 (0)        0 (0)        17 (0.08)   40 (0.17)   172 (0.75)
T5         1055 (0.46)   88 (0.04)  180 (0.08)   373 (0.16)   472 (0.20)  48 (0.02)   99 (0.04)
T1 OOV     132 (0.11)    0 (0)      959 (0.77)   40 (0.03)    107 (0.09)  0 (0)       0 (0)
T2 OOV     0 (0)         2 (0)      1647 (0.82)  350 (0.17)   0 (0)       0 (0)       0 (0)
T3 OOV     258 (0.09)    0 (0)      0 (0)        0 (0)        77 (0.03)   351 (0.13)  2068 (0.74)
T4 OOV     0 (0)         0 (0)      0 (0)        0 (0)        0 (0)       65 (0.04)   1445 (0.96)
T5 OOV     1059 (0.14)   0 (0.0)    457 (0.06)   1543 (0.21)  415 (0.06)  419 (0.06)  3489 (0.47)
T6         2210 (0.53)   37 (0.01)  12 (0)       103 (0.03)   1694 (0.43) 18 (0)      0 (0)

Table 3: Occurrence of each mistake type in Tasks 1 to 6, with the relative fraction in brackets (mean over 10 runs).


Hyperparameter investigation. Because the algorithm takes some time to run a task (roughly two hours per task), some tweaking of the hyperparameters is performed. All of the above results are obtained using 200 epochs. But because some of the test accuracies come close to, or even reach, 100%, it is interesting to investigate whether 200 epochs is too many for some tasks. The accuracy against the number of epochs is shown in Figures 7a and 7b, showing the results without and with the OOV Knowledge Base, respectively. In Figure 7a the accuracies converge very quickly to a stable value, but in Figure 7b the values are much more unstable.

Figure 7: Accuracies of Tasks 1 to 6 using the SAMemN2N network plotted against the number of epochs. Figures 7a and 7b show the accuracy using the regular KB and the OOV KB, respectively.

8 Discussion

This paper extended the Memory End-to-End network proposed by Weston et al. [17] by implementing Source Awareness. Despite the results showing large overall improvements, some results are not as expected. These results will be discussed in this section.

The results of using the KB with and without OOV entities are quite different. This can be seen in three performance measurements. Table 2 shows that using the regular KB yields more improvement than the OOV KB. Furthermore, there is a noticeable difference between Figures 4 and 5. Whereas the diagrams in Figure 4 show very little noise, Figure 5 shows more noise in its curves. This can be explained by the unknown entities in the OOV KB introducing more noise.


More differences between using the regular and the OOV KB can be seen in Tasks 1 and 2. In Table 3, the types of mistakes are listed with their occurrences. In Tasks 1 and 2, a large number of mistakes in the API calls are made when using the OOV KB. With the regular KB, however, the system hardly makes any mistakes. Making many mistakes in API calls is not surprising, since these two tasks involve issuing and updating API calls.

Table 3 also shows that Task 3 OOV produces more mistakes in the system responses. Whereas Task 3 with the regular KB mostly makes at most one mistake in its system responses, Task 3 with the OOV KB mostly generates responses with more than two mistakes.

When examining the difference between Task 5 with the regular KB and with the OOV KB, the mistakes being made are quite different. Whereas the first makes half of its mistakes in asking the wrong follow-up question, the latter mostly makes mistakes in generating a response. Using the OOV KB has a positive effect on asking the follow-up question, since the relative share of this mistake is much lower than when using the regular KB.

The last difference between the results of using the regular and the OOV KB is in Figure 7. Whereas in Figure 7a the values converge to a stable value very quickly, Figure 7b shows that when using the OOV KB, the values are not stable. From these results it can be inferred that the system struggles with unknown entities.

The mistakes on Tasks 3 and 4 with both KBs show that the model struggles with generating the right response. This is in line with the expectations, because these two tasks generally involve displaying options and dealing with extra information.

The high loss values in Figure 4f compared to the other diagrams in Figure 4 arise because the DSTC2 dataset contains real data between human and system. Another struggle of this model regards the word embeddings. When the system knows it must output a phone number, word embeddings make it hard to distinguish between phone numbers and addresses (Weston et al. [17]).

The number in Table 2 that catches the attention is the standard deviation of Task 4 OOV for the SAMemN2N. Despite the fact that the model has been tested 20 times, the mean of this task is always exactly the same. Table 3, however, shows that the mistakes are not always exactly the same. It seems like the model somehow reaches a maximum on this task, which is also suggested by Figure 7b, in which the line of Task 4 is a perfectly straight line. An explanation for these results has not yet been found.

The most significant improvement in performance is the one for Task 6, which improved from 40.60 to 65.78 compared to the original model. This is promising because this is the dataset which uses real conversations between human and system instead of the simulated data in the first five tasks.

To place the results discussed above into perspective, some strengths and weaknesses of this study should be noted. This thesis used several measurements to be able to say as much as possible about the data. Whereas Sukhbaatar et al. [16] rely on accuracy alone, this study used additional measurements on top of the accuracy to check the results. First, the paired t-test was used to check the significance of the results. Furthermore, the losses of the model were studied to ensure the model is not over- or underfitting. Lastly, a study of mistakes was conducted to find the strengths and flaws of the model. This mistake study could have been extended such that more could be said about specific flaws of this model.

9 Conclusion

This thesis introduced the SAMemN2N model, which outperforms the regular MemN2N on almost every task from 1 to 6.

One of the decreases in accuracy is in Task 5, with both the OOV and the non-OOV KB. Looking at the types of mistakes shows that the mistakes the system makes in this task mainly involve suggesting a restaurant. Most of the time it makes more than two mistakes in its suggestion. This implies that this network has much difficulty dealing with extra information. This answers SQ1, "how is the accuracy affected?"

The answer to SQ2, "is the model a good fit or is it over- or underfitting?", can be found in the differences between Figures 4 and 6. From these figures it can be derived that the network behaves as expected, having a lower loss than the original MemN2N and showing the expected decrease.

The model has been tested 20 times; the standard deviation being at most 1.2%, together with the results of the paired t-test, indicates that the results of this thesis are significant.

The decrease in performance when using the OOV Knowledge Base is due to the unknown entities. On the follow-up questions, however, the system improves compared to the tasks using the regular KB. This answers SQ3, "when and why does our model win (compared with the baseline) and fail (compared with the golden labels)?"

In conclusion, it can be stated that the Source Awareness Memory End-to-End model outperforms the existing Memory End-to-End model. Measurements show that the results are significant and that the model is a good fit. An error inspection was conducted, and one of the model's flaws lies in asking the right follow-up question. This answers the research question "How will the addition of Source Awareness affect the results of the current Memory End-to-End network?"

10 Future work

Despite the results showing the value of this thesis, some improvements could still be made.

Firstly, section 4 discussed word embeddings. Whereas word embeddings are useful for handling synonyms, words denoting entities can make the system struggle.


Entities like a new restaurant are often not seen in the training data and thus no word embedding is available. To address this problem, Sakai et al. [13] extended the MemN2N network with "match type" features. Each entity type (e.g. cuisine type, location) is added to the vocabulary, so that any word matching a specific entity type can be recognised even though this word has never been seen before in the training data. This resolves the problem of the missing word embedding and thus does not result in failure. The results of this technique were promising, but unfortunately the code for this technique was not available; contact was made with the authors of the paper, but the code was not shared. For future research, match type features could be added to extend the SAMemN2N network.

The second addition to the MemN2N implemented by Sakai et al. [13] is the measurement of per-dialogue accuracy. This could be an important performance measure and can thus be added in future work.

Lastly, Liu and Perez [8] extended the MemN2N to regulate the interactions between the memory and the controller modules. Whereas in the MemN2N the interactions needed to be decided a priori or with the use of supervised techniques, Liu and Perez [8] use highway networks to ensure dynamic regulation of the memory interaction (Srivastava et al. [14]). This Gated Memory End-to-End (GMemN2N) network is thus capable of dynamically conditioning the memory reading operation on the controller state at each hop, which ensures that only the relevant information flows through each hop.

11 Appendix

This section shows dialogues with their predicted answers. The tables contain columns for time, speaker and dialogue, which indicate the point in the conversation, the speaker and the sentence, respectively. In some tables a big jump in time is made (for example in Table 5, in which the dialogue jumps from 3 to 42). In these cases, all the turns in between are results from the KB and not relevant to show in the table. A dialogue from Task 2 is left out because this task reaches 100%.

Time  Speaker  Dialogue
1     User     hi
2     System   hello what can I help you with today?
3     User     can you book a table for four people in a moderate price range with spanish cuisine in bombay?
4     System   i'm on it
5     System   <SILENCE> ok let me look into some options for you

SAMemN2N        api call spanish bombay four moderate
MemN2N          api call british bombay four moderate
Correct answer  api call spanish bombay four moderate [Correct]

Table 4: Task 1 sample dialogue. In this example the original MemN2N network predicts the answer wrong, whereas the SAMemN2N predicts the right answer.


Time  Speaker  Dialogue
1     System   resto paris expensive french 2stars R location paris
2     System   resto paris expensive french 2stars R number six
3     System   resto paris expensive french 2stars R price expensive
...   ...      ...
42    System   resto paris expensive french 2stars R rating 2
43    User     good morning
44    System   hello what can I help you with today?
45    User     can you book a table?
46    System   I'm on it
47    System   <SILENCE> any preference on a type of cuisine
48    User     with french cuisine
49    System   where should it be?
50    User     paris please
51    System   how many people would be in your party?
52    User     for six please
53    System   which price range are looking for?
54    User     in a expensive price range please
55    System   ok let me look into some options for you
56    System   <SILENCE> what do you think of this option: resto paris expensive french 7stars
57    User     no this does not work for me
58    System   sure let me find an other option for you
59    System   <SILENCE> what do you think of this option: resto paris expensive french 5stars
60    User     do you have something else?
61    System   sure let me find an other option for you
62    System   <SILENCE> what do you think of this option: resto paris expensive french 4stars
63    User     no this does not work for me
64    System   sure let me find an other option for you

SAMemN2N        what do you think of this option: resto paris expensive french 2stars
MemN2N          what do you think of this option: resto rome expensive italian 1stars
Correct answer  what do you think of this option: resto paris expensive french 2stars [Correct]

Table 5: Task 3 sample dialogue in which the SAMemN2N network outputs the correct answer, in contrast to the regular MemN2N network.

Time  Speaker  Dialogue
1     System   resto rome cheap spanish 1stars R phone
2     System   resto rome cheap spanish 1stars R cuisine spanish
3     System   resto rome cheap spanish 1stars R address
4     System   resto rome cheap spanish 1stars R location rome
5     System   resto rome cheap spanish 1stars R number eight
6     System   resto rome cheap spanish 1stars R price cheap
7     System   resto rome cheap spanish 1stars R rating 1
8     User     hello
9     System   hello what can I help you with today?
10    User     can you book a table at resto rome cheap spanish 1stars?
11    System   great let me do the reservation
12    User     may I have the address of the restaurant?

SAMemN2N        here it is resto rome cheap spanish 1stars address
MemN2N          here it is resto bombay expensive italian 3stars phone
Correct answer  here it is resto rome cheap spanish 1stars address [Correct]

Table 6: Task 4 sample dialogue in which the SAMemN2N network outputs the correct answer, in contrast to the regular MemN2N network.


Time Speaker Dialogue 1 User hi

2 System hello what can I help you with today? 3 User can you book a table wit british cuisine? 4 System I’m on it

5 System <SILENCE> where should it be? 6 User rome please

7 System how many people would be in your party? 8 User we will be six

9 System which price range are looking for? 10 User in a expensive price range please 11 System ok let me look into some options for you 12 System <SILENCE> api call britis rome six expensive 13 User instead could it be in bombay?

14 System sure is there anything else to update? 15 User actually I would prefer with italian food 16 System sure is there anything else to update? 17 User actually I would prefer in a cheap price range 18 System sure is there anything else to update? 19 User no

20 System ok let me look into some options for you 21 System <SILENCE> api call italian bombay six cheap 22 System resto bombay cheap italian 4stars

.. ... ...

47 System resto bombay cheap italian 1stars R rating 1 48 System <SILENCE> what do you think of this option:

resto bombay cheap italian 3stars 49 User no I don’t like that

50 System sure let me find an other option for you 51 System <SILENCE> what do you think of this option:

resto bombay cheap italian 2stars 52 User do you have something else?

53 System sure let me find an other option for you 54 System <SILENCE> what do you think of this option:

resto bombay cheap italian 1stars 55 User I love that

56 System great let me do the reservation 57 User can you provide the address?

58 System here it is resto bombay cheap italian 1stars address 59 User thank you

60 System is there anything I can help you with 61 User no thank you

62 System you’re welcome

MemN2N what do you think of this option: resto bombay cheap italian 2stars what do you think of this option: resto bombay expensive british 5stars Real answer what do you think of this option: resto bombay cheap italian 2stars [Incorrect]

Table 7: Dialgue of Task 5 in which the network outputs the wrong answer in contrast to the regular MemN2N network.


Time  Speaker  Dialogue
1     System   <SILENCE> Hello, welcome to the Cambridge restaurant system. You can ask for restaurants by area, price range or food type. How may I help you?
2     User     I need to find a cheap restaurant
3     System   There are restaurants in the cheap price range. What type of food do you want?
3     User     any
4     System   You are looking for a restaurant serving any kind of food right?
4     User     yes
5     System   What part of town do you have in mind?
5     User     south
6     System   api call R cuisine R location cheap
6     System   thanh binh R post code thanh binh post code
...   ...      ...
144   System   the gandhi R price cheap
145   System   the gandhi R rating 0

MemN2N          the missing sock is a nice place in the east of town and the prices are cheap
SAMemN2N        nandos is a nice place in the south of town and the prices are cheap
Real answer     nandos is a nice place in the south of town and the prices are cheap [Correct]

Table 8: Dialogue of Task 6 in which the SAMemN2N network outputs the correct answer, in contrast to the regular MemN2N network.


References

[1] Arora, S., Batra, K., and Singh, S. (2013). Dialogue system: A brief review. arXiv preprint arXiv:1306.4134.

[2] Bordes, A., Boureau, Y.-L., and Weston, J. (2016). Learning end-to-end goal-oriented dialog. arXiv preprint arXiv:1605.07683.

[3] Chen, H., Liu, X., Yin, D., and Tang, J. (2017). A survey on dialogue systems: Recent advances and new frontiers. ACM SIGKDD Explorations Newsletter, 19(2):25–35.

[4] Eric, M., Krishnan, L., Charette, F., and Manning, C. D. (2017). Key-value retrieval networks for task-oriented dialogue. In Proceedings of the 18th Annual SIGdial Meeting on Discourse and Dialogue, pages 37–49.

[5] Hedberg, E. and Ayers, S. (2015). The power of a paired t-test with a covariate. Social science research, 50:277–291.

[6] Henderson, M., Thomson, B., and Williams, J. D. (2014). The second dialog state tracking challenge. In Proceedings of the 15th Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL), pages 263–272.

[7] Liu, B. and Lane, I. (2016). Attention-based recurrent neural network models for joint intent detection and slot filling. arXiv preprint arXiv:1609.01454.

[8] Liu, F. and Perez, J. (2017). Gated end-to-end memory networks. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, volume 1, pages 1–10.

[9] Lowe, R., Serban, I. V., Noseworthy, M., Charlin, L., and Pineau, J. (2016). On the evaluation of dialogue systems with next utterance classification. In 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue, pages 264–269.

[10] Nielsen, M. A. (2015). Neural networks and deep learning, volume 25. Determination press USA.

[11] Peng, B., Lu, Z., Li, H., and Wong, K.-F. (2015). Towards neural network-based reasoning. arXiv preprint arXiv:1508.05508.

[12] Raunak, V. (2017). Tensorflow implementation of learning end-to-end goal-oriented dialog. https://github.com/vyraun/chatbot-MemN2N-tensorflow.

[13] Sakai, A., Shi, H., Ushio, T., and Endo, M. (2017). End-to-end memory networks with word abstraction and contextual numbering for goal-oriented tasks. Dial. Syst. Technol. Challenges, 6.

[14] Srivastava, R. K., Greff, K., and Schmidhuber, J. (2015). Highway networks. arXiv preprint arXiv:1505.00387.

[15] Stienstra (2018). Role-aware recurrent entity networks for task-oriented dialogue systems (unpublished Master thesis). University of Amsterdam, Amsterdam, Netherlands.

[16] Sukhbaatar, S., szlam, a., Weston, J., and Fergus, R. (2015). End-to-end memory networks. In Cortes, C., Lawrence, N. D., Lee, D. D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems 28, pages 2440–2448. Curran Associates, Inc.

[17] Weston, J., Chopra, S., and Bordes, A. (2015). Memory networks. In International Conference on Learning Representations (ICLR).

[18] Williams, J. D., Asadi, K., and Zweig, G. (2017). Hybrid code networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, volume 1, pages 665–677.

[19] Yan, Z., Duan, N., Chen, P., Zhou, M., Zhou, J., and Li, Z. (2017). Building task-oriented dialogue systems for online shopping. In AAAI, pages 4618– 4626.

[20] Young, S., Gašić, M., Thomson, B., and Williams, J. D. (2013). POMDP-based statistical spoken dialog systems: A review. Proceedings of the IEEE, 101(5):1160–1179.

[21] Zaremba, W. and Sutskever, I. (2014). Learning to execute. arXiv preprint arXiv:1410.4615.
