
MSc Artificial Intelligence

Master Thesis

Anti-overestimation Reinforcement Learning

Policy for Task-completion Dialogue System

by

CHANG TIAN

11653205

March 11, 2020

48 EC June 2019 - March 2020

Daily Supervisor:

Dr. Pengjie Ren (UvA)

Assessor and Examiner:

Prof. Dr. Maarten de Rijke (UvA)


Abstract

A task-completion dialogue system (TDS) can be built by connecting functional modules into a pipeline framework. Typically, this pipeline contains natural language understanding, dialogue state tracking, dialog policy, and natural language generation components. The dialog policy module is an essential part of a dialog agent, as it issues logical responses based on the conversation history; its maturity and robustness therefore directly affect the dialog success rate. Multiple dialog policy approaches exist, and recently there has been increasing, widespread interest in reinforcement learning (RL) based dialog policies. An RL dialog policy is trained on sequences of data points, each usually containing the current state, action, reward, and next state, so that patterns in time series data can be learned. The exchanges between a user and an agent in a dialog can also be viewed as time series data whose features are logically connected, which makes an RL-based dialog agent a suitable solution. The method uses exploration during training, so it visits more data patterns and improves its ability to generalize; this increases the flexibility of the dialog policy under different situations. The dialog agent's goal is that the dialog policy chooses the optimal action in each state, attaining the optimal Q value of the state-action pair, which in turn represents the corresponding optimal state value.

Many reinforcement learning algorithms have been proposed for deployment in dialogue systems. Among them, one RL schema is a Bellman equation based training algorithm. When computing the loss for the dialog policy model, the target estimate uses a maximum operator to choose the highest Q value for each state among all candidate actions. However, until the dialog policy model is fully trained, the maximum Q value for a state across all candidate actions is likely to be higher than the ground-truth optimal Q value of that state. This leads to an overestimation of the optimal Q value in the target component of the Bellman equation during loss computation.

The overestimation problem results in a biased training loss, which can lead to inaccurate Q value function parameters. The dialog policy model is then misled and can only approximate inaccurate Q values of state-action pairs, so overestimation degrades model performance. Earlier work proposed a solution to this Q value overestimation problem called double DQN, which decouples action selection from action evaluation during training. However, the action selection is performed according to the optimal Q values of the DQN being trained, which is similar to the target DQN, so there is still a chance of choosing an action that causes overestimation in the target DQN. Duel DQN has also been proposed; it estimates Q values more accurately by using two streams instead of the conventional single stream, but it does not change the training algorithm, so the max operator in the loss function still leads to overestimation. More recently, averaged DQN estimates the ground-truth Q value by averaging the Q values of K previously trained networks, but this method requires substantial computational effort.

This work proposes a new anti-overestimation RL training algorithm, called the dynamic partial average model, which aims to mitigate the overestimation problem and yield a more accurate DQN. In the target component of the Bellman equation, instead of using only the maximum Q value of the target DQN, it leverages both the maximum and the minimum Q value of the target DQN. The model allocates partial weights to the maximum and minimum Q values to form an average, and changes these weights dynamically during training. This average is used as the estimated optimal Q value in the target component of the Bellman equation. In this way, we aim to solve the overestimation problem with a partial average between the maximum and minimum Q values of the target DQN. By shifting the weights at the final training stage, we give most confidence to the maximum Q value of the target DQN, so the dialog policy model being trained is more likely to converge to the optimal parameter point, given that the stable target DQN is already set with a good parameter set.

We run experiments in the context of booking movie tickets, leveraging an existing user simulator to ease empirical comparison of algorithms. The dynamic partial average dialog policy proposed in this work and its model variants achieve better performance than conventional anti-overestimation networks. After training the dialog policy, we also test its robustness under simulated errors of different granularity levels applied to the language understanding module outputs. Finally, we investigate how the average number of turns of the dynamic partial average model changes under different error granularities.


Acknowledgements

I would like to thank Doctor Pengjie Ren and Professor Maarten de Rijke for their supervision and research help during my thesis research work.


Contents

1 Introduction 5
  1.1 Motivation 5
  1.2 Hypothesis 9
  1.3 Contributions 10

2 Related work 11
  2.1 Task completion dialogue system 11
    2.1.1 Overview 11
    2.1.2 End to end dialogue system 11
    2.1.3 Natural language understanding 12
    2.1.4 Dialog state tracker 12
    2.1.5 Natural language generation 12
  2.2 Dialogue policy learning 12

3 Background 14
  3.1 Reinforcement learning (RL) 14
    3.1.1 Reinforcement learning progression 14
    3.1.2 Reinforcement learning for dialogue model 14
    3.1.3 Markov decision process 15
    3.1.4 Q learning 15
    3.1.5 Prioritized replay 15
  3.2 Task completion dialogue system 16
    3.2.1 Natural language understanding 16
    3.2.2 State tracker 16
    3.2.3 Dialog policy 17
    3.2.4 Natural language generation 20

4 Model 21
  4.1 Overview 21
  4.2 Dynamic partial average (DPAV) 21
    4.2.1 Motivation and hypotheses 21
    4.2.2 Methodology 23
  4.3 Duel dynamic partial average 23
  4.4 Averaged DQN with dynamic partial average 24
  4.5 Duel averaged DQN 24

5 Experiment setup 25
  5.1 Research questions 25
  5.2 Dataset 25
  5.3 User simulator 26
    5.3.1 User goal 26
    5.3.2 User action 27
    5.3.3 User dialogue state 27
  5.4 Baselines and model variants 27
  5.5 Evaluation metrics 28

6 Experiment 29
  6.1 Performance comparison (RQ1) 29
  6.2 Dialog policy robustness analysis (RQ2) 30
  6.3 Hyperparameter setup of dynamic partial average (RQ3) 31
  6.4 Analysis for number of target network (RQ4) 33

7 Conclusion and future work 34


Chapter 1

Introduction

1.1 Motivation

Dialog systems are usually divided into two categories: chitchat systems (Cho and Julien, 2016) and task completion dialogue systems (Wen et al., 2016b). These systems have an increasing impact on social and industrial life. Via either auditory or textual input, people communicate with machines in natural language for task completion or daily entertainment. Chitchat systems are usually designed to meet the mental needs of the user, such as emotional support, while a task completion dialogue system aims to assist users in completing a particular task by understanding their goals and requests over multiple communication turns and providing information. These dialogue systems mainly involve information retrieval and dialog reasoning over multiple turns. In this thesis, we focus on the task completion dialogue system.

For task completion dialogue systems, there are two main research schemes. One is the end-to-end task completion dialogue model (Liu and Lane, 2018a), which backpropagates the training loss between modules by training the whole system in an end-to-end fashion. The other is a pipeline framework that contains independently trained, modularly connected components such as natural language understanding (NLU) (Hakkani-Tür et al., 2016), dialogue state tracking (DST) (Lee and Stent, 2016), dialog policy (DP) (Gašić and Young, 2013), and natural language generation (NLG) (Wen et al., 2015b).

Previous research (Liu and Lane, 2018b) describes the currently popular pipeline framework for task completion dialogue systems. Figure 1.1 presents such an architecture, consisting of the natural language understanding (NLU), dialog state tracker (DST), dialog policy (DP), and natural language generation (NLG) modules, together with a knowledge base (KB).

Figure 1.1: Task Completion Dialogue System

The auditory user speech is converted to text in the automatic speech recognition module (Renals and Lecture, 2017); the text is then passed to the NLU module, where the user's intention and other key information are extracted in the form of slot-value pairs. These slot-value pairs are formatted as a semantic frame and passed to the dialog state tracker, which records the current state of the dialogue by continuously updating its state dictionary. The state representation issued by the DST is passed to the dialog policy module, where a dialog action is selected based on the state and data from an external knowledge base. The dialog action chosen by the dialog policy is passed to the NLG module, where it is converted into natural language.

A pipeline-based task completion dialogue system thus contains four modules: NLU, DST, dialog policy, and NLG. Whether in an end-to-end system or a modularly connected pipeline system, there is a dialog policy module, which is essential for the dialog system to choose suitable dialog actions and maintain a reasonably logical discourse. In the end-to-end framework (Liu and Lane, 2018b), the dialogue system learns a supervised dialogue policy by following expert actions via behavior cloning; a cross-entropy loss is computed for dialog state tracking and system action prediction. Besides the end-to-end scheme, in the conventional pipeline framework the dialog policy has been implemented in different ways. Probabilistic models are one approach (Harms et al., 2018), and rule-based policies (Seneff et al., 1991) are also studied in the research community. More recently, reinforcement learning (Zhao and Eskenazi, 2016a) has proven to be a promising solution, since it performs exploration, which improves model robustness without requiring a lot of training data, and it is good at learning patterns from sequences of training data.

In this master thesis, we propose an RL training method that can be used in the dialog policy module. We train the dialog policy module independently and investigate the performance of the model and its variants with and without noise from the upstream NLU module. A reinforcement learning (Sutton et al., 1998) based policy can select actions based on the dialog state representation, so it can be deployed as the policy module of a task completion dialogue system. The Q value function is one way to represent the value of a state-action pair. For every state there are multiple candidate actions, and for every state-action pair we use the Q function to approximate its value (Pazis and Parr, 2011); the higher the Q value, the more valuable the state-action pair. Within Q value function based RL research there are two main directions: research on reinforcement learning training algorithms (Farebrother et al., 2018) and research on Q value approximation methods (Wang et al., 2015).

Previous work (Li et al., 2017) shows that a deep Q-network can be applied to the dialog policy module and achieve decent performance in a noisy environment. However, DQN often suffers from Q value overestimation. In the target component of the Bellman equation, DQN uses the estimated optimal Q value of a state-action pair to represent the optimal state value, which should equal the ground-truth optimal Q value. This rests on the assumption that the Q values of state-action pairs are estimated accurately, but before the target DQN has been set with an accurate parameter set through sufficient training, the estimated Q values are not accurate. For a given state, some inaccurate Q values of state-action pairs will therefore be higher than the ground-truth optimal Q value of that state. Since the target component of the Bellman equation uses a max operator to select the action with the maximum Q value, which should be the optimal Q value for that state, the max operator will in this situation select an action whose Q value is probably higher than the ground-truth optimal Q value. An overestimated Q value thus enters the target component of the Bellman equation during loss computation. This leads to biased training, inaccurate parameter settings, and ultimately degraded dialog agent performance. Table 1.1 illustrates the problem.

| State s | Epoch 1: target DQN Q values | Epoch 2: target DQN Q values | Ground truth: target DQN Q values |
|---|---|---|---|
| a1 | Q(s, a1) = 0.9 | Q(s, a1) = 0.95 | Q(s, a1) = 1.0 |
| a2 | Q(s, a2) = 1.2 | Q(s, a2) = 1.1 | Q(s, a2) = 0.8 |
| a3 | Q(s, a3) = 0.2 | Q(s, a3) = 0.4 | Q(s, a3) = 0.6 |
| DQN selected action | a2 | a2 | a1 |
| Estimated optimal state value | Q*(s) = V*(s) = 1.2 | Q*(s) = V*(s) = 1.1 | Q*(s) = V*(s) = 1.0 |

Table 1.1: Overestimation problem of DQN

In Table 1.1, for state s, the ground-truth maximum Q value is 1.0, which represents the optimal state value of s in the target component of the Bellman equation. However, early in training the parameter set of the target DQN is inaccurate, and the target DQN uses the max operator to choose the action with the maximum Q value to represent the optimal state value. In the simulated epochs 1 and 2, the target DQN therefore chooses a2, corresponding to the maximum Q values 1.2 and 1.1 respectively. The overestimated optimal Q value misleads training and degrades the dialog policy.
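The effect in Table 1.1 can also be reproduced numerically: when each Q estimate carries zero-mean noise, the max operator systematically returns a value above the true optimum. The following short simulation is only an illustration of this statistical effect, not part of the thesis experiments; the noise level and number of actions are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
true_q = np.array([1.0, 0.8, 0.6])   # ground-truth Q(s, a) for actions a1, a2, a3
noise_std = 0.3                      # assumed estimation noise of an untrained target DQN

overestimates = []
for _ in range(10_000):
    noisy_q = true_q + rng.normal(0.0, noise_std, size=true_q.shape)
    # DQN target uses max_a Q(s, a); compare it with the true optimal value max_a Q*(s, a)
    overestimates.append(noisy_q.max() - true_q.max())

print(f"mean bias of max-operator target: {np.mean(overestimates):+.3f}")
# Typically prints a clearly positive number, i.e. systematic overestimation.
```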

To address the overestimation problem, double DQN (Van Hasselt et al., 2016) was proposed; it mitigates overestimation by decoupling action selection and action evaluation in the loss computation, using the DQN being trained and the target DQN respectively. This indeed mitigates overestimation, since the parameters of the DQN being trained are updated more frequently and are therefore more accurate, so the action it selects is more likely to coincide with the action corresponding to the ground-truth optimal Q value. However, double DQN still has limitations: the DQN being trained and the target DQN are similar, so even if we use the DQN being trained to select the action, it may still pick the action with an overestimated optimal Q value in the target DQN for that state. Moreover, at the end of the training process the target DQN is already set with good parameters, while the parameters of the DQN being trained are still unstable after every batch update. The Q values of the target DQN are then more accurate, and it is more reasonable to use the estimated optimal Q value of the target DQN to represent the optimal state value of a state. Table 1.2 and Table 1.3 illustrate the effects and limitations of double DQN.

| State s | Epoch 5: target DQN Q values | Epoch 5: being-trained DQN Q values | Epoch 10: target DQN Q values | Epoch 10: being-trained DQN Q values | Ground truth: target DQN Q values |
|---|---|---|---|---|---|
| a1 | Q(s, a1) = 0.8 | Q(s, a1) = 0.88 | Q(s, a1) = 0.98 | Q(s, a1) = 0.98 | Q(s, a1) = 1.0 |
| a2 | Q(s, a2) = 1.2 | Q(s, a2) = 1.1 | Q(s, a2) = 1.1 | Q(s, a2) = 0.93 | Q(s, a2) = 0.8 |
| a3 | Q(s, a3) = 0.35 | Q(s, a3) = 0.55 | Q(s, a3) = 0.45 | Q(s, a3) = 0.5 | Q(s, a3) = 0.6 |
| Target DQN selected action | a2 | | a2 | | a1 |
| Being-trained DQN selected action | | a2 | | a1 (chosen) | |
| Estimated optimal state value from target DQN | Q*(s, a2) = V*(s) = 1.2 (chosen) | | Q*(s, a1) = V*(s) = 0.98 (chosen), Q*(s, a2) = V*(s) = 1.1 (not chosen) | | Q*(s) = V*(s) = 1.0 |

Table 1.2: Effects and limitations of double DQN for mitigating overestimation

| State s | Epoch 50: target DQN Q values | Epoch 50: being-trained DQN Q values | Ground truth: target DQN Q values |
|---|---|---|---|
| a1 | Q(s, a1) = 1.0 | Q(s, a1) = 0.98 | Q(s, a1) = 1.0 |
| a2 | Q(s, a2) = 0.96 | Q(s, a2) = 0.99 | Q(s, a2) = 0.8 |
| a3 | Q(s, a3) = 0.55 | Q(s, a3) = 0.6 | Q(s, a3) = 0.6 |
| Target DQN selected action | a1 | | a1 |
| Being-trained DQN selected action | | a2 (chosen) | |
| Estimated optimal state value from target DQN | Q*(s, a2) = V*(s) = 0.96 (chosen), Q*(s, a1) = V*(s) = 1.0 (not chosen) | | Q*(s) = V*(s) = 1.0 |

Table 1.3: Effects and limitations of double DQN for mitigating overestimation

Tables 1.2 and 1.3 allow an intuitive analysis of the effects and limitations of double DQN; the exact mathematical explanation follows in the background and model sections of this thesis. In epoch 10, because the parameters of the DQN being trained are updated more frequently, its Q value estimates are more accurate than those of the target DQN when compared with the ground-truth values. The target DQN therefore uses action a1, chosen by the DQN being trained, instead of its own choice a2, yielding the estimated state value 0.98; in this way double DQN mitigates overestimation. However, the limitations of double DQN show up in epochs 5 and 50. In epoch 5, the DQN being trained chooses action a2 for the target DQN, and the resulting Q value 1.2 is an overestimated state value for that state; the similarity of the two networks can thus still lead to overestimation. In epoch 50, the DQN being trained again chooses action a2 for the target DQN, yet the target DQN has more accurate Q value estimates thanks to its more accurate and stable parameters, so according to the ground truth it would be better to use the Q value of the target DQN's own choice a1 to represent the optimal state value.


Duel DQN (Wang et al., 2015) estimates Q values more accurately by using a two-stream duel network structure consisting of a state-value function and an advantage function. The duel DQN work shows that this structure can recognize the key action in each state and therefore estimates Q values more accurately than the conventional single-stream estimation. However, the max operator in the target component of the Bellman equation will still cause overestimation when the target network parameters are not yet trained well enough; that is its limitation. Table 1.4 clarifies this method.

| State s | Epoch 1: target DQN Q values | Epoch 1: target duel DQN Q values | Epoch 40: target DQN Q values | Epoch 40: target duel DQN Q values | Ground truth: target DQN Q values |
|---|---|---|---|---|---|
| a1 | Q(s, a1) = 0.8 | Q(s, a1) = 0.9 | Q(s, a1) = 0.95 | Q(s, a1) = 1.0 | Q(s, a1) = 1.0 |
| a2 | Q(s, a2) = 1.2 | Q(s, a2) = 1.1 | Q(s, a2) = 1.05 | Q(s, a2) = 0.78 | Q(s, a2) = 0.8 |
| a3 | Q(s, a3) = 0.2 | Q(s, a3) = 0.3 | Q(s, a3) = 0.5 | Q(s, a3) = 0.56 | Q(s, a3) = 0.6 |
| DQN selected action | a2 | | a2 | | a1 |
| Duel DQN selected action | | a2 | | a1 | a1 |
| Estimated optimal state value | Q*(s) = V*(s) = 1.2 | Q*(s) = V*(s) = 1.1 | Q*(s) = V*(s) = 1.05 | Q*(s) = V*(s) = 1.0 | Q*(s) = V*(s) = 1.0 |

Table 1.4: Effects and limitations of duel DQN for mitigating overestimation

Table 1.4 introduces the effects and limitations of duel DQN intuitively; the mathematical details are given in the background and model sections of this thesis. In epoch 40, duel DQN estimates Q values better than DQN does, so its maximum Q value approximates the optimal state value better, which mitigates overestimation. However, before the duel DQN parameters are set well enough, and since duel DQN does not change the training algorithm, the max operator in the Bellman equation still causes overestimation; epoch 1 in the table shows this situation.

Another method to mitigate overestimation is averaged DQN (Yu et al., 2018). Following a Monte Carlo idea, it estimates the Q value of each state-action pair by averaging the Q values produced by K previously trained networks. This reduces the variance of the estimated Q values, so they approach the ground-truth values and the max operator in the Bellman equation can make a more reasonable action choice for the optimal Q value that represents the state value. Table 1.5 illustrates the idea intuitively; the mathematical algorithm is explained in the background and model sections.

| State s | Epoch x: target DQN 1 Q values | Epoch x: target DQN 2 Q values | Epoch x: target DQN 3 Q values | Epoch x: averaged DQN Q values over K = 3 trained target networks | Ground truth: target DQN Q values |
|---|---|---|---|---|---|
| a1 | Q(s, a1) = 0.9 | Q(s, a1) = 0.95 | Q(s, a1) = 1.1 | Q(s, a1) = 0.98 | Q(s, a1) = 1.0 |
| a2 | Q(s, a2) = 1.0 | Q(s, a2) = 0.85 | Q(s, a2) = 0.75 | Q(s, a2) = 0.86 | Q(s, a2) = 0.8 |
| a3 | Q(s, a3) = 0.2 | Q(s, a3) = 0.4 | Q(s, a3) = 0.5 | Q(s, a3) = 0.37 | Q(s, a3) = 0.6 |
| Selected action | a2 | a1 | a1 | a1 | a1 |
| Estimated optimal state value | Q*(s) = V*(s) = 1.0 | Q*(s) = V*(s) = 0.95 | Q*(s) = V*(s) = 1.1 | Q*(s) = V*(s) = 0.98 | Q*(s) = V*(s) = 1.0 |

Table 1.5: Effects and limitations of averaged DQN for mitigating overestimation

Table 1.5 shows that the estimated Q values of averaged DQN are closer to the ground-truth Q values of that state, because averaging reduces their variance. However, since the average is taken over the outputs of K previously trained networks, the computational load is heavy.


1.2 Hypothesis

To address the above limitations, we need to overcome the overestimation of the optimal Q value in the target component of the Bellman equation before the target DQN is fully trained. Toward the final training stages, it is better to use the optimal Q value of the target DQN to represent the optimal state value of a state, since the target DQN is already set with a good parameter set. The computational demand should also stay low. We implement a new reinforcement learning training method called the dynamic partial average model (DPAV), together with several model variants, and compare their performance with existing baselines from the RL-based dialog policy literature.

Dynamic partial average deals with optimal Q value overestimation by taking a partial average between the maximum and the minimum Q value of the target DQN, with different allocated weights. This partial average represents the optimal state value in the target component of the Bellman equation when calculating the training loss. We change the assigned weights dynamically as training progresses by applying decay factors to the weights; to make training even more flexible, these decay factors are adaptive. The goal is that by the final stages of training, when the target DQN is already set with a good parameter set, the maximum Q value of the target DQN is used to represent the optimal state value. Under this hypothesis, the method mitigates overestimation and achieves a more accurate approximation of the optimal state value at the end of training, at low computational cost. Table 1.6 introduces the hypothesis intuitively; the exact mathematical algorithm is detailed in the model chapter of this thesis.

| State s | Epoch 1: target DQN Q values | Epoch 60: target DQN Q values | Ground truth: target DQN Q values |
|---|---|---|---|
| a1 | Q(s, a1) = 0.8 | Q(s, a1) = 0.98 | Q(s, a1) = 1.0 |
| a2 | Q(s, a2) = 1.4 | Q(s, a2) = 0.78 | Q(s, a2) = 0.8 |
| a3 | Q(s, a3) = 0.2 | Q(s, a3) = 0.6 | Q(s, a3) = 0.6 |
| DQN selected action | a2 | a1 | a1 |
| DPAV selected actions | a3, a2 (weight = 0.7) | a3, a1 (weight = 1) | |
| DQN estimated optimal state value | Q*(s) = V*(s) = 1.4 | Q*(s) = V*(s) = 0.98 | Q*(s) = V*(s) = 1.0 |
| DPAV estimated optimal state value | Q*(s) = V*(s) = 0.7×1.4 + 0.3×0.2 = 1.04 | Q*(s) = V*(s) = 1×0.98 + 0×0.6 = 0.98 | |

Table 1.6: Effects of dynamic partial average model for mitigating overestimation

Table 1.6 illustrates the effect of the dynamic partial average model intuitively; the exact mathematical explanation is derived in the model chapter. From the table, the ground-truth optimal Q value is 1.0, which represents the optimal state value. In epoch 1, the parameter set of the target DQN is inaccurate, so the estimated Q values are inaccurate. The DQN algorithm's max operator would choose action a2 with Q value 1.4, an overestimate of the ground-truth optimal Q value 1.0. The dynamic partial average model instead assigns weight 0.7 to the maximum Q value and weight 0.3 to the minimum Q value in epoch 1, yielding the partial average 1.04 as the optimal state value; in this case DPAV mitigates overestimation. In epoch 60, the target DQN is already set with good parameters, so its estimated Q values are more accurate, and we give weight 1 to the maximum Q value of the target DQN so that it represents the state value.
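The intuition in Table 1.6 can be written down as a target computation. The sketch below is a simplified illustration of the dynamic partial average target as described above; the linear, epoch-based weight schedule is an assumption for illustration only, while the exact adaptive decay factors are defined in the model chapter.

```python
import numpy as np

def dpav_target(reward, next_q_values, gamma, w_max):
    """Dynamic partial average target: a weighted mix of the maximum and
    minimum Q value of the target DQN for the next state."""
    q_max = np.max(next_q_values)
    q_min = np.min(next_q_values)
    partial_average = w_max * q_max + (1.0 - w_max) * q_min
    return reward + gamma * partial_average

# Hypothetical schedule: the weight on the maximum grows towards 1.0 so that,
# near the end of training, the target reduces to the standard DQN target.
def weight_schedule(epoch, total_epochs, w_start=0.7):
    return min(1.0, w_start + (1.0 - w_start) * epoch / total_epochs)

# Example reproducing the epoch-1 column of Table 1.6 (reward and discount left out there):
print(dpav_target(reward=0.0, next_q_values=[0.8, 1.4, 0.2], gamma=1.0, w_max=0.7))  # 1.04
```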

We run experiments in the context of a task completion dialogue system (Li et al., 2016; Rickel et al., 2000). We deploy our dynamic partial average model and its variants and compare their performance with existing RL-based dialog policy baselines in terms of dialog success rate after each training epoch on the validation set. To test the robustness of the dialog policy under different error granularities, we also run experiments with simulated errors originating from the natural language understanding module.


1.3 Contributions

In this thesis research, the main contributions are:

• A performance investigation of several popular reinforcement learning algorithms (DQN, duel DQN, averaged DQN, double DQN) deployed as the dialog policy of a task completion dialogue system.
• A partial average based method aimed at solving the overestimation problem of the optimal Q value.
• Dynamic training through a decay factor that shifts the target partial average towards the maximum Q value of the target DQN, since this is a better approximation of the optimal state value near the end of training.
• Variants of the dynamic partial average model, such as the duel DPAV model and averaged DQN with DPAV, and an investigation of their performance when deployed in the dialog policy module.
• A robustness investigation of all the above reinforcement learning based dialog policies in a task completion dialogue system.
• An investigation of the effect of the number of target networks in averaged DQN for the dialog policy module.

For each experiment topic, we run multiple experiments and average the results to make them more accurate and stable.


Chapter 2

Related work

2.1 Task completion dialogue system

2.1.1 Overview

Previous research has already produced a body of work on task completion dialogue systems (Chen et al., 2017). The DQN algorithm (Mnih et al., 2013) greatly increased the performance of reinforcement learning, and many advanced RL methods have since been proposed (Wang et al., 2015; Bellemare et al., 2017; Kim et al., 2019). Zhao (Zhao and Eskenazi, 2016a) first presented an end-to-end reinforcement learning approach to dialogue state tracking and policy learning in the dialog management of a task completion dialogue system (TDS).

As an important branch of spoken dialogue systems, the task completion dialogue system (TDS) has attracted increasing attention due to its commercial value and great potential. A task completion dialogue system aims to assist the user in finishing one task (e.g., finding medicine, booking flights, or renting an apartment). According to their structure, existing task completion dialogue systems can be categorized into two groups: (1) end-to-end dialogue systems (Chen et al., 2017; Constantin et al., 2019), which use a single module to interact with users and treat dialogue system learning as a mapping from dialogue history to system response in an end-to-end format; and (2) modularly connected pipeline structures (Zhang et al., 2019b,a). A pipeline-based dialogue system usually has four modules: the natural language understanding module first interprets the user utterance and forms the semantic frame; the dialog state tracker processes the semantic frame and represents it as an internal state; the dialog policy takes actions according to the dialogue state and an external knowledge base; and finally the natural language generation module converts dialog actions into natural language to converse with the user.

This thesis work is based on the pipeline structure and focuses in particular on the dialogue policy module. In this section, we mainly introduce progress on end-to-end systems, natural language understanding, dialog state tracking, and natural language generation; the dialog policy is detailed in the next section.

2.1.2 End to end dialogue system

The end-to-end dialogue system has been proposed in many research works (Luo et al., 2019; He et al., 2019), because the conventional pipeline-based model has two limitations: (1) the credit assignment problem: feedback from the user is hard to propagate to each upstream module, since each module is trained separately; and (2) module interdependence: to train and optimize the whole dialogue system, when one module is adapted to a new application domain and trained with new data, the other modules need to be retrained as well, and the domain-dependent slots and intents need to be renewed, which requires significant handcrafted effort.

Along with the advances in end-to-end neural network models in recent years, many attempts have been made to construct end-to-end trainable models. Instead of the conventional pipeline structure, the end-to-end dialogue system uses a single module to retrieve information and interact with the user. (Bordes et al., 2016) and (Wen et al., 2016b) propose end-to-end encoder-decoder training models that learn the mapping from dialogue history to system response. However, this kind of model is trained in a supervised learning format: it not only needs a lot of training data but also fails to find a robust policy due to a lack of exploration in the training data. To solve this problem, (Zhao and Eskenazi, 2016b) proposed to train the dialog state tracker and dialog policy jointly using reinforcement learning; by asking the user several yes/no questions, the dialog system can guess which person the user is thinking of. (Li et al., 2017) trains an end-to-end dialogue system to complete a task such as taxi booking.

2.1.3 Natural language understanding

Given a user utterance, the natural language understanding (NLU) module maps it into semantic slots that are pre-defined according to the scenario. Usually there are two kinds of representation. One is at the utterance level, where the NLU module classifies the utterance category and user intent. The other is at the word level, where the utterance is labeled with slot labels and information is extracted for slot filling.

Intent detection classifies the user utterance into a predefined intent type. Deep learning models (Serdyuk et al., 2018; Zhang and Wang, 2016) have been applied to intent detection successfully. (Firdaus et al., 2018) uses a convolutional neural network (CNN) to generate utterance vector representations as feature patterns for intent classification, and (Shen et al., 2014) uses a CNN-based model to classify the intent type. Similar methods can also be used for utterance domain classification.

Slot filling is another task of the natural language understanding (NLU) module. It is usually defined as a sequence labeling problem, where the words in the sentence are labeled with semantic labels: the input is a sequence of words and the output is a sequence of slot labels. (Deoras and Sarikaya, 2013) uses a deep belief network and achieves superior results compared with traditional conditional random field baselines, while (Kurata et al., 2016; Zhu and Yu, 2017) apply recurrent neural network based models to sequence slot labeling.

2.1.4 Dialog state tracker

The dialog state tracker is responsible for constantly representing the dialog state. Some industrial dialogue systems use heuristic rule-based methods to fill in the slot-value pairs of the dialog state tracker by analyzing high-confidence output from the NLU module (Henderson et al., 2013). Many statistical methods (Gašić et al., 2010; Zilka and Jurcicek, 2015) have been proposed to analyze the correlations between turns given ambiguous or uncertain NLU output.

The dialogue state tracking challenge (Henderson et al., 2013) describes the problem as a supervised sequential labeling task in which the dialog state tracker fills in slot-value pairs based on the sequential outputs of the NLU module. In practice, the dialog policy is trained on the state tracker output, and the probability distribution of the training data differs from that of live data (Henderson et al., 2013), so the performance of the dialog state tracker affects the dialog policy performance. (Lee, 2014) shows a positive correlation between state tracker performance and dialog model performance.

2.1.5 Natural language generation

The natural language generation (NLG) module converts the dialog action into natural language for the interface. It maps the semantic frame into an intermediate utterance representation, such as a template structure, which is then converted into natural language during surface realization.

(Gao et al., 2019) and (Wen and Young, 2019) introduced long short-term memory (LSTM) based models for the NLG module; the dialog act type and slot-value pairs are provided as a vector input to the NLG so that it can generate the intended meaning. (Wen et al., 2015a) uses a CNN reranker together with a forward RNN generator and a backward RNN reranker, where all sub-modules are trained jointly to generate utterances conditioned on the dialog action.

2.2 Dialogue policy learning

The dialog policy is an essential component of the dialogue system and largely decides the dialogue quality. It generates the next system action according to the state representation from the state tracker. There are multiple ways to build a dialogue policy, such as rule-based, planning-based, statistical, and reinforcement learning based methods. Because reinforcement learning performs exploration, it can explore multiple trajectories in the training data to find a robust dialogue policy. The research in this thesis is based on reinforcement learning based dialog policies.

(Zue et al., 1992) proposed a rule-based dialog policy that combines dialog actions with dialog statuses into specific sequences; the dialog manager operates on the dialog status by pushing it onto and popping it off a status stack together with the corresponding dialog actions. Although this method is stable in a specific application domain, it is not flexible, since all the rules are fixed and the user cannot answer or query outside the fixed patterns. Moreover, when the dialogue system is adapted to a new domain, all the rules need to be renewed, which requires heavy manual effort.

To address this inflexibility, (Pinault and Lefevre, 2011) uses a statistical probabilistic model to build the dialog policy: a state-action probability matrix is constructed so that for each state there is a probability distribution over all candidate actions. However, this method is not effective when the state-action space is large.

Zhao (Zhao and Eskenazi, 2016a) first proposed to optimize the dialogue policy with reinforcement learning (RL), achieving superior results compared with baselines. Peng (Peng et al., 2018a) proposes an adversarial advantage actor-critic model that recovers a reasonable reward function from a human demonstration corpus to complement the predefined reward function; however, the model is not efficient for complex tasks that contain several sub-tasks to be achieved collectively. (Tang et al., 2018) combines RL with hierarchical task decomposition to split the task into multiple granularities, so that at each step the dialog agent only focuses on an easier sub-task. Experiments show that both of the above methods can learn a pragmatic policy.

A model-based RL approach named deep Dyna-Q (Peng et al., 2018b) incorporates a world model into the dialogue system to mimic user responses, so that training the dialog policy becomes more efficient.

In this thesis, we investigate model-free RL dialog policies, propose the dynamic partial average mechanism, and compare the model and its variants with baselines on dialog performance. Compared with double DQN, for the target component of the Bellman equation our model selects action candidates according to the Q values of the target DQN instead of the DQN being trained. Duel DQN mainly improves Q value approximation accuracy but does not change the training methodology, whereas the dynamic partial average mechanism upgrades the training algorithm. Averaged DQN uses the average of K previous Q values of a state-action pair, taken from K previously trained DQNs, to estimate the target component of the Bellman equation. Our dynamic partial average also uses a partial average, between the maximum and minimum Q value for a state, to represent the target component of the Bellman equation, but both Q values are taken from the same target DQN.


Chapter 3

Background

3.1 Reinforcement learning (RL)

3.1.1 Reinforcement learning progression

The purpose of reinforcement learning (RL) (Sutton et al., 1998) is to learn a policy that solves sequential decision making problems by optimizing the cumulative future environment reward. Q-learning (Watkins, 1989) is one of the best-known reinforcement learning algorithms. However, it is also known for overestimation, since it chooses the maximum Q value over the state-action pairs of every state and evaluates the deep Q-network (DQN) being trained with that maximum Q value. Earlier research (Thrun and Schwartz, 1993) attributed the overestimation problem to insufficient function approximation, and whether overestimation causes negative effects was not settled. Mnih (Mnih et al., 2015) proposed a new reinforcement learning algorithm that reached human-level performance in game playing despite the overestimation problem, and related research (Kaelbling et al., 1996) notes that overestimation can also act as a form of exploration in RL. Subsequent research (Van Hasselt et al., 2016) answered affirmatively that the overestimation problem harms performance, based on experiments on Atari 2600 games.

Hado (Van Hasselt et al., 2016) proposed a new reinforcement learning training algorithm called double DQN (DDQN). The idea was first proposed in the tabular setting and then generalized to arbitrary function approximation, including deep neural networks. Wang (Wang et al., 2015) proposes a new Q value approximation network structure called duel DQN, which uses two streams instead of the conventional single stream and can notice the key action in a given state by constructing an advantage network; since duel DQN estimates Q values more accurately, it mitigates the Q value overestimation problem. Mnih (Mnih et al., 2016) then proposed a conceptually simple and lightweight framework for deep reinforcement learning that uses asynchronous gradient descent to optimize deep neural network controllers; experiments on Atari games show that the asynchronous gradient descent variant of the actor-critic model surpasses many RL algorithms. (Anschel et al., 2017) uses the average of K previous Q values to represent the estimated Q value of each state-action pair; averaging reduces variance, making each Q value estimate more accurate and thereby also mitigating overestimation.

In this work, our proposed dynamic partial average mechanism aims at solving the overestimation problem.

3.1.2 Reinforcement learning for dialogue model

Given its promising performance, reinforcement learning has become a popular method for learning the dialog policy of a task completion dialogue system (Lee and Eskenazi, 2012; Georgila and Traum, 2011). The dialog policy can be modeled as a partially observable Markov decision process (POMDP), which takes the uncertainty in both user goals and the NLU module into account. Previous research (Williams and Young, 2007) shows that POMDP-based systems perform much better than rule-based dialog policies, especially under NLU errors. Wen (Wen et al., 2016b) proposed a network-based trainable dialogue system, using a new variant of the encoder-decoder model to learn the mapping from dialogue history to system response. Inspired by this idea, Zhao (Zhao and Eskenazi, 2016b) used a deep Q-network to learn the dialog strategic plan, which was one step forward in this area. Peng (Peng et al., 2018b) first incorporated planning into the dialog agent training process by building a world model to simulate real user experience; by training the dialog policy on both real and simulated user experience, the model reduces the negative effects of the user simulator's limited language complexity. In this research work, we upgrade the deep Q-network training algorithm by implementing the dynamic partial average mechanism in the Bellman equation.

3.1.3 Markov decision process

Most reinforcement learning (RL) models are based on the Markov decision process (MDP). An MDP is a tuple (S, A, P, γ, R), where S is a set of states; A is a set of all possible actions; P defines the transition probability P(s'|s, a); R defines the immediate reward R(s, a) for every state-action pair; and γ is the discount factor for future rewards. The objective of reinforcement learning is to find the optimal policy π* that maximizes the expected cumulative reward (Sutton et al., 1998). A Markov decision process usually assumes full observability of the state of the world, which is rarely realistic in real application domains, so the partially observable Markov decision process (POMDP) is used instead; it also accounts for uncertainty in the state variable. A POMDP is a tuple (S, A, P, γ, R, O, Z), where O is a set of observations and Z defines an observation probability P(o|s, a); the other variables are the same as in the MDP. Solving a POMDP requires computing the belief state b(s), the probability distribution over all possible states, with ∑_s b(s) = 1. Monahan (Monahan, 1982) has shown that the belief state is sufficient for optimal control, so our objective becomes finding π*: b → a that maximizes the future return.

3.1.4 Q learning

To solve a sequential decision problem, we can learn value estimates for each action, defined as the expected sum of future rewards when taking that action under a certain policy. Under a given policy π, the value of an action a in state s is

Q^π(s, a) = E[R_1 + γR_2 + ... | S_0 = s, A_0 = a, π],

where γ ∈ [0, 1] is a discount factor that trades off the importance of immediate and later rewards. For each state-action pair, the optimal Q value is Q*(s, a) = max_π Q^π(s, a), and the optimal policy is easily derived from the optimal Q values by selecting the highest-valued action in each state. Estimates of the optimal state-action values can be learned with Q-learning (Watkins, 1989), a form of temporal difference learning (Sutton et al., 1998). For many interesting problems, however, it is not possible to learn the value of every action in every state separately, so current research mainly learns a parameterized value function Q(s, a; θ_t). The standard Q-learning update is applied after taking action A_t in S_t and transitioning into S_{t+1} with immediate reward R_{t+1}:

θ_{t+1} = θ_t + α (Y_t^Q − Q(S_t, A_t; θ_t)) ∇_{θ_t} Q(S_t, A_t; θ_t),   (3.1)

where α is a scalar step size and Y_t^Q is defined as

Y_t^Q = R_{t+1} + γ max_a Q(S_{t+1}, a; θ_t).   (3.2)

This parameter update resembles stochastic gradient descent; the objective is to move Q(S_t, A_t; θ_t) towards Y_t^Q.
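As a concrete illustration of Equations (3.1) and (3.2), the sketch below performs one Q-learning update for a linear value function; the feature map, step size, and toy vectors are arbitrary assumptions made for the example.

```python
import numpy as np

def q_value(theta, features):
    """Linear approximation Q(s, a; theta) = theta . phi(s, a)."""
    return float(theta @ features)

def q_learning_step(theta, phi_sa, reward, next_phis, alpha, gamma):
    """One update of Eq. (3.1) with the target of Eq. (3.2).

    phi_sa    : feature vector of the taken (S_t, A_t)
    next_phis : feature vectors of (S_{t+1}, a) for every action a
    """
    target = reward + gamma * max(q_value(theta, phi) for phi in next_phis)  # Y_t^Q
    td_error = target - q_value(theta, phi_sa)
    # For a linear model, grad_theta Q(S_t, A_t; theta) = phi(S_t, A_t)
    return theta + alpha * td_error * phi_sa

# Toy usage with an assumed 4-dimensional feature map
theta = np.zeros(4)
phi_sa = np.array([1.0, 0.0, 0.5, 0.0])
next_phis = [np.array([0.0, 1.0, 0.0, 0.2]), np.array([0.3, 0.0, 0.0, 1.0])]
theta = q_learning_step(theta, phi_sa, reward=1.0, next_phis=next_phis, alpha=0.1, gamma=0.9)
print(theta)
```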

3.1.5 Prioritized replay

Prioritized replay was proposed as an innovation that further improves the state of the art. The key idea is to increase the replay probability of experience tuples with high expected learning progress, measured by the absolute TD error. This leads both to better learned policy quality and to faster learning.
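A minimal sketch of that idea, assuming the common proportional-prioritization variant and an arbitrary priority exponent, makes sampling probabilities proportional to the absolute TD error:

```python
import numpy as np

def sample_prioritized(td_errors, batch_size, alpha=0.6, eps=1e-6, rng=None):
    """Sample replay indices with probability proportional to |TD error|^alpha."""
    rng = rng or np.random.default_rng()
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    return rng.choice(len(td_errors), size=batch_size, p=probs, replace=False)

# Transitions with large TD error are replayed more often.
idx = sample_prioritized(np.array([0.05, 2.0, 0.3, 1.2]), batch_size=2)
print(idx)
```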


3.2 Task completion dialogue system

3.2.1 Natural language understanding

The natural language understanding (NLU) module (Kurata et al., 2016) is usually a recurrent neural network composed of several long short-term memory (LSTM) cells. The NLU module of (Hakkani-Tür et al., 2016) performs slot-value filling and intent prediction simultaneously. To implement the joint modeling of slots and intent, the predicted tag sequence is a concatenation of IOB-format slot tags and an intent tag: for each utterance, an additional token <EOS> is appended at the end, whose supervised tag is the intent tag, while the supervised tags of the preceding words are all IOB-format slot tags. A sequence-to-sequence training method can then be used; the last hidden state of the utterance sequence contains a condensed semantic representation of the whole utterance and can therefore be used for intent prediction.

The popular IOB (in-out-begin) format is used to represent slot tags; an example is shown in the following figure:

Figure 3.1: IOB format: an example utterance with annotations of slot tags and intent tag. (Image from (Li et al.,2017))

The task of the NLU is to classify the user utterance into domain-specific intents and to fill in the related slots to form the resulting semantic frame.

~x = w_1, ..., w_n, <EOS>,   (3.3)

~y = s_1, ..., s_n, i_m,   (3.4)

where ~y contains the allocated slot tags s_k and the intent tag i_m, and ~x is the input sentence sequence. The NLU module (Hakkani-Tür et al., 2016) is based on an LSTM,

~y = LSTM(~x).   (3.5)

Given the utterance sequence ~x, the objective of the NLU is to maximize the conditional probability of the intent and slots ~y:

p(~y | ~x) = ( ∏_{i=1}^{n} p(s_i | w_1, ..., w_i) ) · p(i_m | ~y).   (3.6)

The weights of the LSTM can be trained with backpropagation to maximize the conditional likelihood. Since the predicted tag sequence is a concatenation of IOB-format slot tags and the intent tag, the model can be trained in a supervised way.
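To make the joint formulation concrete, the snippet below builds the input and target sequences of Equations (3.3) and (3.4) for a hypothetical movie-domain utterance; the words, tags, and intent label are illustrative only and not taken from the thesis dataset.

```python
# Build the joint NLU training pair: IOB slot tags for every word,
# and the intent tag aligned with the <EOS> token.
words = ["find", "action", "movies", "this", "weekend"]
slot_tags = ["O", "B-genre", "O", "B-date", "I-date"]
intent = "request_movie"

x = words + ["<EOS>"]        # Eq. (3.3): w_1, ..., w_n, <EOS>
y = slot_tags + [intent]     # Eq. (3.4): s_1, ..., s_n, i_m

for token, tag in zip(x, y):
    print(f"{token:>10}  ->  {tag}")
```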

3.2.2 State tracker

Given the output of the NLU module, for example request(starttime; date=this weekend; genre=action), three operations are executed (a sketch of the resulting state follows this list):

• A symbolic query is formed to retrieve the available results from the database.
• The status of the state tracker is updated according to the available database results and the last user dialog action.
• Based on the updated status, the state tracker prepares the state representation s_t for the subsequent dialog policy. The state representation usually incorporates the latest agent action (e.g., request(number of people)), the retrieved database results, the latest user action (e.g., request(starttime, date=this weekend, genre=action)), turn information, etc.
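A minimal sketch of how such a state representation s_t might be assembled is shown below; the field names and encoding are hypothetical and do not reflect the exact representation used by the simulator.

```python
def build_state(last_user_action, last_agent_action, kb_results, turn):
    """Assemble a dialog-state dictionary for the dialog policy (illustrative only)."""
    return {
        "user_action": last_user_action,      # e.g. request(starttime, date=..., genre=...)
        "agent_action": last_agent_action,    # e.g. request(number_of_people)
        "kb_result_count": len(kb_results),   # summary of the symbolic database query
        "turn": turn,
    }

state = build_state(
    last_user_action={"intent": "request", "slots": {"date": "this weekend", "genre": "action"}},
    last_agent_action={"intent": "request", "slots": {"number_of_people": "?"}},
    kb_results=[{"moviename": "example"}],
    turn=3,
)
print(state)
```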


3.2.3 Dialog policy

The dialog policy is an essential module of a task completion dialogue system; it selects the optimal dialog action according to the dialogue state and the information in the knowledge base. In this thesis we use an RL-based dialog policy. This section introduces the baseline algorithms used in this thesis: DQN, double DQN, duel DQN, and averaged DQN.

Deep Q-network (DQN)

The value functions discussed in the previous section are usually high dimensional. To approximate them more accurately, a deep Q-network Q(s, a; θ) with parameters θ is used. To learn this network, we optimize the following loss function during training:

L(θ) = E_{s,a,r,s'} [ (y^{DQN} − Q(s, a; θ))^2 ],   (3.7)

with

y^{DQN} = r + γ max_{a'} Q(s', a'; θ^-),   (3.8)

where θ^- are the parameters of the target network, which are updated with the parameters of the DQN being trained after every several iterations. We could also learn the parameters θ of the DQN online using standard Q-learning, but this performs poorly in practice. An important innovation (Nair et al., 2015) is to freeze the parameters of the target network for a fixed number of iterations while updating the parameters of the online network Q(s, a; θ) by gradient descent; this greatly improves the stability of the algorithm.

The specific update rule is

∇_θ L(θ) = E_{s,a,r,s'} [ (y^{DQN} − Q(s, a; θ)) ∇_θ Q(s, a; θ) ],   (3.9)

This update is model free, since all rewards and subsequent states come from the environment. It is also an off-policy method, since decisions are made under a behavior policy (the ε-greedy policy) that differs from the online policy being learned.

Another key technique behind the success of DQN is experience replay (Lin, 1993). During training, the agent accumulates a dataset D_t = {e_1, e_2, ..., e_t} of experiences e_t = (s_t, a_t, r_t, s_{t+1}) from many episodes. Instead of using only the current experience, as in plain temporal difference learning, the network is trained by sampling mini-batches uniformly at random from these experiences. The loss then takes the form

L(θ) = E_{(s,a,r,s') ∼ U(D)} [ (y^{DQN} − Q(s, a; θ))^2 ].   (3.10)

First, this improves data efficiency, since data samples are reused in multiple updates. Second, training is less biased, because uniform sampling from the replay buffer significantly reduces the correlation between the samples used in the online DQN training. The algorithm is summarized as follows:

Algorithm 1: DQN

Initialize replay memory D to capacity N
Initialize action-value function Q with random weights
for episode = 1, ..., M do
    Initialize state s_1
    for t = 1, ..., T do
        With probability ε select a random action a_t, otherwise select a_t = argmax_a Q(s_t, a; θ)
        Execute action a_t in the environment, observe reward r_t and next state s_{t+1}
        Store transition (s_t, a_t, r_t, s_{t+1}) in D
        Set s_t = s_{t+1}
        Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
        Set y_j = r_j for terminal s_{j+1}
        Set y_j = r_j + γ max_{a'} Q(s_{j+1}, a'; θ^-) for non-terminal s_{j+1}
        Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))^2 to update θ
        Replace target parameters θ^- ← θ after every L iterations
    end
end
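A minimal PyTorch-style sketch of the inner update of Algorithm 1 (the target of Eq. (3.8) and the loss of Eq. (3.10)) is given below; the network sizes and hyperparameters in the comments are placeholders, not the values used in this thesis.

```python
import torch
import torch.nn.functional as F

def dqn_update(online_net, target_net, optimizer, batch, gamma=0.9):
    """One gradient step on a sampled minibatch (s, a, r, s', done)."""
    s, a, r, s_next, done = batch
    q_sa = online_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; theta)
    with torch.no_grad():
        max_next_q = target_net(s_next).max(dim=1).values        # max_a' Q(s', a'; theta^-)
        y = r + gamma * (1.0 - done) * max_next_q                # Eq. (3.8), zero for terminal s'
    loss = F.mse_loss(q_sa, y)                                   # Eq. (3.10)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example wiring with placeholder networks (state dim 8, 4 actions):
# online = torch.nn.Linear(8, 4); target = torch.nn.Linear(8, 4)
# opt = torch.optim.Adam(online.parameters(), lr=1e-3)
```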

Double DQN

The previous sections described the essential components of DQN as proposed in (Mnih et al., 2015). Double Q-learning (Hasselt, 2010) was proposed to solve the overestimation problem: in Q-learning and DQN, the target network uses the same Q values to both select and evaluate an action, which leads to overestimation. To mitigate this, the following formula is used for the target component of the Bellman equation:

y^{DDQN} = r + γ Q(s', argmax_{a'} Q(s', a'; θ); θ^-).   (3.11)

Double DQN is thus similar to DQN, but it replaces the target y^{DQN} with y^{DDQN}. The double DQN algorithm is summarized as follows:

Algorithm 2: Double DQN

Initialize replay memory D to capacity N
Initialize action-value function Q with random weights
for episode = 1, ..., M do
    Initialize state s_1
    for t = 1, ..., T do
        With probability ε select a random action a_t, otherwise select a_t = argmax_a Q(s_t, a; θ)
        Execute action a_t in the environment, observe reward r_t and next state s_{t+1}
        Store transition (s_t, a_t, r_t, s_{t+1}) in D
        Set s_t = s_{t+1}
        Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
        Set y_j = r_j for terminal s_{j+1}
        Set y_j = r_j + γ Q(s_{j+1}, argmax_{a'} Q(s_{j+1}, a'; θ); θ^-) for non-terminal s_{j+1}
        Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))^2 to update θ
        Replace target parameters θ^- ← θ after every L iterations
    end
end
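The only change relative to Algorithm 1 is the target; a sketch of Eq. (3.11) in the same PyTorch style, with the decoupled selection (online network) and evaluation (target network):

```python
import torch

def double_dqn_target(online_net, target_net, r, s_next, done, gamma=0.9):
    """y^DDQN = r + gamma * Q(s', argmax_a' Q(s', a'; theta); theta^-)."""
    with torch.no_grad():
        best_actions = online_net(s_next).argmax(dim=1, keepdim=True)   # select with online net
        q_eval = target_net(s_next).gather(1, best_actions).squeeze(1)  # evaluate with target net
        return r + gamma * (1.0 - done) * q_eval
```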


Duel DQN

To estimate Q values more accurately, (Wang et al., 2015) proposes a new network structure with two streams that separately estimate the advantage of each action in a given state and the (scalar) state value. The two streams are then combined to represent the Q value of the state-action pair for every possible action.

Figure 3.2: Duel DQN: network structures of conventional Q value estimation and duel DQN Q value estimation (image from (Wang et al., 2015))

Duel DQN does not change the reinforcement learning training algorithm; it only changes the Q value estimation. It is therefore flexible and can be combined with any existing or future reinforcement learning algorithm. The duel DQN Q value estimate is

Q(s, a; θ, α, β) = V(s; θ, β) + ( A(s, a; θ, α) − (1/|A|) ∑_{a'} A(s, a'; θ, α) ),   (3.12)

where θ denotes the parameters of the preceding shared network modules, while α and β are the parameters of the advantage module and the state-value module respectively.
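Equation (3.12) corresponds to the following aggregation of the two streams; the layer sizes below are placeholders chosen only for the sketch.

```python
import torch
import torch.nn as nn

class DuelQNet(nn.Module):
    """Two-stream Q estimation: Q = V + (A - mean_a A), as in Eq. (3.12)."""
    def __init__(self, state_dim=16, n_actions=4, hidden=64):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s; theta, beta)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a; theta, alpha)

    def forward(self, s):
        h = self.shared(s)
        v = self.value(h)
        a = self.advantage(h)
        return v + (a - a.mean(dim=1, keepdim=True))

q = DuelQNet()(torch.randn(2, 16))   # Q values for a batch of two states
print(q.shape)                        # torch.Size([2, 4])
```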

Averaged DQN

The averaged DQN model is an extension of the conventional DQN. It uses the average of K previously learned Q value estimates to estimate the Q value of the current state-action pair, and it mitigates overestimation because averaging brings the estimated Q value closer to the real value. Compared with conventional DQN, the extra computational effort is K times more forward passes through the deep Q-network; backpropagation does not go through the K estimation networks but only through the network currently being trained. The output of the method is the average over the outputs of the K previously learned networks. The algorithm is summarized as follows:

Algorithm 3: Averaged DQN

Initialize replay memory D to capacity M
Initialize action-value function Q with random weights
Initialize exploration procedure Explore(·)
for i = 1, ..., N do
    Q^A_{i−1}(s') = (1/K) ∑_{k=1}^{K} Q(s'; θ_{i−k})
    y^i_{s,a} = E_B[ r + γ max_{a'} Q^A_{i−1}(s', a') | s, a ]
    θ_i ≈ argmin_θ E_D[ (y^i_{s,a} − Q(s, a; θ))^2 ]
    Explore(·), update D
end
output: Q^A_N(s, a) = (1/K) ∑_{k=0}^{K−1} Q(s, a; θ_{N−k})
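The averaging step of Algorithm 3 can be written as follows; the sketch assumes the K snapshot networks are simply kept in a Python list, which is an implementation assumption rather than the thesis's exact setup.

```python
import torch

def averaged_q(snapshot_nets, s_next):
    """Q^A(s', .) = (1/K) * sum_k Q(s'; theta_{i-k}) over K previously learned networks."""
    with torch.no_grad():
        return torch.stack([net(s_next) for net in snapshot_nets]).mean(dim=0)

def averaged_dqn_target(snapshot_nets, r, s_next, done, gamma=0.9):
    """Target of Algorithm 3: r + gamma * max_a' Q^A(s', a')."""
    max_q = averaged_q(snapshot_nets, s_next).max(dim=1).values
    return r + gamma * (1.0 - done) * max_q
```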


3.2.4 Natural language generation

Given the dialog actions, the natural language generation (NLG) module generates natural language text. Due to limited language generation ability and limited labeled training data, a purely model-based NLG module does not generalize well and could introduce external noise into the subsequent dialog policy training. To improve performance, a hybrid approach is therefore used, which consists of:

• Template-based NLG, which generates rule-based language from existing templates if the dialog act is covered by the predefined templates.

• Model-based NLG, which is trained in a sequence-to-sequence fashion: it takes in the dialog action and generates a template-like sentence with placeholders using an LSTM decoder; a post-processing method (Wen et al., 2016a) then replaces the placeholders with real values.

To ensure that the generated sentence sketch is logical, beam search is applied: when generating the next token, the NLG module iteratively keeps the top k best partial sentences. To trade off computation speed against performance, we choose k = 3. In this hybrid approach, if the dialog action can be matched to a predefined rule-based language template, the template NLG generates the utterance; otherwise, the model-based NLG generates the utterance from the dialog act input, and the placeholders are then filled with values from the dialog action.
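A generic beam-search loop of the kind described above (beam width k = 3) is sketched below; `next_token_scores` stands in for the LSTM decoder's per-token log-probabilities and is a hypothetical interface, not the actual decoder used in this work.

```python
import math

def beam_search(next_token_scores, start_token, end_token, beam_width=3, max_len=20):
    """Keep the top-k partial sentences ranked by accumulated log-probability."""
    beams = [([start_token], 0.0)]                      # (token sequence, log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == end_token:
                candidates.append((seq, score))         # finished hypothesis is kept as-is
                continue
            for token, logp in next_token_scores(seq):  # decoder hook (assumed)
                candidates.append((seq + [token], score + logp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(seq[-1] == end_token for seq, _ in beams):
            break
    return beams[0][0]

# Toy decoder that proposes two continuations per step, then stops.
def toy_scores(seq):
    return [("word", math.log(0.6)), ("<eos>", math.log(0.4))] if len(seq) < 3 else [("<eos>", 0.0)]

print(beam_search(toy_scores, "<go>", "<eos>"))
```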


Chapter 4

Model

4.1 Overview

The dialog policy is an essential component of a task-completion dialogue system and affects the performance of the whole system. Based on the state representation s_t, the dialog policy makes a decision a_t according to π(s_t). Reinforcement learning (RL) can be used to optimize the policy π. Before the RL dialog agent is trained well enough, a rule-based dialog policy is employed at the early stage to warm-start the system. The experience tuples (s_t, a_t, r_t, s_{t+1}) gathered from the rule-based dialog agent are used to train the RL-based dialog agent until the RL agent's performance surpasses that of the rule-based agent. The original experience replay pool is then flushed and refilled with experience tuples from the RL-based dialog agent.

The reinforcement learning based dialog policy is usually represented as a deep Q-network (Mnih et al., 2015) that takes s_t from the state tracker and outputs Q(s_t, a; θ) for all possible actions. During training, two major tricks are used: a target network and experience replay. We use an experience replay buffer with a dynamically changing size and ε-greedy exploration. At each simulation epoch, we first simulate multiple dialogue turns and add the resulting state-transition tuples (s_t, a_t, r_t, s_{t+1}) to the experience replay buffer for the subsequent training steps. Within one simulation epoch, the parameters of the current DQN are updated multiple times, depending on the batch size and the size of the replay buffer, while the target DQN is updated only once at the end of the epoch.

Experience replay is a critical strategy in reinforcement learning. Under this strategy we accumulate experience tuples from each simulation epoch, during which we run multiple dialogues on the user simulator. If the performance of the current DQN surpasses the former best performance of the trained DQN, we flush the experience replay buffer and refill it with the latest experience tuples from the current DQN.
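
A compact sketch of this flushing behaviour is shown below, assuming the dialogue success rate is used as the performance measure; the class and method names are hypothetical.

import random
from collections import deque

class FlushableReplayBuffer:
    """Experience replay buffer that is flushed and refilled whenever the current
    DQN beats the best success rate seen so far during training."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)
        self.best_success_rate = float("-inf")

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

    def maybe_flush(self, current_success_rate):
        # flush when the current agent outperforms the former best policy,
        # so that only experience from the stronger policy is kept afterwards
        if current_success_rate > self.best_success_rate:
            self.best_success_rate = current_success_rate
            self.buffer.clear()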

The conventional DQN training algorithm uses the same target DQN to both select and evaluate actions, which causes the overestimation problem (see the Introduction of this thesis for more details). Several algorithms, such as duel DQN, double DQN and averaged DQN, can mitigate the overestimation problem, but each has its own limitations.

In this thesis, we investigate the performance of DQN, double DQN, duel DQN and averaged DQN. We also propose our own training model, the dynamic partial average model, which achieves better performance than several baseline models, and we investigate variants of the dynamic partial average (DPAV) model such as duel DPAV, averaged DPAV and duel averaged DQN.

4.2 Dynamic partial average (DPAV)

4.2.1 Motivation and hypotheses

In this section, we introduce the mathematical motivation for the dynamic partial average model. First, the Q value of a state-action pair can be represented as the expectation of all (discounted) rewards collected from the current state onward:

Q(s, a) = \mathbb{E}\Big[ \sum_{k=0}^{\infty} \gamma^{k} R_{t+k} \,\Big|\, S_t = s, A_t = a \Big], \quad (4.1)

where γ is the discount factor for future rewards, since immediate rewards should be more valuable. The value of a state is the expectation of the Q values of all possible state-action pairs in that state:

V(s) = \mathbb{E}_{a \sim \pi(s)}\big[ Q(s, a) \big]. \quad (4.2)

According to the Bellman equation, the Q value of an action in a given state can be expressed in terms of the values of the possible successor states:

Q(s, a) = \mathbb{E}_{s'}\big[ R_{s}^{a} + \gamma V(s') \,\big|\, S_t = s, A_t = a \big], \quad (4.3)

where this Q(s, a) serves as the target component in the loss computation.

When we execute the trained model, we always follow a greedy policy, i.e., we choose the action with the maximum Q value in each state. The conventional DQN algorithm assumes that the Q value of every state-action pair is estimated accurately, so the optimal state value equals the maximum Q value over all candidate actions in that state. The formula is written as follows:

V(s) = \max_{a} Q(s, a), \quad (4.4)

then we have:

Q(s, a) = \mathbb{E}_{s'}\big[ R_{s}^{a} + \gamma V(s') \,\big|\, S_t = s, A_t = a \big] = \mathbb{E}_{s'}\big[ R_{s}^{a} + \gamma \max_{a'} Q(s', a') \,\big|\, S_t = s, A_t = a \big], \quad (4.5)

and the loss function can be written as:

L(\theta) = \mathbb{E}_{s,a,R,s'}\Big[ \big( y^{DQN} - Q(s, a; \theta) \big)^{2} \Big], \quad (4.6)

where

y^{DQN} = R_{s}^{a} + \gamma \max_{a'} Q(s', a'; \theta^{-}). \quad (4.7)

However, overestimation arises because these formulas rest on the assumption that the Q value estimates are accurate. In practice they are not: the parameterized Q value estimates remain imperfect throughout the training process, so the maximum operator tends to pick overestimated Q values and injects a biased target into the loss computation.

In order to solve the overestimation problem, double DQN (Van Hasselt et al., 2016) changes the training process by adjusting the target component of the loss function. The updated formula is as follows:

y^{DDQN} = R_{s}^{a} + \gamma\, Q\big( s', \arg\max_{a'} Q(s', a'; \theta); \theta^{-} \big), \quad (4.8)

which decouples action selection from action evaluation by using both the target DQN and the DQN being trained. However, this algorithm also has limitations: (1) since the target DQN is updated from the trained DQN every few iterations, the two networks remain similar, so even though the trained DQN performs the action selection, it may still pick an action whose Q value is overestimated in the target DQN; (2) towards the end of training, the target DQN already holds a good parameter set, while the parameters of the DQN being trained still fluctuate after every batch. It is then more reasonable to use the maximum Q value of the target DQN to represent the state value, because the Q value of the action chosen by the still-unstable trained DQN cannot estimate the state value accurately.

The root cause of the overestimation problem in conventional DQN training is that the Q value estimation is not accurate. Based on this observation, duel DQN (Wang et al., 2015) proposes a more accurate Q value estimator that constructs two streams, a state-value stream and an advantage stream, to estimate the Q value. The mathematical formula is as follows:

Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big( A(s, a; \theta, \alpha) - \frac{1}{|A|} \sum_{a'} A(s, a'; \theta, \alpha) \Big). \quad (4.9)

For each state, duel DQN can recognize which actions are more valuable, which makes the Q value estimation more accurate. However, during training, before the parameters are good enough, the max operator in the Bellman equation may still select an overestimated Q value for a state and feed it into the loss computation, causing the overestimation problem.


Averaged DQN is another scheme to mitigate the overestimation problem. Its core idea is to use the average of the Q values of the K previously trained networks to estimate the Q value of each state-action pair. In this way, the variance of the Q value estimate is reduced, so the estimate becomes more accurate. The averaged DQN algorithm is Algorithm 3 in the background section of this thesis.

However, the overestimation problem caused by the max operator in the loss function still exists to some extent. In addition, the computational complexity is higher than that of the two methods above.

Based on the above limitations of existing models, we formulate the following hypotheses:

• A partial average between the maximum and minimum Q values over all actions of a specific state could be a better estimate of that state's value.

• The weight factor should be dynamic and gradually give more weight to the maximum Q value of the target DQN, because towards the end of training the target DQN already holds a good parameter set, so its maximum Q value represents the state value better.

4.2.2 Methodology

In the dynamic partial average model, we change the target component of the loss function as follows:

L(\theta) = \mathbb{E}_{s,a,r,s'}\Big[ \big( y^{DPAV} - Q(s, a; \theta) \big)^{2} \Big], \quad (4.10)

and

y^{DPAV} = r + \gamma \Big( (1 - \lambda) \max_{a'} Q(s', a'; \theta^{-}) + \lambda \min_{a''} Q(s', a''; \theta^{-}) \Big), \quad (4.11)

where γ is the discount factor for the future state value, θ⁻ denotes the target network parameters, and λ is the average weight factor. λ decays at a predefined rate as training proceeds:

\lambda \leftarrow \lambda \cdot d, \quad (4.12)

where d is the decay rate. The dynamic partial average algorithm is summarized as follows:

Algorithm 4: Dynamic partial average model
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights
Initialize decay rate d (d can be set adaptively along the training process)
for episode = 1, ..., M do
    Initialize state s_1
    for t = 1, ..., T do
        With probability ε select a random action a_t,
            otherwise select a_t = argmax_a Q(s_t, a; θ)
        Execute action a_t in the environment, observe reward r_t and next state s_{t+1}
        Store transition (s_t, a_t, r_t, s_{t+1}) in D
        Set s_t = s_{t+1}
        Sample a random minibatch of transitions (s_j, a_j, r_j, s_{j+1}) from D
        Set y_j = r_j                                                                                  for terminal s_{j+1}
            y_j = r_j + γ((1 − λ) max_{a'} Q(s_{j+1}, a'; θ⁻) + λ min_{a''} Q(s_{j+1}, a''; θ⁻))        for non-terminal s_{j+1}
        Perform a gradient descent step on (y_j − Q(s_j, a_j; θ))^2 to update θ
        Replace the target parameters θ⁻ ← θ after every L iterations
        Update the average weight λ ← λ · d after every U iterations
    end
end
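
For concreteness, the following PyTorch sketch shows how the DPAV target of Equation (4.11) and Algorithm 4 could be computed for a minibatch; the function and argument names (dpav_target, target_q_net, etc.) are illustrative assumptions.

import torch

def dpav_target(reward, next_state, done, target_q_net, gamma, lam):
    """DPAV target: y = r + γ * ((1 − λ) max_{a'} Q(s', a'; θ⁻) + λ min_{a''} Q(s', a''; θ⁻)).

    reward:       tensor of shape (batch,)
    next_state:   tensor of shape (batch, state_dim)
    done:         boolean tensor of shape (batch,); True for terminal s'
    target_q_net: frozen target network mapping states to (batch, num_actions) Q values
    gamma:        discount factor
    lam:          dynamic average weight, decayed as λ ← λ · d during training
    """
    with torch.no_grad():
        q_next = target_q_net(next_state)            # (batch, num_actions)
        q_max = q_next.max(dim=1).values             # max over actions
        q_min = q_next.min(dim=1).values             # min over actions
        partial_avg = (1.0 - lam) * q_max + lam * q_min
        # terminal transitions keep only the immediate reward
        return reward + gamma * partial_avg * (~done).float()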

4.3 Duel dynamic partial average

Duel dynamic partial average (Duel DPAV) is a variant of the dynamic partial average model. The overall algorithm is the same as for the DPAV model; the only difference is that Duel DPAV uses a dueling structure instead of the conventional single-stream structure to estimate the Q value more accurately. The Q value estimation formula is as follows:

Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big( A(s, a; \theta, \alpha) - \frac{1}{|A|} \sum_{a'} A(s, a'; \theta, \alpha) \Big). \quad (4.13)

4.4 Averaged DQN with dynamic partial average

In this section, we combine the dynamic partial average mechanism with the averaged DQN model, since averaging reduces variance and therefore makes the Q value estimation more accurate. The whole algorithm is presented as follows:

Algorithm 5: Averaged DQN with DPAV
Initialize replay memory D to capacity M
Initialize action-value function Q with random weights
Initialize exploration procedure Explore(·)
Initialize decay rate d
for i = 1, ..., N do
    Q^A_{i-1}(s', a') = (1/K) Σ_{k=1}^{K} Q(s', a'; θ_{i-k})
    y^i_{s,a} = E_B[ r + γ((1 − λ) max_{a'} Q^A_{i-1}(s', a') + λ min_{a''} Q^A_{i-1}(s', a'')) | s, a ]
    θ_i ≈ argmin_θ E_D[ (y^i_{s,a} − Q(s, a; θ))^2 ]
    Explore(·), update D
    Update the weight factor λ ← λ · d after every U iterations
end
output: Q^A_N(s, a) = (1/K) Σ_{k=0}^{K-1} Q(s, a; θ_{N−k})
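
Combining the two previous sketches, the Algorithm 5 target could be computed as follows (again a sketch under the same naming assumptions; averaged_q refers to the helper defined in the averaged DQN sketch above):

def averaged_dpav_target(reward, next_state, done, prev_q_nets, gamma, lam):
    """Algorithm 5 target: the partial average of Equation (4.11) applied to the
    ensemble estimate Q^A instead of a single target network."""
    q_avg = averaged_q(prev_q_nets, next_state)   # (batch, num_actions), averaged over K networks
    partial = (1.0 - lam) * q_avg.max(dim=1).values + lam * q_avg.min(dim=1).values
    return reward + gamma * partial * (~done).float()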

4.5 Duel averaged DQN

Duel averaged DQN is an ensemble model. The dueling Q-network structure makes the Q value estimation more accurate, while averaged DQN reduces the variance of the Q value estimate so that it approaches the ground-truth Q value. In this model, we combine the two ideas into one ensemble model. The algorithm structure is the same as that of averaged DQN in the background section of this thesis.


Chapter 5

Experiment setup

5.1 Research questions

The following research questions will be answered in the experiment:

• (RQ1) How do all models (including the baselines and the proposed model variants) perform in an error-free conversation context?

• (RQ2) How do all models perform under simulated errors in the natural language understanding module?

• (RQ3) What are suitable hyperparameter settings for the dynamic partial average model?

• (RQ4) What is the effect of the number of target networks on averaged DQN?

5.2 Dataset

The dataset was proposed in the research work (Li et al., 2016) by collecting 280 simulated dialogues via Amazon Mechanical Turk. All dialogues were annotated with intent labels and slot labels following a fixed schema. Each dialogue is centered on one user goal; the user goal is visible to the user simulator during the simulated dialogue, while it is hidden from the dialogue agent. Table 5.1 presents one sampled user goal. The dataset contains 29 slots (e.g., numberofpeople, theater, starttime, moviename) and 11 intents (e.g., confirm_answer, confirm_question, inform, request).

INTENT: request, inform, deny, confirm_ques, confirm_answer, greeting, closing, not_sure, multiple_choice, thanks, welcome

SLOT: actor, actress, city, closing, critic_rating, date, description, distanceconstraints, greeting, implicit_value, movie_series, moviename, mpaa_rating, numberofpeople, numberofkids, taskcomplete, other, price, seating, starttime, state, theater, theater_chain, video_format, zip, result, ticket, mc_list

User goal
{
  "request_slots": { "ticket": "UNK", "theater": "UNK", "starttime": "UNK" },
  "diaact": "request",
  "inform_slots": { "date": "tomorrow", "city": "philadelphia", "numberofpeople": "4", "moviename": "deadpool", "zip": "19101" }
}
