
MSc Artificial Intelligence

Master Thesis

Dialogue Act Classification

using Inductive Graph Learning

by

Arvid Lindström

12365718

July 6, 2020

48 ECTS, November 3rd 2019 – June 30th 2020

Supervisor:

Dr. Svitlana VAKULENKO

Assessor:

Dr. Evangelos KANOULAS

FACULTY OF SCIENCE


Declaration of Authorship

I, Arvid Lindström, hereby declare that the thesis titled "Dialogue Act Classification using Inductive Graph Learning" and the presented research are entirely my own. Furthermore, I confirm that the complete work was done in pursuit of a research degree from the Master's program in Artificial Intelligence at the University of Amsterdam. All sources of assistance, as well as previous work upon which this thesis builds, have been properly indicated as such. Any work which may have been conducted with the assistance of others has been dutifully indicated.

Signed: Arvid Lindström. Date: June 29th, 2020.


UNIVERSITY OF AMSTERDAM

Abstract

Dialogue Act Classification using Inductive Graph Learning

by Arvid Lindström

With the advent of hierarchical neural networks utilizing self-attention modules, state of the art results have been achieved on the dialogue act and user intent classification tasks. In this work, we utilize recent advances in graph representation learning and reframe the Dialogue Act classification task as a graph learning problem. In order to take into account directional and intent-related structures of dialogue, we devise a schema for converting dialogue corpora into graphical representations, upon which we apply the widely used GraphSAGE framework by Hamilton et al. We study and analyse the viability of our method and conclude that using graphical representations improves performance over independent context embeddings of utterances on human to human dialogue. We also show results comparable to the state of the art on corpora of online forum dialogue, using independent context embeddings, and discuss the challenges faced when applying graph learning to dialogues with few utterances. The thesis concludes with an in-depth analysis of the behavior of various aggregation functions used to perform message passing using random walks, and of why such graph models fail to generate discriminative features at the utterance level.


Acknowledgements

Firstly, I want to acknowledge and thank Dr. Svitlana Vakulenko for her invaluable guidance and support during the creation of this thesis, without which the project would surely not have been possible. I also extend my gratitude to Dr. Evangelos Kanoulas for being the second assessor of this work. Furthermore, I direct my great thanks towards Yunuscan Kocak and Gianluigi Bardelloni at the KPN Data Science Lab for their mentorship and openness to explore the research most beneficial to my thesis. Last, but not least, I want to thank my friends and my parents for their never-ending love and support during these last nine months.


Contents

Declaration of Authorship
Abstract
Acknowledgements
List of Figures
List of Tables
List of Symbols

1 Introduction
1.1 Dialogue as a process
1.2 Problem definition
1.3 Using context to perform DA classification
1.4 Contributions of this work
1.5 Thesis structure

2 Related Work
2.1 Analysing dialogues as processes
2.2 Dialogue act classification
2.2.1 Neural approaches to DA-classification
2.2.2 Explicit modelling of dependencies between dialogue acts
2.3 Intent classification

3 Background
3.1 Conversational search
3.2 Generation of initial node embeddings using ELMo
3.3 Inductive graph representation learning
3.3.1 Forward propagation step
3.3.2 Training of model parameters
3.3.3 Aggregation functions

4 Method
4.1 Dialogue Graph
4.2 Graph construction methodology

5 Experimental Evaluation
5.1 Datasets
5.1.1 SwDA: The Switchboard Dialogue Act Corpus
5.1.2 MRDA: ICSI Meeting Recorder Dialogue Act Corpus
5.1.3 MANtIS: multi-domain information seeking dialogues
5.1.4 MSDialog: Microsoft community support forum
5.2 Pre-processing of utterance text
5.2.1 SwDA and MRDA
5.2.2 MSDialog and MANtIS
5.3 Generating initial embeddings using ELMo
5.3.1 Visualization of node embeddings
5.4 Baseline experimental setup
5.4.1 Evaluation metrics
5.5 Inductive graph learning experimental setup
5.5.1 Experimental design motives

6 Results
6.1 Results of baseline classifier
6.2 Results of inductive graph learning models
6.2.1 SwDA
6.2.2 MRDA
6.2.3 MSDialog
6.2.4 MANtIS

List of Figures

3.1 Illustration of GraphSAGE pipeline
4.1 Training and inference pipeline of our method based on GraphSAGE
4.2 Algorithm: Constructing directed graph from corpus D
4.3 Illustration of resulting dialogue graph
5.1 t-SNE visualization of initial ELMo embeddings of utterances
6.1 t-SNE visualization of GraphSAGE embeddings of utterances

List of Tables

1.1 Example of dialogue corpus annotated with dialogue acts
5.1 Overview of datasets used in this thesis
5.2 GraphSAGE model architectures used
6.1 Baseline results for Dialogue Act and User-intent classification models
6.2 GraphSAGE results on SwDA dataset
6.3 GraphSAGE results on MRDA dataset
6.4 GraphSAGE results on MSDialog dataset
6.5 GraphSAGE results on MANtIS dataset


List of Symbols

D : A corpus containing dialogues d_i ∈ D, i ∈ [1, 2, …, N]
A_{d_i} : The set of actors a_k ∈ A_{d_i} belonging to dialogue/conversation d_i
U_{d_i} : The ordered set of utterances u_j ∈ U_{d_i} which constitutes conversation d_i
u_j : A single utterance at position j, defined as a sequence of words u_j := [w_0, w_1, …, w_{M−1}] of variable length M
Y_{d_i} : The alignment of sequential labels that corresponds to the ordered sequence of utterances U_{d_i} constituting dialogue d_i
Y_D : The set of discrete dialogue act labels that appear in corpus D
G_D : A graphical representation of corpus D, defined as a set of vertices and edges G := {V, E}
v_k ∈ V : Nodes in graph G_D representing the set of all unique utterances occurring in corpus D through the mapping u_j → v_k, ∀ u_j ∈ U_{d_i}, d_i ∈ D, such that |V| = Σ_{i=1}^{N} |U_{d_i}|. The graph grows further in size if actor nodes are also added.
e_flow : An edge e_flow ∈ E connecting nodes v_k ∈ V in the dialogue graph if the corresponding utterances follow consecutively
e_act : An edge connecting utterance nodes to a common actor node, appearing only within the same dialogue as the connected utterances


Chapter 1: Introduction

In the field of Conversational Artificial Intelligence, it becomes necessary to process and understand the dialogue acts exercised by dialogue participants in order to produce natural conversations. In particular, the identification of dialogue acts helps us understand how a given conversation was executed and can be used to optimize systems towards more efficient search in the domain of task-oriented conversation. A Dialogue Act, as defined by Searle [1969], constitutes the atomic unit of linguistic communication. It provides a representation of the meaning of an utterance, regarding its illocutionary force. Intuitively, the recognition of dialogue acts entails detecting and understanding the semantic action intended by the actor given a set of words and phrases in the context of a dialogue. Consider the following conversation snippet from the SwDA corpus [Jurafsky and Shriberg, 1997]:

Speaker | Utterance | Dialogue Act
A | are you on a regular exercise program right now? | Yes-No-Question
B | Yes, | Yes answers
B | and I hate it <laughter> | Statement-non-opinion extending yes-answer
A | <Laughter>. | Non-verbal
B | How about you? | Open-Question
A | Oh, well, I'm kind of off and on. | Statement-non-opinion
B | Off and on | Repeat-phrase
B | well, I guess I've been kind of off and on | Statement-non-opinion
B | I've, uh | Statement-non-opinion

Table 1.1: Example of dialogue corpus annotated with dialogue acts.

We utilize the above example to introduce some necessary terminology. Two actors A, B ∈ A_d commit utterances u_j ∈ U_d, j ∈ {1 … K}, in a single dyadic conversation d. Conversation d is defined as the sequence of K utterances performed by actors A and B. Note that when talking about a dialogue d_i as part of a corpus of dialogues, we use the terms conversation and dialogue interchangeably. The definition of utterances as the result of actors taking turns providing input to the conversation is pivotal to the understanding of dialogue acts. A single utterance may contain any sound, word or phrase that an actor exhibits in response to the current state of the conversation. Dialogue Act Classification, often presented in the literature as Dialogue Act (DA) Recognition or Dialogue Intent Classification, therefore constitutes the identification of the illocutionary intent of utterances.

1.1 Dialogue as a process

In this work we adapt the provided definition to also fit the analysis of conversations using process mining [Aalst, van der, 2011]. Process mining as a field provides techniques to discover and evaluate the conformance of data in the form of event logs. As shown in the works of Compagno et al. [2018] and Vakulenko et al. [2019], a dialogue corpus D may be considered a set of event logs where each trace is a conversation d_i ∈ D. It therefore follows that utterances constitute the events of the process trace and the dialogue acts become analogous to event classes. Using process discovery algorithms such as the Inductive Miner (IM) [Leemans et al., 2014], frequencies of transitions between event classes (in our case dialogue acts) may serve as indications of dependencies between co-occurring dialogue acts. A discovered process describes behavior observed in the corpus between actors in conversations and may be represented as a graph G = {V, E}, where each node v_i ∈ V uniquely represents a dialogue act category (such as "Yes-No-Question" in Table 1.1) and edges e_j ∈ E represent transitions (weighted by their frequencies) observed in the corpus. Such graphs provide research into conversational systems with the tools to quantitatively assess dialogue models [Compagno et al., 2018].

For any study of conversations using process mining, the dialogues must first be discretised into a finite set of dialogue labels (e.g. Dialogue Acts), such as in Jurafsky and Shriberg [1997]. Because of the tedious and often overwhelmingly large amount of annotation required to study human to human conversations, we look to dialogue act classification for an automated solution to this problem. As such, our work becomes instrumental for any large-scale study of dialogue using process mining techniques, specifically in domains where data is expected to change frequently, such as social media platforms or customer service.

1.2 Problem definition

For the purposes of this work, we employ the formal definition of the dialogue act classification problem as described by Chen et al. [2017]. We extend upon their problem definition so as to maintain conformance with the established literature while allowing for a natural interpretation of our proposed methods. Given a dataset D := {d_0, d_1, …, d_{N−1}}, consisting of N conversations between autonomous speakers, the goal is to categorize each dialogue d_i ∈ D into the given alignment Y_i := {y_0, y_1, …, y_{K_i−1}}, where K_i is the number of utterances in dialogue d_i and Y_i is the desired sequence of discrete categorical act-labels aligned with the respective utterances u_j ∈ U_{d_i}. An utterance is formally defined as a variable-length sequence of M words such that u_j := [w_0, w_1, …, w_{M−1}]. For any given corpus D, a set of labels Y_D defines the discrete categories of dialogue acts that an utterance u_j may take. As such, dialogue act classification belongs to the family of sequence-to-sequence classification tasks that underlie many Natural Language Processing problems, for example machine translation or part-of-speech tagging [Tran et al., 2017]. During inference, the purpose of a DA-classification model f_θ(d_i) is to take as input a single dialogue d_i ∈ D and produce an ordered set of DA-labels: {u_0, u_1, …, u_{K−1}} → {y_0, y_1, …, y_{K−1}}.
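To make the expected input/output of such a model concrete, the following minimal sketch (all names hypothetical) shows the mapping from an ordered list of K utterances to an aligned list of K labels:

```python
from typing import List

def classify_dialogue(model, dialogue: List[str]) -> List[str]:
    """Hypothetical DA-classification interface f_theta.

    dialogue: ordered utterances [u_0, ..., u_{K-1}] of one conversation d_i.
    returns:  aligned dialogue act labels [y_0, ..., y_{K-1}], one per utterance.
    """
    return [model.predict(dialogue, j) for j in range(len(dialogue))]
```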

1.3 Using context to perform DA classification

Recent progress in the field has leveraged hierarchical information when constructing classifiers for the task of DA-classification [Kumar et al., 2017]. State of the art models have been built on the notion that DA classification should be phrased as a sequence classification task using conditional random fields [Chen et al., 2017], [Kumar et al., 2017], [Raheja and Tetreault, 2019]. The overarching target is to overcome the shortcomings of previous work in which each utterance was considered independent [Ang et al., 2005], [Grau et al., 2004]. With the advent of graph neural networks as a promising tool to create graph-relational embeddings of data [Wu et al., 2019], we represent utterances as nodes in a dialogue graph G := {V, E}. Our belief is that dialogue acts may be conditionally dependent on both future and past utterances, but also on the directionality of the utterances and the speaker producing them. The hypothesis is that by representing utterances as part of directed dialogue graphs, we may better capture such relational dependencies.

DA-classification is an inherently inductive procedure, and many transductive graph network methods such as [Gilmer et al., 2017] and [Kipf and Welling, 2016] are ill-suited for our task. This is because node representations for unseen utterances would require a complete re-training on an ever-growing graph of dialogues. Recent work by Ghosal et al. [2019] leverages convolutional graph networks in an inductive setting for emotion recognition. This task is directly comparable to DA classification as it entails classifying a sequence of text into a sequence of labels. The authors construct graph representations of input dialogues at inference time according to structural rules dependent on available actor labels. The output node embeddings used for classification are the result of aggregating information over the neighborhood nodes.

In this work we use the inductive graph representation method GraphSAGE [Hamilton et al., 2017] and train models on the standard benchmark datasets SwDA [Jurafsky and Shriberg, 1997] and MRDA [Shriberg et al., 2004] to produce utterance embeddings containing discriminative information. We then feed these embeddings into a downstream classifier in order to categorize the utterances into their respective dialogue acts. Furthermore, we investigate the performance of our model on the datasets MANtIS [Gustavo Penha and Hauff, 2019] and MSDialog [Qu et al., 2018], which contain user intent-labelled information seeking dialogues from online support forums. Although the datasets differ in length and written structure, they offer us a broader domain in which to utilize graphical representations of human to human conversation.

1.4 Contributions of this work

The contributions of our work are twofold:

1. We propose a methodology for transforming a corpus of dialogue utterances into a graph representation, allowing both textual features as well as dialogue neighborhood structure to be encoded into the utterance representations.

2. We apply graph neural networks, specifically GraphSAGE, to generate embeddings for utterances to be used for dialogue act classification. To our knowledge, this is the first study to model dialogue graphically rather than as sequences of utterances.


1.5 Thesis structure

The structure of this thesis is outlined as follows. Chapter 2 describes previous work in the fields of dialogue act classification and intent classification, two closely related tasks that we explore in this work to test the performance of our proposed method. Chapter 3 provides theoretical background on the embedding method as well as the GraphSAGE algorithm. In Chapters 4 and 5 we describe our method and our experimental setup, respectively. Results and an analysis are given in Chapter 6. In Chapter 7 we conclude this thesis and give suggestions for future research.


Chapter 2: Related Work

2.1 Analysing dialogues as processes

In the work by Compagno et al. [2018], the authors conduct an analysis of human to human conversations using process mining techniques [Aalst, van der, 2011], namely a fuzzy algorithm for inductive process discovery implemented by Disco (https://www.fluxicon.com/disco/). The goal of Compagno et al. is to empirically establish speech-act patterns in conversational data. In the paper, the authors manually label dialogues and manage to identify conversational preference in the speech acts "Direct" (requirements, requests, suggestions) and "Complain" (expression of negative sentiment). Later work by Vakulenko et al. [2019] states that to improve information-seeking dialogue systems (performing conversational search), it is beneficial to first understand the interaction process between humans. To this end the authors also utilize process mining, applied to varying domains, while discretizing utterances into the User/Agent and Proactive/Reactive categories: Query, Feedback, Request and Answer. This schema unifies annotations across varying domains of information-seeking dialogues, and the authors also describe how to perform conformance checking of dialogue systems using process mining. This type of work is insightful to creators of dialogue systems as it may allow the identification of conversation flows carried out by humans in order to establish a user's information need. For these types of studies to scale easily in a commercial setting, where datasets of dialogue are expected to grow, it becomes necessary to automate the annotation process. To this end we turn to models that can perform sequence to sequence classification.

2.2 Dialogue act classification

Among the earliest work in the field of DA-classification is the research done by Grau et al. [2004]. In their paper the authors report a 66% accuracy on the SwDA dataset, using a naive Bayes classifier with 3-grams and Laplacian smoothing to handle the cases with 0 probability. An utterance was therefore represented by a vector of word-counts (or n-gram counts) and labeled examples were used to train the parameters of a probabilistic generative model. Selection of a DA-category was performed by choosing the class which produced the highest probability from the Bayes decision rule as such:

$$c^\star = \operatorname*{argmax}_{c} \; \Pr(c) \prod_{i=1}^{I} \Pr(w_i, w_{i-1}, \ldots, w_{i-n+1} \mid c) \quad (2.1)$$
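As an illustration, the decision rule in equation 2.1 can be sketched as follows, assuming smoothed log-probabilities have already been estimated from labeled examples (all names hypothetical):

```python
import math

def classify(utterance_ngrams, log_prior, log_likelihood, classes):
    """Pick the class maximizing the (log of the) Bayes decision rule, eq. 2.1.

    utterance_ngrams: list of n-grams extracted from one utterance
    log_prior:        dict class -> log Pr(c)
    log_likelihood:   dict (class, ngram) -> smoothed log Pr(ngram | c)
    """
    def score(c):
        return log_prior[c] + sum(
            log_likelihood.get((c, g), math.log(1e-8))  # floor for unseen n-grams
            for g in utterance_ngrams)
    return max(classes, key=score)
```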

Although the work done by Grau et al. has been far superseded by more modern methods, their study does highlight the influence of stop-words on word-count representations. They show that keeping stop-words increases accuracy by at least 10% across all experimental settings. This further highlights that DA-classification should not be placed under the same umbrella as standard NLP tasks in which stop-word removal has been considered a de facto pre-processing step. Clearly, an utterance's linguistic role cannot be determined by content words alone, and stop-words such as "where", "then" and "because" carry important discriminative information when utterances are assumed to be codependent.

A previous, unrelated study using a Bayesian approach was conducted by Keizer et al. [2002]. In the paper, the researchers attempt to tackle the assumption, made by previous work using naive Bayes classifiers, that features are conditionally independent given a dialogue act. The study is constructed to increase performance in a human-to-machine conversational search environment, in which utterance features are considered finite and thus not the product of a user expressing themselves freely using natural language. Because of this convenient limitation, the authors may model dependencies such as P(DA | CanYou, IWant), where CanYou and IWant are some of the possible utterances available to the user and DA is one of the dialogue acts present in the dialogues. Using Bayesian Networks, which allow the modelling of dependency structure as well as conditional probabilities between dialogue acts and features, the researchers reported an accuracy of 76.5% on a small dataset of Dutch dialogues.

Dialogue Act classification may also serve as an auxiliary task towards text summarization in the meeting domain. In the paper by Ang et al. [2005], the authors recognize that to properly summarize a meeting dialogue, such as the transcripts making up the MRDA dataset, simple topic indicators (e.g. keywords) are not enough to provide a clear overview of the dialogue. Indicators such as "who asked what" and "who interrupted whom" become helpful in summarizing spoken dialogue between actors. Ang et al. provide a baseline for DA-classification in the meeting domain by grouping the labels of the dataset into five classes: Statements, Questions, Backchannels, Fillers and Disruptions. For classification the authors use a maximum entropy text classifier, a method which has seen extensive application in the natural language processing community [Berger et al., 1996]. The benefit of maximum entropy models in text classification is that a model may maximize the probability density of the features found in the training data, while also maximizing the smoothness of the model distribution, following the principle of maximum entropy. In essence, this entails that the model will approach uniformity in the regions where uncertainty is the greatest, thus solving the issue when no clear prior probability distribution may be assumed.

2.2.1 Neural approaches to DA-classification

In the wake of deep learning as a means of creating more powerful feature representations without manual effort, many works have been published that utilize hierarchical recurrent neural network (RNN) architectures. In the work by Tran et al. [2017], the authors design a hierarchical RNN model for the particular use-case of live dialogue systems where DAs must be predicted without having access to a finished dialogue transcript. This is in contrast to our work, which can be considered useful for detailed conversation summarization where a full dialogue history is available at each utterance. In the paper the authors used an RNN with a hidden size of 160 to encode the utterance text, previously embedded by a self-trained word embedding matrix, similar to [Mikolov et al., 2013] or [Pennington et al., 2014]. This was followed by an attention layer, as used by Shen and Lee [2016], conditioned on the hidden state of a subsequent RNN layer used for embedding the output of layer one. In addition, a learnt embedding of the actual DA label predicted in the previous time-step was also used to condition the attention module. This allows the model to take into account previous model outputs when embedding future utterances for classification. Their model achieved 74.5% accuracy on the SwDA dataset. The authors also highlight the effect of allowing attention layers at subsequent time steps direct access to the ground-truth label of previous steps. This, however, degraded performance, and the authors speculate that the issue is related to the so-called exposure bias, described in detail in Ranzato et al. [2016].

2.2.2 Explicit modelling of dependencies between dialogue acts

Due to the dependency of current dialogue acts on previous ones, such as Greeting often following Greeting and Yes-No-Answer following Yes-No-Question, it may benefit a model to explicitly model the sequence of output predictions. In Kumar et al. [2017], the authors also utilize a hierarchical recurrent neural network architecture to generate utterance embeddings at a conversational level, but without utilizing an attention module. The output of the HA-RNN is fed into a linear-chain conditional random field (CRF) [Lafferty et al., 2001] for classification, rather than a softmax layer as used in Tran et al. [2017]. Conditional random fields have in recent years been considered a standard approach towards sequence tagging, such as POS tagging or NER, with immediate gains in terms of accuracy [Akbik et al., 2018], [Song et al., 2019]. Intuitively, a linear-chain CRF classifier calculates a probability for each sequence, taking into account only the immediately preceding dialogue act. The CRF module models the probability of a label, p(y_j | g_j, y_{j−1}), for the entire sequence of labels in a given dialogue as:

$$p(y_1, \ldots, y_{|Y_{d_i}|} \mid u_1, \ldots, u_{|U_{d_i}|}; \theta) = \frac{\prod_{j=1}^{|Y_{d_i}|} \psi(y_{j-1}, y_j, g_j; \theta)}{\sum_{y' \in \mathcal{Y}} \prod_{j=1}^{|Y_{d_i}|} \psi(y'_{j-1}, y'_j, g_j; \theta)} \quad (2.2)$$

where the CRF module is parameterized by the transition matrix θ. The unary feature function (modelling the probability p(y_j | u_j)) is simply the output g_j of the final HA-RNN layer. The state transition matrix is of size K × K for a classification task with K types of dialogue acts and remains the same for every position in the sequence. The normalization across every possible label sequence (the sum over Y in equation 2.2) is of complexity O(|Y|^{|U_{d_i}|}) if computed naively, but the most likely sequence may be found efficiently at inference time using the Viterbi algorithm [Viterbi, 1967]. In essence, a CRF allows the designer of a classification architecture to condition a model on its preceding predictions, in contrast to a softmax layer which only takes into account the input features of the present data point.
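To illustrate the decoding step, here is a minimal Viterbi sketch over log-space scores, assuming the unary scores g and the transition matrix θ are given (shapes hypothetical):

```python
import numpy as np

def viterbi_decode(unary, transition):
    """Find the most likely label sequence under a linear-chain CRF.

    unary:      (T, K) log-scores g_j for each of T utterances and K labels
    transition: (K, K) log-scores theta[a, b] for moving from label a to b
    """
    T, K = unary.shape
    score = unary[0].copy()               # best score ending in each label
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + transition + unary[t][None, :]  # (K, K)
        backptr[t] = np.argmax(cand, axis=0)
        score = np.max(cand, axis=0)
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):         # backtrack from the best final label
        path.append(int(backptr[t, path[-1]]))
    return path[::-1]                     # labels y_0 ... y_{T-1}
```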

With this contribution, Kumar et al. achieve 79.2% accuracy on the SwDA dataset as well as 90.9% on the class-reduced version of MRDA used consistently by the works surveyed in this section. As an added note, the authors also report common issues with performing evaluation on the SwDA dataset due to annotation disagreement within the dataset. Because of the large subjectivity of dialogue act annotations done by linguists, the aspired accuracy of models is often reported relative to the 84% human annotation accuracy. As an extension of the works by Tran et al. and Kumar et al., Chen et al. [2017] embed their word-tokens by a concatenation of part-of-speech and named-entity-recognition embeddings, as well as character-level embeddings following the method of Kim [2014]. Similar to Tran et al., the authors use an attention module in the generation of utterance embeddings, and following Kumar et al., they also use a linear-chain CRF for classification. With this combination of previous methods, the authors achieve 81.3% and 91.7% accuracy on the SwDA and MRDA datasets, respectively. These results were considered state of the art until the publication by Raheja and Tetreault [2019], where the authors used GloVe embeddings in combination with the increasingly popular ELMo embeddings by Peters et al. [2018] to generate word-token embeddings. The main contribution, however, was the use of conversational-level bidirectional Gated Recurrent Unit embeddings (specifically the forward-pass utterance embedding u_j^forward) to condition the attention module operating on the utterance level. This method is known as self-attention, following Lin et al. [2017], and is in contrast to the work by Chen et al., where attention was conditioned on the utterance at the current time step. Using these additions to the architecture, the authors reported 82.9% accuracy on the SwDA dataset.

2.3 Intent classification

User intent classification is a task closely related to dialogue act classification. It can also be framed as a sequence-to-sequence recognition problem, primarily focused on detecting the purpose of an utterance in the domain of information seeking dialogue. Examples of common user-intent labels are Greeting, Farewell and Repeat-Question. In the work by Qu et al. [2019], the authors evaluate the performance of a user-intent classification model on the MSDialog dataset [Qu et al., 2018], containing 2199 user-intent annotated information seeking dialogues (see dataset details in Section 5.1). The authors argue for the necessity of user-intent classification as a tool for proactively eliciting information from a user based on the perceived intent, instead of grounding decisions purely in matching tokens against some external database. The problem of user-intent recognition is made all the more challenging by longer utterances, often containing hundreds of words and hence covering multiple types of intent simultaneously. Hence, the task is defined as both a multi-class and multi-label classification task. For example, intents such as Greeting and Original-Question often co-occur in utterances provided by information seeking agents.

As is the nature of conversational datasets of human to human dialogue, the user-intent labels are highly imbalanced, with the top 32 combinations of labels constituting 90% of the labeled set. Due to suspected annotation errors in the crowdsourced method employed by the creators of the dataset, the authors reduce the label space through a series of transformations. Firstly, any multi-label utterance containing Greeting/Gratitude, Junk or Others is reduced such that the label not belonging to the aforementioned set is kept. This places a higher priority on representing the data using labels such as Positive-Feedback or Clarifying-Question, and is motivated by the research goal of investigating patterns specific to information seeking, in which conversation openers, farewells and stray comments are considered unnecessary. Secondly, multi-label combinations containing several user-intents are often found only once in the entire dataset. These are reduced through a sampling schema where a single intent-label is sampled uniformly:

$$L_{u_j} \sim \{GG,\, CQ,\, FD,\, IR\}_{u_j}, \quad u_j \in U_{d_i} \quad (2.3)$$

where the label L_{u_j} representing utterance u_j takes on one of the labels from the original set of four, as seen in equation 2.3. Following these label reduction steps, the authors reduce the number of label combinations from 316 to 33. The authors justify the label-set reduction on the grounds of having far too few available examples of each combination to properly train a predictive model. Using an ensemble of hand-crafted features, such as utterance position, presence of 5W1H-words (what, when, where, why, who, how), question-mark counts, length of utterance as well as sentiment scores, the authors train a series of classifiers to be used as baseline models. Among these, a random forest model performs best and is also used for an in-depth feature importance analysis. It is shown that the addition of sentiment scores yields a negligible performance increase in comparison to structural features such as utterance position and length. This may be explained by the lack of dominant emotional expressions on online support forums, from which the dataset originates: it is unlikely that users would express great positive sentiment unless the utterance is of type Farewell or Positive-Feedback. Finally, in order to compare modern neural methods with structural features, the authors report results using Kim-CNN [Kim, 2014] and BiLSTM [Hochreiter and Schmidhuber, 1997] based models. They report that a convolutional neural model, taking as input at each timestep the current utterance u_t as well as u_{t−1} and u_{t+1}, achieves 68% accuracy on the label-reduced MSDialog dataset, with an F1-score of 71.3% and a precision of 78.8%.
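As an illustration of such hand-crafted structural features, here is a minimal sketch (the feature choice follows the description above; all names are hypothetical):

```python
FIVE_W_ONE_H = {"what", "when", "where", "why", "who", "how"}

def structural_features(utterance: str, position: int, dialogue_len: int):
    """Extract a few of the structural features described by Qu et al. (sketch)."""
    tokens = utterance.lower().split()
    return {
        "abs_position": position,
        "rel_position": position / max(dialogue_len - 1, 1),
        "length": len(tokens),
        "question_marks": utterance.count("?"),
        "has_5w1h": any(t.strip("?.,!") in FIVE_W_ONE_H for t in tokens),
    }

# Example: third utterance of a five-utterance dialogue
feats = structural_features("How do I reset my password?", 2, 5)
```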

Building upon the work of Qu et al., Yu et al. [2019] note that in a user-intent recognition task in the domain of online community forums, an utterance may refer back to any previous utterance. It therefore becomes necessary to model the dependency of utterances across time, to which end the authors propose a convolutional recurrent neural network. The model achieving the highest predictive performance utilizes highway connections [Sanders and Schultes, 2005] to allow a dense classification layer access to both convolutional and recurrent features. K-max dynamic pooling, using p sequences of m word-vectors, is used to aggregate information across the utterances due to their length. The authors report 68% accuracy with an F1-score of 73% on the MSDialog dataset, without utilizing the label reduction method employed by previous work.

Due to the shortcomings of previously established datasets in the conversational search domain, Gustavo Penha and Hauff [2019] define a new dataset called MANtIS, consisting of utterances from online support forums conforming to a set of desiderata concerning the use cases of conversational search modelling. More details explaining the purpose and collection of the MANtIS dataset can be found in Section 5.1 of this thesis. In the dataset, 6701 utterances have been annotated in a multi-label fashion based on a subset of the labels used in MSDialog. Due to the limited availability of labeled utterances, as well as the notable class imbalance, results of user-intent classification are reported as the mean across a 10-fold cross-validation. The authors perform experiments using two neural models. Firstly, a BiGRU [Cho et al., 2014], which is in essence constructed in the same manner as a bidirectional LSTM, using only one gate for forget/input instead of the two separate gates of traditional LSTMs. Secondly, the authors employ a fine-tuned BERT model [Devlin et al., 2018], using in both cases a fully connected layer on the output to perform prediction. The highest classification score is obtained using BERT, with 79% average precision and F1-micro and F1-macro scores of 75% and 59%, respectively.


Chapter 3: Background

3.1 Conversational search

Conversational search, in contrast to classical question answering (QA), requires a system to maintain an internal understanding of, and hypothesize about, which pieces of information are necessary in order to accurately meet a user's information need [Gao et al., 2018]. Unlike QA, in which the user is assumed to provide informed queries leading to immediate answers, conversational search agents must actively engage and inform the user during the information search. This presents numerous challenges, such as determining the user's understanding of their proposed issue, hypothesizing about possible solutions, ranking responses based on the immediate information need and maintaining a natural user experience. Based on these objectives, conversational search as a research field naturally subsumes information retrieval and natural language understanding. Utterances and responses within a conversation between agents must be grounded in existing knowledge as well as predictions of the state of the dialogue.

In the work by Radlinski and Craswell [2017], the authors present a theoretical framework of the properties that an ideal conversational search agent should exhibit. Based on an analysis of human to human interaction, they postulate that one of the key requirements of conversational information search is to track the expected outcome of a user's responses. A memory of past utterances also becomes necessary to provide clarifying questions. In both cases, the discretisation of utterances into dialogue act categories may help a dialogue system in these endeavours: the system may use such labels to help filter and rank responses.

3.2 Generation of initial node embeddings using ELMo

In order to create initial node features to be fed as input to our GraphSAGE-based aggregation function(s), we utilize the well-known ELMo embedding method by Peters et al. [2018]. Since its inception, ELMo has been shown to significantly advance the state of the art on many NLP tasks, and unlike previous methods such as word2vec [Mikolov et al., 2013] and GloVe [Pennington et al., 2014], ELMo does not produce fixed embeddings for each word. Given a document (analogous to an utterance in our work), ELMo begins by creating a raw word vector representation using convolutional neural nets at the character level to create context-independent representations. These vectors are then fed as input to a two-layer bidirectional LSTM [Hochreiter and Schmidhuber, 1997]. The three outputs, from the CNN character-level encoder and the two biLSTMs respectively, are aggregated into a weighted sum to produce the final embedding of the document, of shape R^{num_words × hidden_dimension}. This means that our utterance embeddings become functions of the utterances themselves, which is pivotal for our work given the highly contextual nature of dialogue and speech acts.
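As an illustration, here is a minimal sketch of the weighted-sum (scalar mix) step, assuming the three layer outputs have already been computed by the character CNN and the two biLSTM layers; the weights and shapes below are hypothetical:

```python
import numpy as np

def scalar_mix(layers, weights, gamma=1.0):
    """Combine ELMo layer outputs into one embedding per token.

    layers:  list of three arrays, each of shape (num_words, hidden_dim),
             i.e. the character-CNN output and the two biLSTM outputs.
    weights: three scalars, softmax-normalized below.
    gamma:   task-specific scale factor.
    """
    s = np.exp(weights) / np.sum(np.exp(weights))  # softmax normalization
    return gamma * sum(w * layer for w, layer in zip(s, layers))

# Hypothetical usage: three random "layer outputs" for a 5-token utterance
layers = [np.random.randn(5, 1024) for _ in range(3)]
embedding = scalar_mix(layers, weights=np.array([0.2, 0.5, 0.3]))
print(embedding.shape)  # (5, 1024)
```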

3.3 Inductive graph representation learning

Our main motivation for using GraphSAGE as our learning method of choice is related to the environment in which DA-classification systems are likely to be employed. As social media streams of recorded conversations are collected, it is to be expected that the graphical representations would grow and that graph structures would change frequently. It follows that generating embeddings from fixed-size graphs would become computationally infeasible, as it would require re-training of node embeddings upon every introduction of new nodes.

The alignment of new subgraphs to already-optimized graphical embeddings becomes non-trivial when one considers that a node's neighborhood may host both local and global structural attributes. Because of this, any graph embedding method we use must allow the inference of neighborhood information based on node embeddings alone. Methods such as [Kipf and Welling, 2016] use convolutions to aggregate information across neighborhoods. Since there is an inherent direction of intent and response in human to human conversations, convolution over neighborhoods may not easily capture the flow of a human dialogue. In this setting we instead opt for a method based on generating node embeddings through random walks throughout the graph.

The name GraphSAGE is a combination of the words SAmple and aggreGatE. GraphSAGE in and of itself is not a model architecture, but rather a framework for learning node embeddings based on topological neighborhood structures and feature distributions simultaneously. Figure 3.1 illustrates how a set of aggregation functions (aggregator 1 and aggregator 2) are trained to aggregate information from the local neighborhood of a node. These learnt functions may then be applied at inference (or testing) time to generate embeddings for nodes never seen before. In essence, an aggregation function can take any form deemed suitable for generating embeddings in a given domain. A key takeaway is that GraphSAGE does not train embeddings for individual nodes in a graph; rather, it trains aggregation functions, such as multilayer perceptrons, CNNs or LSTMs, that may generalize across similar graph structures and node features. Intuitively, an aggregation function can be seen as an encoder which traverses a graph according to some pre-sampled set of paths. In this thesis we consider dialogue a suitable domain for such a modelling schema, as human to human conversations inherently follow clear paths where questions prompt answers, statements prompt questions and greetings instigate further greetings.

Figure 3.1: Illustration of the GraphSAGE pipeline [Hamilton et al., 2017].

3.3.1 Forward propagation step

Assuming the selected aggregation functions have already been trained, the steps outlined by Hamilton et al. can be described as follows. Given a graph G = (V, E), where each node is represented by features x_v, ∀v ∈ V, the output of the GraphSAGE forward propagation are vector representations z_v, ∀v ∈ V. Initially, the hidden state representation of each node is set to the provided node features: h^0_v ← x_v, ∀v ∈ V. To exemplify our description of the GraphSAGE algorithm, we shall consider the case of two aggregation functions f_{θ1} and f_{θ2}, parameterized by θ1 and θ2, respectively. This constitutes a GraphSAGE embedding generation of depth K = 2 with corresponding search-depth information propagation weight matrices W^1 and W^2. A neighborhood function N, constructed as a lookup table of pre-generated random-walk samples, is used to generate a set of neighborhood nodes N(v) → {u_0, u_1, …, u_{M−1}}, where the size of the neighborhood M is determined as a hyperparameter of the algorithm.

For the first search depth (k = 1), aggregation function f_{θ1} is applied to each node in the graph. For each node v ∈ V, we begin by sampling the node neighborhood using the random-walk sampling function N(v). The node features h^0_u of each node in the selected neighborhood of v are aggregated into a single hidden neighborhood embedding:

$$h^1_{\mathcal{N}(v)} \leftarrow f_{\theta_1}\left(\{h^0_u, \forall u \in \mathcal{N}(v)\}\right) \quad (3.1)$$

After neighborhood feature aggregation is completed, the hidden representation h^1_v is created by concatenating the node's current representation with the neighborhood representation, followed by a fully connected layer parameterized by W^1 and a non-linear activation function σ:

$$h^1_v \leftarrow \sigma\left(W^1 \cdot \mathrm{concat}\left(h^0_v, h^1_{\mathcal{N}(v)}\right)\right) \quad (3.2)$$

After the hidden representations of each node in the graph have been updated to the output of the non-linear activation function σ, the steps in equations 3.1 and 3.2 are repeated for aggregation function f_{θ2}, using depth propagation weights W^2 after concatenating the outer neighborhood. With a search depth of K = 2, this concludes one forward propagation of the GraphSAGE embedding generation. The final node embeddings are set to the hidden representations after applying the final aggregation function: z_v ← h^{K=2}_v, ∀v ∈ V.
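The forward pass of equations 3.1 and 3.2 can be summarized in a minimal numpy sketch, assuming neighborhoods have already been sampled (node names, shapes and the ReLU non-linearity below are illustrative assumptions):

```python
import numpy as np

def sigma(x):
    return np.maximum(x, 0.0)  # ReLU as the non-linearity

def graphsage_forward(features, neighbors, aggregators, weights):
    """One GraphSAGE forward pass (depth K = len(aggregators)).

    features:    dict node -> initial feature vector x_v (h^0_v)
    neighbors:   dict node -> list of sampled neighbor nodes N(v)
    aggregators: list of K functions, each mapping a list of neighbor
                 vectors to one aggregated vector (e.g. the mean)
    weights:     list of K matrices W^k applied after concatenation
    """
    h = dict(features)  # h^0_v <- x_v
    for agg, W in zip(aggregators, weights):
        h_next = {}
        for v, h_v in h.items():
            h_nv = agg([h[u] for u in neighbors[v]])             # eq. 3.1
            h_next[v] = sigma(W @ np.concatenate([h_v, h_nv]))   # eq. 3.2
        h = h_next
    return h  # z_v for all v

# Hypothetical toy graph: 3 nodes with 4-dim features, depth K = 2
feats = {v: np.random.randn(4) for v in "abc"}
nbrs = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
mean_agg = lambda vecs: np.mean(vecs, axis=0)
Ws = [np.random.randn(4, 8), np.random.randn(4, 8)]
z = graphsage_forward(feats, nbrs, [mean_agg, mean_agg], Ws)
```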

3.3.2 Training of model parameters

One of the main strengths of the GraphSAGE framework is the ability to provide embeddings for nodes to be used in any downstream machine learning task. As such, the authors provide a methodology to train GraphSAGE aggregation functions in an unsupervised manner, by designing a loss function which minimizes similarities between nodes in distant neighborhoods and maximizes the similarity between nodes that co-occur on paths of random walks.

$$J_{\mathcal{G}}(z_u) = -\log\left(\sigma(z_u^\top z_v)\right) - Q \cdot \mathbb{E}_{v_n \sim P_n(v)} \log\left(\sigma(-z_u^\top z_{v_n})\right) \quad (3.3)$$

Equation 3.3 defines the unsupervised loss J_G(z_u), which may be used to tune the parameters of the propagation matrices W^{1…K} as well as the parameters of the aggregation functions θ^{1…K}, using an optimization algorithm such as stochastic gradient descent. Here v is a node that co-occurs with u on a random walk, Q weights the negative-sample term, and P_n is a negative sampling distribution, which can intuitively be seen as the inverse of the positive sampling function N used in forward propagation. The unsupervised loss is minimized when the dot product of a node embedding z_u and its neighbor z_v increases and when the dot product of embeddings for non-neighbors z_u and z_{v_n} decreases.
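A minimal sketch of this loss for a single positive pair, with negative samples drawn beforehand (all names and shapes hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def unsupervised_loss(z_u, z_v, z_neg, Q=5):
    """Unsupervised GraphSAGE loss (eq. 3.3) for one (u, v) co-occurrence pair.

    z_u, z_v: embeddings of nodes co-occurring on a random walk
    z_neg:    array of shape (num_neg, dim) of negative samples ~ P_n
    Q:        weight on the negative-sample term
    """
    positive = -np.log(sigmoid(z_u @ z_v))
    negative = -Q * np.mean(np.log(sigmoid(-z_neg @ z_u)))
    return positive + negative
```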

For any end-to-end learning task in which GraphSAGE embeddings are used for downstream classification, the loss function in equation 3.3 may be replaced by a supervised loss function (e.g. binary/categorical cross-entropy). In this case the error signal used to update the differentiable aggregation functions is supplied by maximizing the likelihood of the dataset D under the provided vector representations z_v, ∀v ∈ V, as follows:

$$J_{\mathcal{G}}(z_u) = -\sum_{i=1}^{|Y_D|} t_i \log\left(f_\phi(z_u)_i\right) \quad (3.4)$$

where t_i is the element of the one-hot encoded label vector corresponding to class i and the classifier is parameterized by ϕ. In our work we use this loss to optimize the classifier f_ϕ and the GraphSAGE parameters jointly, in an end-to-end fashion.

3.3.3 Aggregation functions

Hamilton et al. propose three different types of aggregation functions to be used for generating node embeddings. The authors state that aggregation functions should work symmetrically with respect to their input, meaning that the functions are invariant to the order in which nodes appear in a sampled neighborhood. The first candidate aggregation function simply calculates the mean of the current node representation h^{k−1}_v and its sampled neighborhood, skipping the concatenation step:

$$\text{Mean Aggregator:} \quad h^k_v \leftarrow \sigma\left(W^k \cdot \mathrm{mean}\left(\{h^{k-1}_v\} \cup \{h^{k-1}_u, \forall u \in \mathcal{N}(v)\}\right)\right) \quad (3.5)$$

This function can be seen as a linear approximation of a spectral convolution, and is closely related to the propagation rule used by Kipf and Welling.

Using an LSTM cell, the authors also propose an LSTM aggregator. Because LSTMs are inherently directional, hence not symmetric, the aggregation function is adapted to take as input random permutations of the neighborhood set u ∈ N(v). Remarkably, despite LSTMs being non-symmetrical, the authors still reported state of the art results on a Reddit community classification task where each post is represented by its textual content.

Lastly, the authors propose a max-pooling based aggregation function using a single fully connected neural network layer, followed by a non-linearity and the max-pooling operator:

$$\text{Pooling Aggregator:} \quad h^k_v = \max\left(\{\sigma(W^k_{pool} h^{k-1}_u + b^k), \forall u \in \mathcal{N}(v)\}\right) \quad (3.6)$$

Principally, the pooling operator could be replaced by any element-wise aggregation method (e.g. mean-pooling), and the linear layer by any deep neural network architecture taking a single node as input. At the time of writing, the pooling aggregation function achieved state of the art results on the Web of Science citation dataset. Any aggregation function is required to adhere to the following requirements: its output needs to be differentiable with respect to its parameters θ_k, it needs to be able to operate on un-ordered sets of vector embeddings and it should (ideally) act symmetrically with respect to its input. These constraints leave much room for experimentation into what functions might yield the highest discriminative power for a given machine learning task; a minimal sketch of the mean and pooling aggregators is given below.
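A numpy sketch of equations 3.5 and 3.6 (shapes and the ReLU non-linearity are illustrative assumptions):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def mean_aggregator(h_v, h_neighbors, W):
    """Equation 3.5: mean of the node and its sampled neighborhood."""
    m = np.mean(np.vstack([h_v[None, :], h_neighbors]), axis=0)
    return relu(W @ m)

def pooling_aggregator(h_neighbors, W_pool, b):
    """Equation 3.6: element-wise max over transformed neighbor vectors."""
    return np.max(relu(h_neighbors @ W_pool.T + b), axis=0)

# Hypothetical usage with a 4-dim node and 3 sampled neighbors
h_v = np.random.randn(4)
h_nb = np.random.randn(3, 4)
out_mean = mean_aggregator(h_v, h_nb, np.random.randn(8, 4))
out_pool = pooling_aggregator(h_nb, np.random.randn(8, 4), np.zeros(8))
```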


In the work of Veličković et al. [2017], the authors propose a self-attention mechanism which operates over the input neighborhood N(v) of a selected node v. The idea behind self-attention over graphs is to use the features h_u of a given node neighborhood to determine how important each neighbor is to the representation of the chosen node h_v. The process can be seen in equations 3.7 and 3.8, fully expanded in our current GraphSAGE notation:

$$a_{vu} = \frac{\exp\left(\mathrm{LeakyReLU}\left(a^k \left[\mathrm{concat}(W^k h^{k-1}_v,\, W^k h^{k-1}_u)\right]\right)\right)}{\sum_{u' \in \mathcal{N}(v)} \exp\left(\mathrm{LeakyReLU}\left(a^k \left[\mathrm{concat}(W^k h^{k-1}_v,\, W^k h^{k-1}_{u'})\right]\right)\right)} \quad (3.7)$$

$$\text{Attentional Aggregator:} \quad h^k_v = \sigma_{(\text{optional})}\left(\sum_{u \in \mathcal{N}(v)} a_{vu} W^k h^{k-1}_u\right) \quad (3.8)$$

In the graph attention (GAT) model rewritten as a GraphSAGE aggregation function, the matrix W^k is a linear transformation at search depth k with dimensionality R^{F′×F}, where input node vectors are expected to have dimension h_i ∈ R^F. The attention mechanism can be seen in equation 3.7 and is composed of a single-layer neural network, parameterized by a^k ∈ R^{2F′}, followed by a LeakyReLU non-linearity. The resulting attention coefficients are normalized across all neighborhood nodes using the softmax function and are used in equation 3.8 to scale the input features of all neighboring nodes ∀u ∈ N(v).
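Equations 3.7 and 3.8 can be sketched in the same style (a minimal single-head version; all shapes are hypothetical):

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def attention_aggregate(h_v, h_neighbors, W, a):
    """GAT-style attentional aggregation (equations 3.7 and 3.8).

    h_v:         (F,) representation of node v at depth k-1
    h_neighbors: (M, F) representations of the sampled neighborhood
    W:           (F_out, F) shared linear transformation W^k
    a:           (2 * F_out,) attention vector a^k
    """
    Wv = W @ h_v
    Wu = h_neighbors @ W.T                           # (M, F_out)
    scores = leaky_relu(
        np.array([a @ np.concatenate([Wv, wu]) for wu in Wu]))
    alpha = np.exp(scores) / np.sum(np.exp(scores))  # softmax over N(v)
    return (alpha[:, None] * Wu).sum(axis=0)         # weighted sum, eq. 3.8

# Hypothetical usage: node with 3 neighbors, F = 4, F_out = 6
h_v = np.random.randn(4)
h_nb = np.random.randn(3, 4)
out = attention_aggregate(h_v, h_nb, np.random.randn(6, 4), np.random.randn(12))
```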


Chapter 4: Method

In this chapter we describe the complete steps of creating a graphical representation of a dialogue corpus D, splitting this graph into training, validation and testing data, and ultimately generating node embeddings using GraphSAGE that are fed as input to our downstream classification model for DA-prediction. An overview of our training and inference pipeline can be seen in Figure 4.1.

A dialogue corpus suitable for our method is at minimum expected to contain fields identifying which conversation an utterance belongs to, meaning a unique identifier for each dialogue d_i ∈ D. Furthermore, each dialogue requires separate utterances u_j ∈ U_{d_i} annotated with an actor tag a_k ∈ A_{d_i}, as well as the corresponding dialogue act label from the set of labels used to annotate the corpus, y_j ∈ Y_D. Initially, the corpus is separated into splits D_train, D_valid and D_test and fed through a preprocessing step, taking as input the raw utterance text of each utterance in a given split. Details of our pre-processing implementation can be seen in Section 5.2.

[Figure 4.1 depicts two pipelines. Training: a labelled dialogue corpus (e.g. SwDA) → pre-processing of utterance text → creation of the graphical representation (see Algorithm 4.2) → ELMo embedding → joint training of the GraphSAGE aggregation function(s) and the dialogue act classifier under a supervised loss, with validation. Inference: a single unseen dialogue → pre-processing → graph creation → ELMo → the trained aggregation function(s) and classifier → predicted DA-labels, evaluated with precision, F1-micro and F1-macro.]

Figure 4.1: Training and inference pipeline of our method based on GraphSAGE.


After utterances have been processed, we employ our proposed graph creation schema, outlined in Algorithm 4.2, to represent the dialogues as directed graphs. We note that our practical implementation performs train/valid/test selection after graph creation on the whole dataset D, but for clarity we visualize the process as separate, in accordance with common data-split practices. Each node representing an utterance is subsequently embedded using the ELMo embedding method of Peters et al. [2018]. The proposed methodology is, however, agnostic to the choice of raw text embedding. This is followed by a supervised training process in which our selected GraphSAGE aggregation functions, as well as the DA-classifier, are trained jointly in an end-to-end fashion, maximizing the likelihood of our corpus using gradient descent.

At inference time, often called the inductive setting, the chosen set of test nodes is fed as input to the trained GraphSAGE aggregation functions, whose output is fed to the trained classifier for prediction.

4.1 Dialogue Graph

Because our ultimate goal is to classify each utterance into a distinct categorical label, each utterance is represented as a unique node in the graph. Our problem is henceforth stated as a node classification task. We define three separate types of nodes: actors, utterances and a global start node. An utterance node v_utt ∈ V is represented by a continuous vector embedding of the preprocessed utterance text. Actor nodes are used to represent the flow of the conversation in terms of who speaks to whom, and are initialized as continuous vectors sampled from a Gaussian distribution. Each graphical representation contains a single start node, which is used to signify the beginning of a dialogue. Our intuition is that by explicitly modelling the beginning of a dialogue using a fixed conversation-starting node, our model can learn to detect nodes appearing early in the conversation, which often represent utterances of types Greeting or Original Question. In Figure 4.3 we illustrate the graph created using an excerpt from a dialogue corpus as input. Black edges represent the flow of the conversation between utterances. We refer to these as flow edges, e_flow, which allow nodes to carry information to all following nodes in the dialogue. In order to capture the relation between the intents of the actors present, we create bi-directional edges between utterances and the actor producing them. These edges, named actor edges, e_act, allow nodes to pass messages inside neighborhoods surrounding the person speaking.

Figure 4.2: Constructing a directed graph from corpus D

Input: A dialogue corpus D containing dialogues d_i ∈ D, i ∈ [1, 2, …, N].

1. Preprocess all utterances: u_j^{proc} ← preProcess(u_j), ∀u_j ∈ U_{d_i}, ∀d_i ∈ D
2. Embed all utterances: h_j ← embedding(u_j^{proc}), ∀u_j^{proc} ∈ U_{d_i}^{proc}, ∀d_i ∈ D
3. Initialize an empty graph G = {V = ∅, E = ∅}
4. Add the start node to the graph: addNode(v_start, V)
5. For all dialogues d_i ∈ D:
   (a) For all utterances u_j ∈ U_{d_i}:
       i.   v_j^{utt} ← initNode(h_j)
       ii.  addNode(v_j^{utt}, V)
       iii. if actorOf(u_j) not in V: addNode(v_j^{act}, V)
       iv.  if j = 0: createEdge(v_start, v_j^{utt}, e_flow, E)
       v.   if j ≠ 0: createEdge(v_{j−1}^{utt}, v_j^{utt}, e_flow, E)
       vi.  createBiDirectedEdge(v_j^{act}, v_j^{utt}, e_act, E)
       vii. For all other actor nodes v_k^{act}, a_k ∈ A_{d_i}, a_k ≠ actorOf(u_j):
            createEdge(v_j^{utt}, v_k^{act}, e_act, E)
6. Return the completed graph G

4.2 Graph construction methodology

Formally, we define the graph creation process as the procedure outlined in Algorithm 4.2. Given any dialogue corpus which meets the aforementioned criteria, the utterances are preprocessed and embedded using a given embedding method (e.g. ELMo, GloVe, BERT). The start node is added to the graph, initialized as a continuous vector sampled from a Gaussian distribution, in the same manner as the actor nodes. We iterate through each dialogue in the corpus and, for each dialogue, create one actor node for each actor present as well as unique nodes for each utterance. The edges of the graph are entirely directional, as is the nature of the human to human conversation we hope to model using our aggregation functions. Each utterance u_j ∈ U_{d_i} is assumed to originate from a single actor a_k ∈ A_{d_i} and to be directed towards all other actors (listeners). The state of the actor nodes will therefore be updated, through aggregation, by the utterances said by actor a_k herself and by what has been said to actor a_k by other actors. An illustration of a resulting graph representation can be seen in Figure 4.3. We note that the labels indicated in bold font on the right-hand graph are not present in the output graph; they serve to visualize which node belongs to which utterance in the corpus excerpt on the left. The labels, as with most supervised tasks, are recorded as vectors Y_{d_i} ∈ D, separate from the input data. At inference time, our model must learn to recognize the correct DA of the utterance u_j = 'Right', depending on its neighborhood N(u_j). A code sketch of the construction procedure follows below.
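Under the assumptions above, Algorithm 4.2 can be sketched in Python using networkx (a sketch only; the helper names and the tuple-based corpus format are hypothetical):

```python
import networkx as nx
import numpy as np

def build_dialogue_graph(corpus, embed, dim=1024):
    """Sketch of Algorithm 4.2 using networkx.

    corpus: iterable of dialogues; each dialogue is a list of
            (actor_id, utterance_text, da_label) tuples.
    embed:  callable mapping utterance text to a feature vector
            (e.g. a mean-pooled ELMo embedding).
    """
    G = nx.DiGraph()
    G.add_node("start", x=np.random.randn(dim))        # global start node
    for i, dialogue in enumerate(corpus):
        prev = "start"                                 # j = 0 links to start
        for j, (actor, text, label) in enumerate(dialogue):
            utt = f"d{i}_u{j}"
            act = f"d{i}_a{actor}"                     # per-dialogue actor node
            G.add_node(utt, x=embed(text), y=label)
            if act not in G:
                G.add_node(act, x=np.random.randn(dim))
            G.add_edge(prev, utt, kind="flow")         # e_flow edge
            G.add_edge(act, utt, kind="actor")         # bi-directed e_act
            G.add_edge(utt, act, kind="actor")
            for other in {f"d{i}_a{a}" for a, _, _ in dialogue} - {act}:
                if other not in G:
                    G.add_node(other, x=np.random.randn(dim))
                G.add_edge(utt, other, kind="actor")   # utterance -> listeners
            prev = utt
    return G
```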

Actor | Utterance | Dialogue Act

d_0 ∈ D:
A | because I was n't really like listening to the world go by back then |
A | so it 's hard to compare . |
A | You know what I mean ? |
B | Right . | Affirmative non-yes answer
A | But , uh , it seems like when I li , I lived in Chicago , |
A | I , I 've lived down here . |
A | I hear the dropout rates from the schools , |

d_1 ∈ D:
B | yeah , |
B | I was going to say more , it 's more personal for one thing . |
B | You probably have a better team , uh , cooperation , or team playing , atmosphere . |
A | Right . | Acknowledge (Backchannel)
B | Probably , where as in a bigger corporation , I think you 're just a number , you know . |
A | Yeah . |
A | You end up being your own person – |

d_2 ∈ D:
B | Well , I think capital punishment is supposed to be primarily a deterrent to other people . |
A | Yes . |
B | You know , who would see it . |
A | Right , | Agree/Accept
A | that would be the intent of it . |
B | Yeah , |
B | but I 'm not sure how successful that is . |

[Right-hand side of Figure 4.3: the corresponding directed dialogue graph, with one actor node per actor per dialogue (e.g. "Node representing actor A in dialogue 0"), utterance nodes linked consecutively by flow edges, and actor edges connecting each utterance to its actor node.]

Figure 4.3: Left: An excerpt from a dialogue corpus displaying utterances and actor tags. Utterance ’Right’ may take on different labels depending on context. Right: Illustration of the directed dialogue graph created by applying steps outlined in Algorithm 4.2.


Chapter 5: Experimental Evaluation

5.1 Datasets

To test the performance of our proposed method we utilize the Switchboard Dialogue Act Corpus (SwDA) [Jurafsky and Shriberg, 1997] and the ICSI Meeting Recorder Dialogue Act Corpus (MRDA) [Shriberg et al., 2004] benchmark datasets. Because DA-classification, in a methodological sense, is analogous to techniques in intent recognition and emotion tracking, we also perform experiments on the multi-domain conversational search dataset MANtIS [Gustavo Penha and Hauff, 2019] and the online support forum Q&A dataset MSDialog [Qu et al., 2018]. For an overview of the characteristics of each dataset, we refer the reader to Table 5.1.

Dataset | Number of dialogues | Speakers/dialogue | Utterances/dialogue | Number of classes
SwDA | 1155 | 2 | 191.9 | 43
MRDA | 75 | 6.10 | 1480.30 | 5
MSDialog | 2199 | 2 | 4.56 | 12 (multi-label)
MANtIS | 1356 | 2 | 4.94 | 9 (multi-label)

Table 5.1: Overview of datasets used in this thesis.

5.1.1 SwDA: The Switchboard Dialogue Act Corpus

SwDA consists of 1155 conversations and is an annotation extension, completed in 1997, that built upon the already existing Switchboard-1 Telephone Speech Corpus, collected in 1990 by Texas Instruments [Godfrey and Holliman, 1993]. About 2,400 conversations were recorded from all areas of the United States; participants were randomly paired up and provided with topics of discussion. The conversations are five minutes long and were recorded over the telephone. 302 male and 241 female participants collectively discussed about 70 different topics. In 1997 a subset of the gathered calls was annotated in fine-grained detail, out of which 42 final classes were created by hand-clustering the most frequently occurring categories. As such the dataset uses 42 distinct classes of dialogue acts, with an average of 271 utterances per conversation. The dataset is highly imbalanced: the most frequent classes, Statement-non-opinion, Acknowledge, and Statement-opinion, occur 36%, 19%, and 13% of the time, respectively. The remaining 39 dialogue acts each occur less than 5% of the time, spread over the remaining 32% of the utterances in the corpus.

5.1.2 MRDA: ICSI Meeting Recorder Dialogue Act Corpus

The ICSI Meeting Recorder Dialogue Act Corpus, published by Shriberg et al. [2004], is a corpus of recorded dialogue from about 72 hours of 75 naturally occurring meetings. The topics discussed at the meetings concerned automatic speech recognition, natural language processing, and other language theories. With 53 unique speakers, each recording has been segmented into chunks of about 10 minutes of content, with on average 6 people per meeting. Because of the complicated interaction structures there is high overlap between the speakers, as is to be expected from natural meetings. The labels of MRDA are an adaptation of the SwDA label set; as such, their distribution is also highly skewed. Out of the 54 labels used, the top five by frequency are Statement 42.85%, Backchannel 8.42%, Floor-holder 4.65% (words such as "anyway", "I mean", and "uhm", used to pause and then continue speaking), Abandoned 4.39%, and Acknowledgement 4.05%. Most remaining tags each occur less than 1% of the time. The dataset was collected using a mixed method of automatic and human-corrected transcription of audio files, as described in greater detail in Janin et al. [2003]. In this study we adopt the five-class mapping procedure outlined by Ang et al. [2005] and group the dialogue act labels from the original 56 unique categories down to the higher-level classes Statement (58%), Disruption (13%), Backchannel (12%), Filler (7%), and Question (6%).
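The grouping itself can be implemented as a simple lookup table. The sketch below shows the idea for a handful of tags; the tag spellings and the fallback rule are assumptions for illustration, as the full mapping of Ang et al. [2005] covers all original categories.

```python
# Illustrative subset of the MRDA five-class grouping; tag spellings
# are assumptions, not the complete official mapping.
MRDA_FIVE_CLASS = {
    "s": "Statement",     # statement
    "b": "Backchannel",   # backchannel
    "fh": "Filler",       # floor-holder
    "fg": "Filler",       # floor-grabber (assumed tag)
    "%-": "Disruption",   # interrupted/abandoned (assumed tag)
    "qy": "Question",     # yes/no question
    "qw": "Question",     # wh-question
}

def to_five_class(tag: str) -> str:
    # Assumed fallback: unmapped tags default to the majority class.
    return MRDA_FIVE_CLASS.get(tag, "Statement")
```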


5.1.3 MANtIS: multi-domain information seeking dialogues

The MANtIS dataset, introduced by Gustavo Penha and Hauff [2019], is a self-proclaimed ideal dataset for conversational search research. The dataset contains slightly more than 80 thousand dialogues spanning 14 different topics of conversation. The authors define two primary goals of conversational search that guide the data collection process: Information-Need Elicitation and Information Presentation. Respectively, these entail that an ideal conversational search (CS for short) dataset should represent a system’s inquiries into a user’s information need as well as clear attempts at delivering the relevant information. Following these criteria, the authors proceed to collect data based on the following requirements:

• The dataset should contain conversations where both user and agent produced at least two utterances per person. This is simply referred to as Multi-Turn Dialogues.

• The user must clearly show an information need for the dialogue to be considered information seeking.

• The dataset should contain mixed-initiative clarification questions to facilitate a search process.

• Utterances should be labelled with possibly multiple intent labels, such as a combination of "Greeting" and "Follow Up Question".

• The dataset needs to contain multiple topics of conversation such as to cover a broad domain against which to evaluate CS-systems.

• The answers provided by information providers (or Agents) should be grounded in some traceable base of facts, and not simply the conversation history.

The dataset has been collected from question-answering threads on the forum Stack Exchange. A conversation is deemed fit to be included in the dataset based on several rules, including no spam/offensive content as well as clearly stated positive feedback from the information seeker at the end of the dialogue. For our purposes, we use MANtIS to investigate the performance of our sequence-to-sequence discriminative model. Out of the 80k collected conversations, 1356 have been sampled for manual utterance labeling. Nine different intent labels were used, based on labels proposed by Qu et al. [2018], with a Krippendorff's annotation α of 0.71 [Krippendorff, 2011]. Similarly to the SwDA and MRDA datasets, the labels are highly imbalanced, with the classes Original Question, Further Details, and Potential Answer collectively making up more than 60% of the dataset. An important note about the MANtIS dataset is that around 21% of all utterances were annotated with more than one label, in accordance with the aforementioned rules. This makes our task more complex, as the problem takes on the form of multi-label classification. We also note, from our initial data exploration phase, that none of the dialogues selected for user-intent classification carries the community-given tag for a correct answer. For our purposes, this lack of positive feedback should be negligible for detecting user intent, but may be of importance for future work studying the effect of successful vs. failed information seeking.
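Since both MANtIS and MSDialog attach possibly several intent labels to a single utterance, the targets must be encoded as multi-hot vectors rather than one-hot classes. The snippet below is a minimal sketch of this step using scikit-learn's MultiLabelBinarizer; the label abbreviations follow the tag set of Qu et al. [2018], but the code itself is illustrative rather than our exact pipeline.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Each utterance carries a set of intent labels (often just one).
labels_per_utterance = [
    {"OQ"},        # Original Question
    {"FD"},        # Further Details
    {"PA"},        # Potential Answer
    {"GG", "FQ"},  # Greetings/Gratitude + Follow Up Question
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(labels_per_utterance)
print(mlb.classes_)  # alphabetical label order: ['FD' 'FQ' 'GG' 'OQ' 'PA']
print(Y)             # one multi-hot row per utterance, e.g. [0 1 1 0 0] last
```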

5.1.4 MSDialog: Microsoft community support forum

The MSDialog dataset contains question-answering interactions from the Microsoft Community product support forum [Qu et al., 2018]. It has been annotated by Amazon Mechanical Turk workers for the purpose of performing research on information-seeking, human-to-human interaction (e.g. flow models and intent distributions). In our research, we consider the intent labels as utterance labels in a sequence-of-text to sequence-of-labels classification problem. To that end, MSDialog provides 2199 dialogues that have been selected for utterance-level annotation, also via Amazon Mechanical Turk. The conversations were chosen based on the following rules:

1. A dialogue should contain three to ten turns.

2. It should contain two to a maximum of four participants. This is in contrast to the MANtIS dataset, in which turn-labeled conversations could only occur between at most one seeker and one provider of information.

3. At least one of the answers provided in the dialogue has to contain a ”Correct Answer” tag provided by the community. This serves as an indication of conversation quality.

4. The dialogue has to be categorized as one of the major product types: Bing, Windows, Office or Skype.

Much like in MANtIS, an utterance may be given more than one utterance label, such as $y_j = \{\text{GG}, \text{FQ}\}$, $y_j \in Y_{d_i}$, where GG and FQ are defined as "Greetings/Gratitude" and "Follow Up Question", respectively.

Because of the nature of online forums, in contrast to spoken human conversation, the turns are much longer, with an average of 65 words per utterance. Each dialogue is also shorter, with on average 4.56 turns per annotated dialogue; a similar statistic can be observed for MANtIS. For both MANtIS and MSDialog, we consider "users" and "agents" to be the two separate types of recognized actors. As such, even though multiple people may take on the role of user or agent, we choose to model each dialogue as having only two actor nodes. Our motivation is that modelling each actor individually by name, in the short dialogues present on online forums, would make the dialogue flow harder for our model to learn.
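As a concrete illustration of this modelling choice, the sketch below collapses all named forum participants into the two actor roles before the graph construction of the previous chapter is applied. The heuristic that the first poster is the information seeker is an assumption made for this example, not a rule taken from the datasets.

```python
# A minimal sketch of mapping forum participants onto the two
# recognized actor roles ("user" vs. "agent"); the seeker-detection
# heuristic is an illustrative assumption.
def normalize_actors(dialogue):
    """dialogue: list of (participant_name, utterance, labels) tuples."""
    seeker = dialogue[0][0]  # assume the dialogue opener is the "user"
    return [
        ("user" if name == seeker else "agent", utterance, labels)
        for name, utterance, labels in dialogue
    ]
```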
