
Neural Language Modelling of Dialogue Act Sequences in Varying Contexts

Alexandra M.D. Spruit
11262273

Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Dr. A.J. Sinclair
Institute for Logic, Language and Computation
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam


Abstract

This thesis examines the prediction performance of dialogue act (DA) sequences on the Barcelona English Language Corpus in various contexts, using neural language models. The influence of text embeddings, sequence length and the language ability level of the student on the prediction performance is analysed. Little research has been done on simple DA prediction given a history of DAs compared to DA classification. In this thesis, an LSTM model with high-level input features and a model with additional text embeddings were evaluated against several baselines, including n-gram models. The performance of the baseline models showed that the research problem of predicting DAs given a sequence of DAs on this small imbalanced corpus was challenging, and the accuracies remained low for the improved neural models. The prediction performance of the LSTM did not improve significantly by adding text embeddings to the input feature set. There was also no significant difference between the prediction performances using varying sequence lengths, nor was there a significant difference between language ability levels. However, both sequence length and language ability influenced the predictability of DA classes indicative of disfluent speech. These results suggest that a larger data set encompassing more languages is needed to investigate the DA patterns characteristic of second language tutoring dialogue.


Contents

1 Introduction
  1.1 Research Questions
  1.2 Hypotheses
  1.3 Thesis Overview
2 Related Work
  2.1 Dialogue
  2.2 Dialogue Act Labelling
  2.3 Language Models
    2.3.1 N-gram Models
    2.3.2 Neural Language Models
3 Method
  3.1 The Corpus
    3.1.1 Unigram Statistics
    3.1.2 Bigram Statistics
  3.2 Simple Baselines
  3.3 N-gram Models
  3.4 Neural Language Models
    3.4.1 Neural Language Model with High-Level Input Features
      3.4.1.1 Input Features
      3.4.1.2 Weighting
      3.4.1.3 Model Architecture
      3.4.1.4 Hyperparameter Search
    3.4.2 Neural Language Model with High-Level Input Features and Sentence Embeddings
      3.4.2.1 Input Features
      3.4.2.2 Model Architecture
      3.4.2.3 Heuristic Hyperparameter Selection
  3.5 Evaluation Methods
4 Results
  4.1 Confusion Matrices
    4.1.1 Baselines
    4.1.2 Neural Language Model with High-Level Input Features
      4.1.2.1 Per Level
    4.1.3 Neural Language Model with High-Level Input Features and Sentence Embeddings
      4.1.3.1 Best Performing Sequence Lengths and Longest Sequence Length over All Levels
      4.1.3.2 Per Level Averages over All Sequence Lengths
  4.2 Accuracy of Different Models
    4.2.1 Average per Class F1-Score Model Comparison
    4.2.2 Per Class F1-Score per Model Comparison
    4.2.3 Average per Class F1-Score per Level per Model Comparison
    4.2.4 Per-Class Prediction Performance per Level for Sequence Lengths 2, 10 and 20
5 Analyses
  5.1 Corpus
  5.2 Error Analysis
    5.2.1 Baselines
    5.2.2 Model with High-Level Input Features
    5.2.3 Model with High-Level Input Features and Sentence Embeddings
  5.3 Results Analysis
    5.3.1 The Performance of Neural Language Models Compared to the Baseline Performance
    5.3.2 The Influence of Sentence Embeddings on the Prediction Performance
    5.3.3 The Influence of Sequence Length on the Prediction Performance
    5.3.4 The Influence of Ability Level on the Prediction Performance
6 Conclusion
7 Discussion
A Top 40 Most Common Bigrams per Level
B Hyperparameter Searches
  B.1 Model with High-Level Input Features
  B.2 Model with High-Level Input Features and Sentence Embeddings
C Other Results
  C.1 Confusion Matrices for the Model with Sentence Embeddings for all the Sequence Lengths
  C.2 Average Confusion Matrices for the Unweighted Model with High-Level Input Features per Level
  C.3 Precision and Recall Graphs


1 Introduction

It is widely known that human one-on-one tutoring can have a positive effect on student learning. Now that artificial intelligence is starting to play a bigger role in people's everyday lives, research into intelligent tutoring systems (ITSs) is becoming increasingly popular. Human tutors are great at adapting to their students and, thus, at personalising tutoring sessions. To make artificial tutors as adaptive as human tutors, however, the tutoring strategies that these human tutors use need to be uncovered and adopted to improve the effectiveness of ITSs. A lot of research has already been done on extracting these tutoring dialogue structures in different tutoring settings [22] [4] [14] [29] [3]. The studies mentioned above used tutoring dialogues where the learning objective was Computer Science, physics or mathematics. In this thesis, however, dialogues from a second language (L2) tutoring setting are explored, where the learning objective is the language itself.

There is a lot of theoretical research on conversational mechanisms in L2 tutoring and between people in general, but this thesis takes a data-driven approach to the topic. That is why this thesis will focus on related data-driven research instead of theoretical research in Section 2. This thesis will model sequences of annotated utterances, also called Dialogue Acts (DAs), using neural language models. The performance on the prediction of the next DA in a sequence will be explored under different input settings and compared across student ability levels. The prediction performance will be evaluated on the Barcelona English Language Corpus (BELC) [16]. The findings will uncover hidden patterns in the dialogues' structure.

1.1 Research Questions

This thesis aims to answer the following Research Question:

How does prediction performance of dialogue acts vary depending on the context given a sequence of labelled dialogue turns?

This research question can be split into the following subcategories which will form the structure of the analysis of this thesis:

RQ1: How does the addition of textual input data influence the prediction performance?

RQ2: How does the prediction performance vary for different sequence lengths x?

RQ3: How does the ability level of the student influence the prediction performance for models given different sequence lengths?

1.2 Hypotheses

It was expected that the neural language models would learn sequences of DAs, as previous language modelling studies found structures in (tutoring) dialogue, see Section 2. Therefore, the preceding sequence of DAs would facilitate the prediction of the DA in the next turn. The subcategories of the research question above were motivated by the following hypotheses:

(RQ1) It was expected that textual data would increase the accuracy of the predictions, because the textual data would add more input features to the model, on which the model could base its predictions.

(RQ2) The varying sequence lengths x were likely to influence the prediction performance, as they determine the amount of context that is considered in the predictions.

(RQ3) The neural language models were likely to produce different predictions for different language ability levels of the student, as earlier studies working with the BELC corpus found that DAs, DA sequences and ratios between tutor and student initiative varied with different levels of student ability [25] [26] [24].

These L2 tutoring dialogue structures will shed light upon the tutoring mechanisms second language tutors use during tutoring dialogue. This will in turn help improve tutoring agents and help assess human L2 tutors.

1.3 Thesis Overview

The following section, Section 2, will provide background information on utterance annotation as it was done in the past. Furthermore, it will introduce several language modelling (LM) techniques, of which this thesis will make use. Section 3 discusses the corpus, models and evaluation methods used in this thesis. It will show the imbalanced nature of the DA classes in the corpus and the selection of a balanced test set. Afterwards, Section 4 explores the influence of a weighted loss function, the addition of word information to the input data and the sequence lengths on the prediction performance of neural language models. The prediction performance of several baselines and the influence of ability level on the prediction performance are also discussed. Section 5 will then analyse these results to find answers to the research questions in Section 1.1. It will consist of an in-depth analysis of per-class errors under different settings and a comparison of accuracy scores between the models and baselines. Section 6 will provide an answer to the research questions and Section 7 will critically evaluate this thesis and propose ideas for future research.

2 Related Work

2.1 Dialogue

In dialogue analysis, dialogues are often analysed on the scale of utterance units [28], parts of speech that are: “under one intonation contour, (...) bounded by pauses, (...) and constituting a single semantic unit” [6]. Another unit of analysis is the turn, which is denoted by a change of interlocutor [28]. As this thesis uses dialogue transcripts instead of voice recordings, an utterance equals a sentence. For simplicity, an interlocutor can have multiple turns, as sometimes an interlocutor in the dialogue speaks more than one utterance after each other. In this thesis, an utterance refers to the transcript of the utterance spoken by an interlocutor, and the turn refers to the utterance together with the speaker information.

The next section will explain how utterances are annotated to enable higher level modelling of the dialogues in the data.

2.2 Dialogue Act Labelling

It is desirable to understand dialogue at a higher level than plain text, because this makes the dialogue structures easier to model. One of the first studies providing a way to classify utterances on a higher level used illocutionary acts to annotate them, for example representatives, directives, commissives, expressives and declarations [23]. However, the assignment of meaning to utterances is highly dependent on the context. Furthermore, the set of acts chosen to represent the data also influences what a model will be able to learn [3].

Predicting the annotation of the next utterance in the dialogue given a history of dialogue utterances can aid the creation of effective dialogue agents: by knowing the speech act of the interlocutor and the target next speech act, the model can make more informed choices for the next utterances. The speech act that is normally used in, and is suited to, tutoring dialogue analysis is the Dialogue Act (DA) [1] [8] [21] [5].

The biggest difficulty in the process of modelling these dialogue structures is the absence of enough annotated data. The automatic labelling of utterances with DAs is a task which has been explored using various language modelling techniques [27] [13] [17]. The simple prediction of the next DA given a history of DAs, however, has been less explored, because the data is less available and annotation of DAs is expensive.

When more annotated dialogue data becomes available in the future, as a result of improving automatic labelling techniques, it may be useful to predict the next DA so that the generated utterance is conditioned on the predicted role of the utterance, which could lead to better, more coherent dialogues. The first step in this process is to predict the DA given a sequence of previous DAs, and to explore where the model gets it right and wrong. To this end, multiple different models can be used, which will be discussed in the next section.

2.3 Language Models

Language models (LMs) model the probabilities of sequences of tokens and make predictions based on these learnt probabilities. Both count-based language models, e.g. Markov models using n-grams, and continuous-space language models, such as neural networks, have successfully been used to classify DAs given dialogue utterances and sequences of utterances [27] [13] [17], as mentioned in Section 2.2. For example, one data-driven study used the Markov assumption and implemented several models, including n-grams, decision trees and neural networks, to predict and detect DAs given utterances from the SwitchBoard Corpus, a data set consisting of native English phone conversations [27]. More recent studies used recurrent neural networks (RNNs) to add sequential data to the DA classification process, thereby improving the accuracy of DA classification on the Switchboard-DAMSL Corpus [17] and the DSTC4 and MRDA corpora [13]. As the classification accuracy of DAs increases when sequential data is used, the DAs appear to follow a structure in the dialogue. Even though this thesis will not classify DAs given utterance texts, these hidden structures might be helpful in the prediction of DAs given a sequence of other DAs too.

Sections 2.3.1 and 2.3.2 explain how n-gram models and LSTM models, a special kind of RNN, work.

2.3.1 N-gram Models

An n-gram model is a language model that breaks a sequence of tokens up into subsequences of n tokens at a time, counts the total number of occurrences of the unique subsequences in the training set and then computes the probability of occurrence of each subsequence according to those counts. These probabilities can then be used to make predictions on the test set, because the last token in a subsequence, given the previous tokens in that subsequence (the history), will most likely be a token that was seen to follow the same history in the training set. Most of the time there will be a distribution of possible last tokens to follow a history. The prediction is then chosen proportionally from that distribution. Sometimes there is no prediction for the given history, because that history has never been seen during training. To tackle this problem and make the n-grams more robust, back-off n-gram models are commonly used. These models try to predict according to the history using the highest order of n-grams that was trained on. If this n-gram does not yield a prediction, a lower-order n-gram is used with a shortened history, until the model backs off to unigram predictions alone [11].
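As an illustration of this mechanism, a minimal back-off bigram model over DA tokens could look as follows. This is a sketch only, not the implementation used in this thesis; the counting and sampling details are assumptions.

```python
import random
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count unigrams and, per preceding token, bigram continuations."""
    unigrams, bigrams = Counter(), defaultdict(Counter)
    for seq in sequences:
        unigrams.update(seq)
        for prev, nxt in zip(seq, seq[1:]):
            bigrams[prev][nxt] += 1
    return unigrams, bigrams

def predict_next(history, unigrams, bigrams):
    """Sample proportionally to the counts that followed the last token of
    the history; back off to unigram counts for an unseen history."""
    dist = bigrams.get(history[-1]) if history else None
    if not dist:
        dist = unigrams                       # back off to unigram predictions
    tokens, counts = zip(*dist.items())
    return random.choices(tokens, weights=counts, k=1)[0]

# toy usage with the DA abbreviations of Table 1
uni, bi = train_bigram([["GOQ", "ST", "WQ", "ST"], ["GOQ", "ST", "GOQ", "SP"]])
print(predict_next(["GOQ"], uni, bi))         # most likely "ST"
```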

There are some studies that have used n-gram models successfully on tutoring dialogue data. In these studies the tokens were DAs. The study by Litman and Forbes-Riley [14] showed promise in using n-grams to uncover the DAs and DA sequences that correlate with learning in physics tutoring dialogue. They used simple unigram and bigram analyses to find that student learning correlates with both tutor and student acts [14]. In the study by Chen et al., DA n-grams and linear regression models were used to extract DA sequences, consisting of prompts, varying instructions and feedback, that correlated with learning gain in Computer Science tutoring dialogue [4]. A later study on Computer Science tutoring dialogue also used unigrams and bigrams to find a correlation between dialogue acts and learning gain, but this time with another, improved DA annotation scheme that was not biased towards one majority class like most other DA annotation schemes [29]. These studies show there are structures in tutoring dialogues on the topics of Computer Science and physics that correlate with learning gain. However, as this thesis uses language itself as the object of learning, the kinds of DAs used in these studies are not applicable to second language (L2) tutoring dialogue, and the patterns found in those studies might not exist in L2 tutoring dialogue.

This thesis uses the prediction performance of a back-off bigram and trigram model as a baseline for the prediction performance of the LSTM models.

2.3.2 Neural Language Models

Recurrent neural networks (RNNs) are a type of artificial neural network that processes data sequentially. The RNN goes over a sequence of inputs, at each step using the current input to adjust an additional set of parameters: the hidden state. The model makes a prediction using the information of the input and the hidden state combined. During training, the generated output is used with a predefined loss function to adjust the parameters (weights) in the direction that will result in the biggest decrease of error, the gradient, during backpropagation through time [30]. The hidden state is then passed to the next step of the model, where it is again adjusted with the new input, still keeping information on the previous input as well. This new combination of the hidden state and input is then used to adjust the weights of the model again in the same way, until the end of the sequence is reached and the hidden state of the model is reset. During evaluation a similar process occurs, but the hidden state is simply passed to the next step, without evaluating the prediction against a loss function and performing backpropagation through time. There are well-known variants of the simple RNN: the Elman network and the Jordan network [7] [10] [20].

Simple RNNs cannot cope with long sequences due to their architecture. When sequences are longer than a couple of inputs, the gradient either becomes insignificantly small (the vanishing gradient problem) or too big (the exploding gradient problem) [2]. These vanishing and exploding gradients cause context that is further back in the sequence to be lost or over-emphasised. Norm clipping the gradient deals with the exploding gradient problem [18], but the vanishing gradient problem needs more adjustments.

The most common way to deal with the vanishing gradient problem is to use a Long Short-Term Memory (LSTM) model [9], a gated version of the RNN, instead of a simple RNN. This gated version contains special cells that decide which information to retain in the hidden state and which information to lose at each step in the sequence. This way, the network will retain important contextual information in long sequences. This thesis uses an LSTM model to model the DA structures for varying, longer sequence lengths.
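As a minimal illustration of this recurrence, the PyTorch sketch below passes one sequence of input vectors through an LSTM and classifies the final hidden state. The dimensions are illustrative (they happen to match Section 3.4.1.4's weighted setting) and this is not the thesis's implementation.

```python
import torch
import torch.nn as nn

# illustrative dimensions: 3 turns, 11 input features, 16 hidden units, 13 DA classes
seq_len, input_dim, hidden_dim, n_classes = 3, 11, 16, 13

lstm = nn.LSTM(input_size=input_dim, hidden_size=hidden_dim, batch_first=True)
classifier = nn.Linear(hidden_dim, n_classes)

x = torch.randn(1, seq_len, input_dim)   # one sequence of turn vectors
outputs, (h_n, c_n) = lstm(x)            # hidden state carried across the steps
logits = classifier(outputs[:, -1, :])   # predict the DA of the next turn
print(logits.shape)                      # torch.Size([1, 13])
```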

A language modelling technique that is commonly used together with neural language models is word embedding. This technique consists of reducing the dimensionality of the sparse vocabulary word vectors, while at the same time retaining the semantics of the words. This way, words for similar concepts will have similar embedding vectors. Two public pre-trained models that convert words to their word embedding vectors are word2vec and GloVe [15] [19]. In a simpler sense, embedding can refer to the reduction of the dimensionality of the input vectors. This thesis makes use of word embeddings, and it also reduces the dimensionality of non-word input features.
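The sketch below illustrates this simpler sense of embedding with PyTorch's nn.Embedding: a categorical feature index is replaced by a small dense vector. The dimensions 13 and 7 anticipate the DA embedding size chosen for the weighted model in Section 3.4.1.4; the snippet itself is only illustrative.

```python
import torch
import torch.nn as nn

# 13 DA classes embedded into 7 dimensions: the lookup replaces a sparse
# 13-dimensional one-hot vector with a dense 7-dimensional one
da_embedding = nn.Embedding(num_embeddings=13, embedding_dim=7)

da_index = torch.tensor([0])         # e.g. 'statement'
print(da_embedding(da_index).shape)  # torch.Size([1, 7])
```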

3 Method

3.1 The Corpus

This thesis uses the Barcelona English Language Corpus (BELC) [16], containing 118 dialogue transcripts of English tutoring sessions given to Spanish school children. The corpus consists of tutoring dialogues between a tutor/interviewer and a student/participant at four different student language ability levels, with level 1 as the lowest ability level and level 4 as the highest. These transcripts were labelled with a subset of the DA annotation set proposed by Stolcke [27], as these were relevant to the corpus [24]. The chosen DAs can be seen in Table 1, grouped by category and together with the abbreviations used for them in this thesis. For a sample of a dialogue transcript, see Table 2.

DA Abbreviation

statement ST

spanish SP

signal non-understanding SNU

general other question GOQ

wh-question WQ

yes/no question YNQ

backchannel question BQ

declarative yes/no question DQ

backchannel acknowledge BA

response acknowledgement RA

yes answers YA

no answers NA

repeat phrase RP

Table 1: Abbreviations of the Dialogue Acts used, grouped by category.

The corpus contains data on the utterance text, speaker, DA, dialogue ID and ability level for each turn in the dialogue.

INV GOQ and how old are you ?
PAR ST I'm sixteen years old .

INV ST and you’ve never repeated any course then you’re not a repeater .

PAR ST no .

INV WQ and when did you begin English ?

PAR ST I begin hmm .

INV GOQ eight ?

PAR RP eight or nine .

INV BQ yeah ?

INV ST in third year of primary .

Table 2: Dialogue sample from the corpus at ability level 3. ‘INV’ stands for ‘interviewer’/‘tutor’ and ‘PAR’ stands for ‘participant’/‘student’.

(11)

3.1.1 Unigram Statistics

Figure 1: Dialogue Act distribution over the entire data set.

Level Dialogue Count Average Dialogue Length

1 36 120.58

2 44 141.11

3 24 114.79

4 14 136.71

Table 3: The number of dialogues and the average dialogue length per language ability level.

As can be seen in Figures 1 and 2, the DAs do not have a balanced distribution. 'statement' is the majority class with over five thousand counts, taking up 35% of the entire data set. All the ability levels had a similar DA distribution, but varied a lot in the total number of DA counts, see Figure 2. Level 2 had the most and the longest dialogues of the data set and, therefore, also the most DAs, see Table 3 and Figure 2. Level 2 was followed by level 1 and then level 3 in the total number of DA counts, the total number of dialogues and the average dialogue length. The fourth level had the fewest dialogues and DA counts. As the first two ability levels had more DAs than the third and fourth level, the data was imbalanced over the levels as well. The data set was also relatively small for a machine learning task, having only 118 dialogues with an average of 128 utterances each to train on.


Figure 2: Dialogue Act counts per ability level.

Figures 3 and 4 show the average DA distributions per dialogue per speaker for ability levels 1-4 and the average distribution over all the levels. Note that students never used 'declarative yes/no question'.

The DA distributions varied between the speakers and between the levels. Student DAs that were indicative of lower levels were, intuitively, 'spanish' and 'signal non-understanding', see Figure 4. Students used 'statement' the most regardless of ability level. The graph also shows that students used the classes 'repeat phrase', 'response acknowledgement', 'yes/no question', 'yes answers', 'wh-question' and 'general other question' more often at higher ability levels.

As ability level increased, the tutor stopped speaking Spanish and started to use more 'signal non-understanding', see Figure 3. Tutors used 'general other question' the most, with 'statement' second. With increased ability level, the tutor started to ask fewer 'general other question's and more 'wh-question's. The use of 'statement' increased with higher ability levels for both speakers.


Figure 3: Dialogue Act distribution per ability level for the tutor.

Figure 4: Dialogue Act distribution per ability level for the student.

Figure 5 shows that tutors tended to make longer sentences than students. The difference was largest at the lower levels, where the tutors made sentences more than twice as long as those of the students. At levels three and four, the students started to make longer sentences as well, but the tutors still made the longest. Figure 5 also shows that the tutors contributed more to each dialogue than the students. The dialogue contribution ratios are approximately the same for ability levels 2 to 4, with only a slightly greater difference between tutor and student at level 1. These facts show that tutors contributed on average more utterances to the conversations and used more words in these utterances than students, but the asymmetry was less strong at higher language ability levels.

Figure 5: Average utterance length (right) and dialogue turn ratios (left) per speaker per ability level.


3.1.2 Bigram Statistics

Figure 6: Bigram distributions for ability levels 1-4. Note: the y-axis only shows the bottom 10%. The bigrams are ordered from most occurring (left) to least occurring (right).

This thesis deals with sequences of utterances, so DA bigrams were used to extract useful statistics. In Figure 6, the distributions of all the bigrams found at a specific level are plotted. The levels contained 239.5 unique bigrams on average. The distributions of the bigrams are flat, see Figure 6. Note that the y-axis only shows the bottom 10% and that the peak at the beginning is, therefore, not a big one.

The most occurring bigram for all the levels is the 'tutor'-'student' bigram 'general other question'-'statement', see Appendix A, Figures 25, 26, 27 and 28. The second most occurring bigram at levels 1, 2 and 3 is the 'student'-'tutor' bigram 'statement'-'general other question', but at level 4 it is the 'student'-'tutor' bigram 'statement'-'statement'. The top-5 most occurring bigrams for each level are made up of combinations of the four most represented classes: 'statement', 'general other question', 'spanish' and 'wh-question', see Appendix A. Therefore, the distribution of dialogue acts has an influence on the kinds of bigrams found in the dialogue.

Table 4 shows that DAs are most commonly preceded by the same DA on each level. As noted above, 'statement' as spoken by a student is mostly preceded by the 'general other question' of a tutor. In turn, a tutor that asks a 'general other question' is mostly preceded by a student that makes a 'statement' on each level, see Table 4. Some pairings change depending on the ability level; for example, a student speaking 'spanish' is mostly preceded by a tutor asking a 'general other question' at ability levels 1 and 2. However, a student speaking 'spanish' is mostly preceded by a tutor making a 'statement' at ability level 3 and asking a 'wh-question' at ability level 4. Another interesting change is that of the precedent and speaker of 'signal non-understanding'. At the first two levels, 'signal non-understanding' is mostly spoken by a student and preceded by the 'general other question' of the tutor. At levels 3 and 4 this changes, as the 'signal non-understanding' is now mostly spoken by the tutor and preceded by the student making a statement. Furthermore, a 'repeat phrase' is mostly spoken by the tutor and preceded by a 'statement' of the student at levels 1, 2 and 3, but at level 4 the 'repeat phrase' is mostly spoken by the student and preceded by a 'signal non-understanding' of the tutor.

        level 1       level 2       level 3       level 4
DA      pre-DA   S.   pre-DA   S.   pre-DA   S.   pre-DA   S.
ST      GOQ      T-S  GOQ      T-S  GOQ      T-S  GOQ      T-S
GOQ     ST       S-T  ST       S-T  ST       S-T  ST       S-T
SP      GOQ      T-S  GOQ      T-S  ST       T-S  WQ       T-S
WQ      ST       T-T  ST       T-T  ST       S-T  ST       S-T
SNU     GOQ      T-S  GOQ      T-S  ST       S-T  ST       S-T
YA      GOQ      T-S  GOQ      T-S  GOQ      T-S  GOQ      T-S
BA      ST       S-T  ST       S-T  ST       S-T  ST       S-T
YNQ     ST       T-T  ST       T-T  ST       S-T  ST       T-T
BQ      SNU      S-T  YA       S-T  YA       S-T  YA       S-T
RA      ST       S-T  ST       S-T  ST       S-T  ST       S-T
NA      GOQ      T-S  GOQ      T-S  GOQ      T-S  GOQ      T-S
RP      ST       S-T  ST       S-T  ST       S-T  SNU      T-S
DQ      ST       S-T  ST       S-T  ST       S-T  ST       S-T

Table 4: The most common predecessors for each DA per level, together with the kind of bigram they form speaker-wise. 'S.' = speakers, 'T' = tutor and 'S' = student.

3.2 Simple Baselines

Three simple baselines were used to set expectations for how hard the data in the data set is to predict. The first simple baseline was the majority class baseline. This baseline predicted only the majority class 'statement' for the entire test set. The second simple baseline, the random class baseline, predicted a random DA for each label from a uniform distribution. As the data set is imbalanced, a third baseline, the weighted random baseline, was used. This baseline predicted a DA for each label according to the DA distribution found in the corpus, see Section 3.1.1.
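A minimal sketch of the three baselines is given below, with a hypothetical class subset and distribution; the real distribution comes from Section 3.1.1.

```python
import numpy as np

rng = np.random.default_rng(0)
classes = np.array(["ST", "GOQ", "SP", "WQ"])   # illustrative subset of the 13 DAs
corpus_dist = np.array([0.5, 0.25, 0.15, 0.1])  # hypothetical DA proportions

n_test = 6
majority = np.repeat("ST", n_test)                              # majority class baseline
random_cls = rng.choice(classes, size=n_test)                   # uniform random baseline
weighted_rnd = rng.choice(classes, size=n_test, p=corpus_dist)  # weighted random baseline
print(majority, random_cls, weighted_rnd, sep="\n")
```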

Although these baselines are a good first indication, they are still too simple. Therefore, n-gram models were added as a smarter baseline for the evaluation.

3.3 N-gram Models

N-gram models have already performed well on sequence prediction language tasks and their performance is generally seen as a good baseline in LM tasks [11], see Section 2.3.1. This thesis used a bigram and a trigram LM specifically. When the n-gram model could not find a prediction for a given sequence, it would back off to a smaller n-gram model, as this makes the model more robust [11], see Section 2.3.1. See Figures 7 and 8 for diagrams of the bigram and trigram models respectively.

Figure 7: Diagram of the prediction process of the bigram model.

Figure 8: Diagram of the prediction process of the trigram model.

3.4 Neural Language Models

In order to address dialogue act sequence prediction using state-of-the-art language modelling (LM) techniques, a standard LSTM was employed, see Section 2.3.2, which, given a sequence of DAs, predicts the next DA in the sequence.

This thesis explored varying the number and type of input features to the LSTM, with two main variants: using dialogue utterance features, and adding word features to the model. This required two different architectures, which are described in Sections 3.4.1 and 3.4.2. The first variant of the model made predictions for a sequence of three dialogue act turns at a time and the second variant made predictions for multiple sequence lengths. Both models used embeddings for the input, see Section 2.3.2 for more information on embeddings.

3.4.1 Neural Language Model with High-Level Input Features

3.4.1.1 Input Features

The first LSTM model took four data features as input: dialogue act, speaker, level and utterance length. The utterance length was the length of the 'text' feature in the original data frame.

The dialogue act feature was chosen because of the information it holds on the next dialogue act in the sequence. For example, if a 'general other question' is asked, it is more likely that it will be answered by a 'statement' or a class in the 'answers' category than by another class in the 'question' category. The speaker feature was chosen as it is more common to have a bigram consisting of two different speakers, and trigrams consisting of only one speaker are very rare, because they are more a monologue than a dialogue. Therefore, the speaker feature also holds valuable information on the likelihood of the next DA. The added value of the level feature is less obvious, but Section 3.1.2, Figure 4 and Appendix A show that levels do have an influence on the distribution of bigrams. At one level, a DA will be more likely to be followed by a specific DA than at another level. Lastly, utterance length was included as an input feature, because the utterance length contains implicit information on all the previous features combined. Some DAs are specifically short, for example 'yes answers' and 'signal non-understanding'. Furthermore, as shown in Figure 5, the length of utterances varies between levels and between speakers.

The output was the DA predicted to be in the next turn. The dialogue act, speaker and level were embedded individually and then concatenated, together with the scalar value of the utterance length, to form an input vector for each dialogue turn. Three subsequent dialogue turns' input vectors formed sequences of length three. The model was trained for 20 epochs with batch size 16.
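A sketch of this input construction, assuming the weighted model's embedding dimensions from Section 3.4.1.4; the helper turn_vector is hypothetical, not the thesis's code.

```python
import torch
import torch.nn as nn

# embedding sizes for (speaker, DA, level) as in the weighted setting of
# Section 3.4.1.4; utterance length enters as a raw scalar
emb_speaker = nn.Embedding(2, 1)
emb_da = nn.Embedding(13, 7)
emb_level = nn.Embedding(4, 2)

def turn_vector(speaker, da, level, utt_len):
    """Concatenate the three feature embeddings with the scalar length."""
    return torch.cat([
        emb_speaker(torch.tensor([speaker])),
        emb_da(torch.tensor([da])),
        emb_level(torch.tensor([level])),
        torch.tensor([[float(utt_len)]]),
    ], dim=1)                                  # shape: (1, 11)

# a sequence of three turns, as fed to the first model variant
seq = torch.cat([turn_vector(0, 3, 1, 6) for _ in range(3)]).unsqueeze(0)
print(seq.shape)                               # torch.Size([1, 3, 11])
```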

3.4.1.2 Weighting

The data set was greatly imbalanced, as was mentioned in Section 3.1.1 and shown in Figure 1. This imbalance would greatly affect the patterns learnt by the model. The model would, as a consequence, overfit on the majority classes, and the minority classes might not be predicted at all. As this thesis was interested in the role of the less frequent classes as well, it was important that they too would be predicted. To deal with this problem, the common method of a weighted loss function (WLF) was used [31].

The model had two different settings, 'weighted' and 'unweighted', to investigate the effect of weighting on the prediction performance. The weights for the DAs were computed using the DA distribution visualised in Section 3.1.1, Figure 1. The aim was to make the weight of all the DAs during training equal; in other words, to trick the model into believing the DAs to be uniformly distributed by giving the less represented DAs more importance in each time step. To this purpose, the proportion of the majority class 'statement' was taken, and the weight for each individual DA was calculated by dividing the majority class proportion by the proportion of the individual class, see Figure 9. This way, the weight of all the DA classes during training would be equal.

$$w_x = \frac{\max(D)}{P(x)} \qquad (1)$$

Figure 9: The calculation of the weight during training for an individual DA. $x$ is the DA for which the weight must be calculated, $w_x$ is the weight for DA $x$, $D$ is the list of distributional values corresponding to the visualisation in Figure 1 and $P(x)$ is the distributional value of $x$ in $D$.

The unweighted setting did not adjust the weights of the individual DAs in the loss function according to the DA distribution. The more frequently occurring classes had, therefore, more impact on the adjustment of the parameters of the model than the less frequently occurring classes.

3.4.1.3 Model Architecture

Figure 10: Diagram of the architecture of the model without sentence embeddings. It takes in one input vector at a time, consisting of the concatenation of the feature embeddings, and it outputs a dialogue act in vector representation. The sequence length is three: the model goes through this process three times, each time combining information about the previous inputs with the current input.

The number of parameters in the network had to be kept to a minimum to prevent overfitting on the small data set. Furthermore, the model had only thirteen output classes and relatively small embedded input dimensions, see Section 3.4.1.4 for the exact values of the embedding dimensions, which suggested that it was not a complex problem. With this constraint and assumption, it was decided that only one hidden layer would be used. 2-layer neural networks with non-linear activation functions are universal approximators [12] and are, therefore, widely believed to be sufficient for simple problems. A diagram of the model can be seen in Figure 10.
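A minimal PyTorch sketch of such a single-hidden-layer LSTM classifier, using the weighted setting's dimensions from Section 3.4.1.4; the class name and details are illustrative, not the thesis's code.

```python
import torch
import torch.nn as nn

class DALSTM(nn.Module):
    """One LSTM hidden layer over concatenated feature embeddings, as in
    Figure 10; dimensions follow the weighted setting of Section 3.4.1.4."""

    def __init__(self, input_dim=11, hidden_dim=16, n_classes=13):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_classes)

    def forward(self, x):                  # x: (batch, seq_len, input_dim)
        hidden_states, _ = self.lstm(x)
        return self.out(hidden_states[:, -1, :])  # logits for the next-turn DA

model = DALSTM()
print(model(torch.randn(16, 3, 11)).shape)  # torch.Size([16, 13])
```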

3.4.1.4 Hyperparameter search

To find the optimal combination of hyperparameters from learning rate lr ∈ {0.01, 0.001, 0.0001}, hidden dimension hidden ∈ {12, 16, 20} and embedding dimensions for (speaker, DA, level) emb ∈ {(1, 7, 2), (2, 13, 4)}, for both the weighted and unweighted version of the model, a grid search using the accuracy measure together with cross-validation on the training set was performed, see Appendix B.1 for the results.
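The grid search itself can be sketched as follows; train_and_validate is a hypothetical helper standing in for training the LSTM and returning the mean cross-validated accuracy.

```python
from itertools import product

learning_rates = [0.01, 0.001, 0.0001]
hidden_dims = [12, 16, 20]
embedding_dims = [(1, 7, 2), (2, 13, 4)]   # (speaker, DA, level)

def grid_search(train_and_validate):
    """Exhaustively score every hyperparameter combination and keep the best."""
    best_config, best_acc = None, -1.0
    for lr, hidden, emb in product(learning_rates, hidden_dims, embedding_dims):
        acc = train_and_validate(lr=lr, hidden=hidden, emb=emb)
        if acc > best_acc:
            best_config, best_acc = (lr, hidden, emb), acc
    return best_config, best_acc

# stub scorer for demonstration; the real helper would cross-validate the LSTM
print(grid_search(lambda lr, hidden, emb: 0.2))
```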

The best cross-validated accuracy on the training set was obtained with a learning rate of 0.001 and a hidden layer of dimension 16 for both models. The optimal embedding dimensions were different for the two models. The weighted model performed best with embedding dimensions of 1, 7 and 2 for speaker, DA and level respectively, reaching a classification accuracy of 20.69% on the validation set. The unweighted model performed best with the larger embedding dimensions of 2, 13 and 4 for speaker, DA and level respectively, hereby setting the embedding dimensions to the original number of classes of the individual features. This setting gave a classification accuracy of 36.58% on the validation set. As the utterance length was concatenated to the other embeddings in the form of a scalar, the input dimension of the weighted model became 11 and the input dimension of the unweighted model became 20.

The accuracy of the unweighted model was higher than that of the weighted model because the unweighted model was more likely to predict majority classes and was, therefore, more often correct. The weighted model, however, predicted more minority classes, so its average per-class accuracy was expected to be higher.

The four input features, DA, speaker, level and utterance length, mentioned in Section 3.4.1.1, were chosen to maximise the amount of information the model would base its predictions on and, thus, optimise the accuracy. However, the overall low accuracy resulting from the hyperparameter search inspired an attempt at extending the input features, and therefore the data, even further, in the form of a sentence embedding that was added to this model, which will be discussed in the next subsection.

3.4.2 Neural Language Model with High-Level Input Features and Sentence Embeddings

3.4.2.1 Input Features

To build on and optimise the model discussed in the previous subsection, a sentence embedding was added to the already embedded features dialogue act, speaker, level and utterance length. The output stayed the same. The sentence embedding was constructed by creating word embedding vectors of dimension 50 for the words in the 'text' feature of the dialogue turn and then taking the vector sum of these word embeddings. The pre-trained word embedder GloVe was used to make the word embedding vectors [19].
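A sketch of this construction, assuming the standard plain-text GloVe release format (one word followed by its 50 floats per line; the filename is the usual release name, not necessarily the one used here). How out-of-vocabulary words were handled is not stated in the thesis, so skipping them below is an assumption.

```python
import numpy as np

def load_glove(path="glove.6B.50d.txt"):
    """Parse GloVe's plain-text format: one word followed by 50 floats per line."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            word, *values = line.rstrip().split(" ")
            vectors[word] = np.asarray(values, dtype=np.float32)
    return vectors

def sentence_embedding(text, vectors, dim=50):
    """Vector sum of the word embeddings of the words in the utterance text."""
    total = np.zeros(dim, dtype=np.float32)
    for word in text.lower().split():
        if word in vectors:          # out-of-vocabulary words are skipped here
            total += vectors[word]
    return total
```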


3.4.2.2 Model Architecture

Figure 11: Diagram of the architecture of the model with sentence embeddings. It makes word vectors of the words in an utterance 'text' and then sums the word embeddings into a sentence vector. The model takes in one input vector at a time, consisting of the concatenation of the feature embeddings, now including the sentence embedding, and it outputs a dialogue act in vector representation. The sequence length is variable, so the model goes through this process 2, 3, 5, 7, 10, 15 or 20 times, each time combining information about the previous inputs with the current input. The diagram shows 2 steps in a sequence.

The model with sentence embeddings only used the weighted setting, as the model without sentence embeddings showed the weighted setting to perform better on the minority classes. Due to the sentence embeddings, the amount of input information per dialogue turn was maximised. Therefore, the model was expected to perform substantially better on the prediction of all DA classes. A diagram of the model can be seen in Figure 11.

3.4.2.3 Heuristic Hyperparameter selection

The grid search for the previous model, the one without sentence embeddings, showed that one learning rate consistently achieved the highest accuracy for both models across all settings of the remaining hyperparameters. Therefore, it was decided that the learning rate of 0.001 should work for the new model too, as the input and hidden dimensions would not change significantly enough to impact it.

To make comparisons possible, the model with sentence embeddings should resemble the model without sentence embeddings as closely as possible. The model with sentence embeddings therefore kept the embedding dimensions for the features speaker, DA and level that were optimal given the weighted setting of the model without sentence embeddings. The concatenation of the 50-dimensional sentence embedding vector would, therefore, make an input dimension of 61. Due to this increase in the dimensionality of the input, the dimension of the hidden layer was also increased. To inform the choice of this new dimensionality, another grid search was performed on the training set with accuracy and cross-validation, this time only varying the hidden dimension within the set hidden ∈ {30, 35, 40, 45, 50}, see the table in Appendix B.2. The grid search found that the optimal hidden layer dimension was 50, with a classification accuracy of 18.77% on the validation set.

3.5 Evaluation Methods

Figure 12: Dialogue act distributions of the training and test set individually.

To evaluate the models, the data set of 118 dialogues and four language ability levels was split into a training set of 110 dialogues and a test set of eight dialogues. The test set contained an even sample of dialogues from each language ability level. All the test dialogues were selected on having one or multiple occurrences of each DA and on having a distribution similar to that of the training set, see Figure 12 for the exact distributions of both the training and the test set.

The models were first evaluated on the test set with the classification accuracy measure on all the DA classes combined. Due to the imbalance of the data set, however, the models were also evaluated with per-class precision, recall and f1-scores and an average per-class precision, recall and f1-score. The per-class and average per-class scores yield a more balanced evaluation that does not favour the correct prediction of majority classes over the correct prediction of minority classes. The accuracies for the different levels are also evaluated with an average per-class f1-score for each level. The average per-class score is calculated as in Figure 13.

$$s_{average} = \sum_{i=1}^{n} \frac{s(i)}{n} \qquad (2)$$

Figure 13: The computation of the average per-class precision, recall and/or f1-score. $s$ is the accuracy measure used, $s \in \{f1\text{-score}, precision, recall\}$, $s_{average}$ is the average per-class accuracy score and $n$ is the number of classes (in this case 13, as there are 13 unique DAs).

A simple example is the recall score for the majority class baseline. As 'statement' is always predicted, there are no false negatives for 'statement'. Therefore, the recall of 'statement' for the majority class baseline is 1. However, the recall for all the other DA classes will be 0, as those classes do not have any true positives. The average per-class recall is, therefore:

$$recall_{average} = \sum_{i=1}^{13} \frac{recall(i)}{13} = \frac{1.0 + 12 \times 0}{13} = \frac{1}{13} \approx 0.0769 \qquad (3)$$

Thus, the average per-class recall for the majority class baseline is 7.69%. Due to the small size of the data set and the imbalance, it sometimes occurred that a DA class would not be predicted at all. In this case, the precision would be a NaN value, as both the true positives and false positives for that DA would be zero, and a division by zero is not mathematically possible. Normally, this would mean that it is simply unknown how good the precision for that class is; however, in this specific case, it was expected that every DA class would be predicted at least once. Therefore, it was decided that the total absence of predictions of a DA class would not be treated as a NaN precision, but as 0 precision. This makes sense, as the absence of any predictions for the DA is the worst accuracy that could be obtained in this context. The recall could never be NaN, as all the DA classes were always in the data set, so at worst there would always be false negatives. By setting the precision to zero in the case of absent predictions, the f1-score would also become 0 instead of NaN and, therefore, the unpredicted classes could still be included in the average per-class calculations and the graphs.
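A sketch of this evaluation, computing per-class precision, recall and f1 with absent predictions scored as 0 rather than NaN; this is a manual implementation for illustration, not necessarily the code used in this thesis.

```python
import numpy as np

def per_class_scores(y_true, y_pred, n_classes=13):
    """Per-class f1 with absent predictions scored as 0 rather than NaN,
    plus the macro (average per-class) f1 of Figure 13."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1_scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        precision = tp / (tp + fp) if tp + fp > 0 else 0.0  # 0 instead of NaN
        recall = tp / (tp + fn) if tp + fn > 0 else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall > 0 else 0.0)
        f1_scores.append(f1)
    return f1_scores, float(np.mean(f1_scores))

# majority-class-style predictions: every unpredicted class scores 0
print(per_class_scores([0, 1, 2, 0], [0, 0, 0, 0], n_classes=3))
```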

The models were also evaluated with significance tests on their differences, using the t-statistic, p-value and Cohen's D between model pairings.
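A sketch of such a comparison follows; whether paired or independent tests were used is not stated, so an independent two-sample t-test and the equal-sample-size form of Cohen's D are assumed here.

```python
import numpy as np
from scipy import stats

def compare_models(scores_a, scores_b):
    """t-statistic, p-value and Cohen's D for two models' per-class scores."""
    a, b = np.asarray(scores_a, float), np.asarray(scores_b, float)
    t_stat, p_value = stats.ttest_ind(a, b)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)  # equal-n pooling
    cohens_d = (a.mean() - b.mean()) / pooled_sd
    return t_stat, p_value, cohens_d

print(compare_models([0.2, 0.3, 0.1], [0.15, 0.25, 0.05]))
```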

4 Results

This section will first show the results of the classification accuracy for the LSTM models on the test set. Then it will show the normalised confusion matrices of the labels and predictions for the different models used in this thesis. Finally, the results of the per-class and average per-class evaluation on the test set for the different models will be shown.

The accuracy of the weighted model without sentence embeddings on the test set was 21.09%. For the unweighted version of this model, the accuracy on the test set was 38.56%. The accuracies of the weighted model with sentence embeddings for different sequence lengths are shown in Table 5.

sequence length   accuracy (%)
2                 21.06
3                 20.68
5                 20.88
7                 19.45
10                20.52
15                20.46
20                20.40

Table 5: Accuracies on the test set for the model with sentence embeddings given different sequence lengths.

These scores give an idea of the overall accuracy over all the classes but, due to the imbalance in the data set, they do not say much about the per-class performance of the models, as mentioned in Section 3.5. Therefore, the remaining results in this section will only focus on per-class and average per-class evaluation.


4.1 Confusion Matrices

4.1.1 Baselines

Figure 14: Confusion matrices for the prediction performance of the baseline models. The y-axis contains the labels and the x-axis contains the predictions for each label on the y-axis. The colours represent the prediction percentage per label and go from white (0%) to dark blue (100%). The columns sum up to 100% for each label.

Figure 14 contains the heatmaps showing the prediction percentages for each DA for each label. The most basic graph, the majority class baseline, contains only ‘statement’ predictions. There is no real pattern visible in the random class baseline, as the DAs are predicted at random from a uniform distribution.


The weighted random baseline shows how, for every label, the majority classes are predicted the most and the prediction percentages grow smaller towards the minority classes, as the weighted random baseline predicted DAs according to the DA distribution seen in Section 3.1.1, Figure 1.

The trigram model predicts more accurately than the weighted random model, as can be seen from the diagonal that is slightly visible in the graph, see Figure 14. The boxes on the diagonal of the matrices correspond to a correct label and prediction pair. The bigram model does not improve significantly on the weighted random model (p = 0.4984), as can be seen in their similar distributions, see Figure 14.

4.1.2 Neural Language Model with High-Level Input Features

Figure 15: Confusion matrices for the prediction performance of the model without sentence embeddings. The y-axis contains the labels and the x-axis contains the predictions for each label on the y-axis. The colours represent the prediction percentage per label and go from white (0%) to dark blue (100%). The columns sum up to 100% for each label.

The unweighted LSTM model without sentence embeddings, the old unweighted model, only ever predicted 8 out of 13 DA classes: 'backchannel acknowledge', 'backchannel question', 'general other question', 'repeat phrase', 'spanish', 'statement', 'wh-question' and 'yes answers', see Figure 15. As could be seen in Section 3.1.1, Figure 1, most of these belong to the most occurring DAs in the data set. The old unweighted model predicted the majority class 'statement' the most, regardless of what the DA label was. The second most predicted DA was 'general other question', the second most occurring DA in the data set. The unweighted model was also good at predicting the less occurring classes 'backchannel acknowledge' and 'backchannel question' correctly, relative to other less occurring classes. The old unweighted model will not be included in the rest of the results, because the model misses 5 of the 13 DA classes and is not complete in comparison with the other models, although its prediction performance was not significantly different from the prediction performance of the weighted model (p = 0.4643).

The weighted LSTM model without sentence embeddings, the old weighted model, showed more accurate results. As can be seen by the darker colours on the diagonal in Figure 15, the correct label and prediction pairings occur often relative to other label prediction pairings. There are some notable outliers; for example, 'yes answers' are mostly predicted when the prediction should have been 'no answers'. Figure 15 also shows that a 'response acknowledgement' is mostly mistaken for a 'backchannel acknowledge' when it is not predicted correctly. However, the second most predicted DA for 'backchannel acknowledge' is in turn 'response acknowledgement'. Another interesting thing to note is that where the unweighted model predicted 'general other question' the most for all labels after 'statement', the weighted model hardly ever predicts 'general other question'. 'statement' is still predicted for most of the labels a large part of the time, but it is no longer the most frequent prediction for any label other than 'statement' itself. 'yes/no question's are hardly ever predicted, regardless of the label.

4.1.2.1 Per Level

Figure 16: Confusion matrices for the prediction performance of the weighted model without sentence embeddings per level. The y-axis contains the labels and the x-axis contains the predictions for each label on the y-axis. The colours represent the prediction percentage per label and go from white (0%) to dark blue (100%). The columns sum up to 100% for each label.

Figure 16 shows the old weighted model's differences in prediction performance per level. The first and second level have similar prediction patterns, but they differ from the prediction patterns at levels 3 and 4, which in turn are more similar to each other. Note that the first level never predicts 'declarative yes/no question'. 'general other question' and 'yes/no question' are hardly ever predicted at any level.


For level one, the highest measured predictions for almost all the classes are on the diagonal. The most notable outliers, where the highest prediction is not on the diagonal, are the false prediction of 'response acknowledgement' as 'backchannel acknowledge' and the false prediction of 'yes/no question' as 'wh-question'.

The predictions of level 2 are similar to those of level 1, but level 2's predictions are less accurate, as the highest predictions of most classes are not on the diagonal, even though the prediction percentages of the correct predictions are high relative to the other predictions. Level 2 makes the same mistakes as mentioned above for level 1, but now the percentage of 'no answers' misclassified as 'yes answers' is even higher than at level 1. This trend is also visible at levels 3 and 4, where 'no answers' are misclassified as 'yes answers' most of the time.

Levels 3 and 4 have a bias towards 'yes answers', and 'no answers' are hardly ever predicted, although 'no answers' predictions do have a peak in their distribution for the correct label. At levels 3 and 4, the model has a larger bias towards the majority class 'statement' than at the lower levels, as shown by the dark column for 'statement' relative to the other columns. At level 3, a 'declarative yes/no question' is often mistaken for a 'response acknowledgement', and at level 4, 'signal non-understanding' is often mistaken for 'backchannel acknowledge'.

The prediction performance of the old unweighted model per level can be seen in Appendix C.2.


4.1.3 Neural Language Model with High-Level Input Features and Sentence Embeddings

4.1.3.1 Best Performing Sequence Lengths and Longest Sequence Length over All Levels

Figure 17: Confusion matrices of the model with sentence embeddings for the best performing and the longest sequence lengths over all the levels, and the average over all the sequence lengths and all the levels. The y-axis contains the labels and the x-axis contains the predictions for each label on the y-axis. The colours represent the prediction percentage per label and go from white (0%) to dark blue (100%). The columns sum up to 100% for each label.

The prediction heatmaps per sequence length for the model with sentence embeddings show less clear diagonals than the heatmap for the weighted model without sentence embeddings, see Figures 15 and 17 and Appendix C.1, Figure 29. The clearest diagonals are visible for sequence lengths 2 and 10, see Figure 17. All the sequence lengths seem to have a high prediction rate for both 'wh-question' and 'statement' for all the labels relative to other predictions, visualised by the darker coloured columns for those DAs. Although 'general other question' and 'spanish' are among the most occurring DAs in the data set, see Section 3.1.1, they are not predicted as much as the other main classes in the data set: 'statement' and 'wh-question'. The DAs 'statement', 'spanish', 'yes answers' and 'backchannel question' have the highest prediction score for the correct prediction given most sequence lengths. The prediction distributions for the other classes vary a lot.

Some of the patterns seen in the old weighted model can be seen in the model with sentence embeddings as well, see Figures 15 and 29. 'no answers' are often falsely predicted as 'yes answers' given any sequence length, and 'yes/no question's are often mistaken for 'wh-question's. In the model with sentence embeddings, though, the pattern where 'response acknowledgement' is falsely predicted as 'backchannel acknowledge' and vice versa is less clear.

The heatmaps are lightest in the upper right corners for all the sequence lengths. This means that minority classes are rarely the incorrect prediction for majority class labels.

Sequence length 20 has the greatest bias towards the majority class 'statement' of all the sequence lengths, see Figure 17 and Appendix C.1, Figure 29.

4.1.3.2 Per Level Averages over All Sequence Lengths

Figure 18: Average confusion matrix per level over all the sequence lengths. The y-axis contains the labels and the x-axis contains the predictions for each label on the y-axis. The colours represent the prediction percentage per label and go from white (0%) to dark blue (100%). The columns sum up to 100% for each label.

Figure 18 shows that the average prediction performance per level over all the sequence lengths varies between the levels. Level one does not have any predictions for the label 'repeat phrase'. Furthermore, the misclassification of 'declarative yes/no question' as a 'response acknowledgement' is a notable outlier. At level 1, the models trained on different sequence lengths are biased towards the prediction of majority classes.

At level 2, 'signal non-understanding' is only marginally predicted. The models for different sequence lengths have a bias towards predicting the majority classes, in particular 'wh-question' and 'yes answers'. It can be concluded that the error at level 2 is high for all the sequence lengths, as there is no clear diagonal visible in the figure.

Level 3 has a clearer diagonal than levels 1 and 2. However, there seems to be a strong bias towards the prediction of 'statement' and 'wh-question'.

At level 4, 'spanish' is only marginally predicted. There is also no clear diagonal. The patterns that are generally visible in the graphs for the different sequence lengths over all the levels, discussed in Section 4.1.3.1, are most clearly distinguishable in the average of level 4 over the sequence lengths.

4.2 Accuracy of Different Models

4.2.1 Average per Class F1-Score Model Comparison

Figure 19: Average per-class f1-score for each model. The 'Sequence length' models are the weighted text embedding LSTM models for different sequence lengths. The 'old' model is the weighted LSTM model without text embeddings.

Figure 19 and Appendix D, Table 8 show that the f1-score performance of the model with sentence embeddings is not significantly different between sequence lengths (p > 0.05). Neither is the model with sentence embeddings significantly different from the weighted model without sentence embeddings given any sequence length (p > 0.05). The f1-score performance of the baseline models is not significantly different among themselves either (p > 0.05), except for the random baseline, which is significantly worse than the trigram model baseline (p = 0.0223).

As the model with sentence embeddings performed best for sequence lengths 2 and 10 and worst for sequence length 20, the significance of their differences in f1-score performance compared to the baselines was investigated. The performances of the model with sequence lengths 3, 5, 7 and 15 all lie somewhere between these best and worst performances. The results show that the performance with sequence length 20 was only significantly better than the random class baseline (p = 0.0099). The performances with sequence lengths 2 and 10 were both significantly better than the majority class baseline (p2 = 0.0184; p10 = 0.0129), the random class baseline (p2 = 0.0015; p10 = 0.0004) and the weighted random baseline (p2 = 0.0455; p10 = 0.0313). None of the LSTM models with sentence embeddings was significantly better than either the bigram or the trigram model baseline (p > 0.05).

The comparison of the weighted model without sentence embeddings to the baselines shows that this model is also significantly better than the majority class (p = 0.007), random class (p = 0.0001) and weighted random class (p = 0.0159) baselines, see Appendix D Table 8. It is not significantly better than the bigram (p = 0.0658) and trigram model (p = 0.2099) baselines.

4.2.2 Per Class F1-Score per Model Comparison

Figure 20: F1-score per dialogue act for each model. The ‘Sequence length’ models are the weighted text embedding LSTM models for different sequence lengths. The ‘old’ model is the weighted LSTM model without text embeddings.

The f1-scores of the individual DA classes for each model are shown in Figure 20. There were no correct predictions for ‘declarative yes/no question’ from the model with sentence embeddings for any sequence length, nor from the majority class, weighted random class and n-gram baselines.
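For reference, a minimal sketch of how such per-class f1-scores can be computed; the label subset and predictions are hypothetical placeholders, not the thesis's actual data.

    # Per-class f1-scores as plotted in Figure 20; averaging them yields the
    # macro ("average per class") f1-score used for the model comparisons.
    from sklearn.metrics import f1_score

    labels = ["statement", "wh-question", "yes answers"]  # hypothetical subset
    y_true = ["statement", "wh-question", "yes answers", "statement"]
    y_pred = ["statement", "statement", "yes answers", "statement"]

    per_class_f1 = f1_score(y_true, y_pred, labels=labels, average=None)
    for label, score in zip(labels, per_class_f1):
        print(f"{label}: {score:.2f}")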

Looking only at the f1-scores for the model with sentence embeddings for different sequence lengths, some upward and downward trends are visible when the language ability level increases. The f1-scores for the classes ‘general other question’, ‘wh-question’, ‘backchannel acknowledge’, ‘backchannel question’, ‘response acknowledgement’ and ‘repeat phrase’ approximately increase as the level increases. However, the f1-scores for ‘spanish’, ‘signal non-understanding’ and ‘no answers’ decrease when the level increases.


4.2.3 Average per Class F1-Score per Level per Model Comparison

Figure 21: Average per class f1-score per level for each model. The ’Sequence length’ models are the weighted text embedding LSTM models for different sequence lengths. The ’old’ model is the weighted LSTM model without text embeddings.

Figure 21 shows that there is not a significant difference in prediction performance between different levels for the simple baselines: majority class, random class and weighted random class. There is also no significant difference in performance between levels for the weighted model without sentence embeddings.

The n-gram baselines and the model with sentence embeddings on different sequence lengths, however, show greater variation in performance between the levels. Therefore, the t-statistic, p-value and Cohen’s d were calculated for the n-gram models and the model with sentence embeddings. The results can be seen in Appendix D Table 9. They show that there is no significant difference between the prediction performances of different levels for any of the models with sentence embeddings (p > 0.05) nor for the n-gram model baselines (p > 0.05).
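A minimal sketch of these statistics, assuming an independent two-sample t-test over per-class f1-scores and a pooled-standard-deviation Cohen's d; the thesis's exact test configuration may differ, and the f1 arrays below are hypothetical.

    # t-statistic, p-value and Cohen's d for two models' per-class f1-scores.
    import numpy as np
    from scipy import stats

    f1_model_a = np.array([0.41, 0.12, 0.30, 0.05])  # hypothetical per-class f1-scores
    f1_model_b = np.array([0.35, 0.10, 0.22, 0.02])

    t_stat, p_value = stats.ttest_ind(f1_model_a, f1_model_b)

    # Cohen's d with a pooled standard deviation.
    n_a, n_b = len(f1_model_a), len(f1_model_b)
    pooled_sd = np.sqrt(((n_a - 1) * f1_model_a.var(ddof=1)
                         + (n_b - 1) * f1_model_b.var(ddof=1)) / (n_a + n_b - 2))
    cohens_d = (f1_model_a.mean() - f1_model_b.mean()) / pooled_sd

    print(f"t = {t_stat:.3f}, p = {p_value:.4f}, d = {cohens_d:.3f}")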


4.2.4 Per-Class Prediction Performance per Level for Sequence Lengths 2, 10 and 20

Figure 22: F1-score per dialogue act per level for the embedded text model for sequence length 2.

Figure 23: F1-score per dialogue act per level for the embedded text model for sequence length 10.


Figure 24: F1-score per dialogue act per level for the embedded text model for sequence length 20.

Figures 22, 23 and 24 show a great variation in prediction performance for most DA classes between the levels. For most DAs, there is at least one level at which the DA is never predicted correctly. The model has more correct predictions per DA class over all the levels for sequence lengths 2 and 10, see Figures 22 and 23, than for sequence length 20, see Figure 24. For sequence length 20 there are no correct predictions at any level for ‘signal non-understanding’, ‘yes/no question’, ‘no answers’ and ‘declarative yes/no question’, as could also be seen in Section 4.2.2 Figure 20.

Furthermore, Figures 22, 23 and 24 show that the prediction performances for the DAs are better at higher levels than at lower levels for all three sequence lengths, except for ‘spanish’ and ‘signal non-understanding’, which are less accurately predicted at higher levels, and ‘response acknowledgement’, which also shows a slight downward trend over the levels for sequence lengths 2 and 20. Level 2 has the lowest accuracy for almost all the DAs, which is emphasised by the missing bars at level 2 for many DAs for all three sequence lengths.

5 Analyses

5.1 Corpus

The imbalance of the data set visualised in Section 3.1.1 Figure 1, together with its small size of 118 dialogues with only 128 utterances on average, makes the data set suboptimal for training any machine learning algorithm. The overall low accuracy scores seen in Section 4 were, therefore, to be expected. Furthermore, there was an imbalance in the amount of data per level as well, with language ability level 2 having the most data and level 4 the least, as visualised in Section 3.1.1 Table 3 and Figure 2. This could influence the prediction performance for the individual levels. As level 4 had less data to train on than level 2, relatively lower accuracy scores for level 4 and relatively higher scores for level 2 were expected.


The bigram distributions of unique DA bigrams per language ability level, see Section 3.1.2 Figure 6, show a flat distribution with a small peak at the beginning for the most frequent bigrams. None of the on average 239.5 unique bigrams is seen more than approximately 8% of the time, while most bigrams are seen less than 1% of the time. This large number of unique bigrams, together with their flat distribution, means that many local decisions are made in the dialogue. Intuitively, this is not surprising because there is a wide variety of DAs that can follow, for example, a statement. When a statement is made, the interlocutor that was spoken to can decide to make a statement as well, ask a question, acknowledge the statement and so on. This is also shown in Section 3.1.2 Table 4, where ‘statement’ is the most frequent predecessor of ‘general other question’, ‘wh-question’, ‘backchannel acknowledge’, ‘yes/no question’, ‘response acknowledgement’ and ‘declarative yes/no question’ on every language ability level. This shows that there are many DA combinations to choose from and that the decision can be made based on the current DA in the dialogue alone. Therefore, there is most likely no consistent dialogue sequence structure present that can be learnt.

However, there is still a peak in the distribution: some DA bigrams occur between 1-8% of the time. Thus, some combinations are more common than others, see Section 3.1.2 Table 4 and the figures in Appendix A. Patterns were found for the most common DA pairings at all levels, see Section 3.1.2. This means that, although there are many local decisions and many possibilities, the choices made by the interlocutors are not fully random.
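A minimal sketch of how such a bigram distribution can be computed over a DA sequence; the example dialogue is hypothetical.

    # Relative frequencies of DA bigrams, as used to inspect the flatness of
    # the distributions in Section 3.1.2.
    from collections import Counter

    dialogue = ["wh-question", "statement", "backchannel acknowledge",
                "statement", "wh-question", "yes answers"]

    bigrams = Counter(zip(dialogue, dialogue[1:]))
    total = sum(bigrams.values())
    for (prev_da, next_da), count in bigrams.most_common():
        print(f"{prev_da} -> {next_da}: {100 * count / total:.1f}%")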

5.2 Error Analysis

5.2.1 Baselines

The low prediction percentages on the diagonals of the confusion matrices for the baselines in Section 4.1 Figure 14 showed that predicting the correct DA on this data set is not an easy task. This was already to be expected, see Section 5.1, because of the flat bigram distributions described in Section 3.1.2.

The random class baseline was chosen as a lower bound baseline, as it did not have any knowledge about the data set. Therefore, the LSTM models needed to significantly beat at least this baseline to prove they are better than a model that predicts classes at random. The majority class baseline was chosen because the data set was greatly imbalanced and the majority class baseline would, therefore, predict correctly 35% of the time, see Section 3.1.1. The majority class baseline misses predictions for every class except ‘statement’, and the random class baseline predicted uniformly regardless of the input and, therefore, rarely made correct predictions. These two baselines were not expected to result in high prediction performance and their prediction ratios for each label confirm this.
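A minimal sketch of the simple baselines, including the weighted random baseline discussed in the next paragraph; the training labels are illustrative placeholders.

    # Majority class, uniform random and weighted random baselines.
    import random
    from collections import Counter

    train_labels = ["statement", "statement", "wh-question", "yes answers"]
    counts = Counter(train_labels)
    classes = sorted(counts)

    def majority_baseline():
        return counts.most_common(1)[0][0]  # always the most frequent DA

    def random_baseline():
        return random.choice(classes)       # uniform over all DA classes

    def weighted_random_baseline():
        # sample DAs proportionally to their training frequency
        return random.choices(classes, weights=[counts[c] for c in classes], k=1)[0]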

Both the weighted random baseline and the n-gram model baselines were chosen because they had knowledge about the distribution of the classes and made predictions accordingly. To beat these baselines, the LSTM models needed to learn more about the data set than just the class distribution. The n-gram models also had sequential information and made predictions based on that knowledge, which should make them even harder to beat than the weighted random model. The n-gram models, however, did not perform significantly better than the weighted random model, see Section 4.2.1. The weighted random class baseline and the bigram model baseline performed similarly because both were biased towards the majority classes. This bias increases the number of correct guesses. Therefore, these models were expected to be better than the majority class and the random class baselines, and this was confirmed, although the difference was not significant, except for the trigram model, which was significantly better than the random class baseline, see Section 4.2.1. The trigram model performed slightly better than the bigram baseline and the weighted random baseline, although it was still greatly biased towards ‘statement’ and the difference between these models was not significant, see Section 4.2.1.
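A minimal sketch of a bigram model baseline of this kind; whether the thesis's implementation takes the argmax or samples from the conditional distribution is not specified here, and the training dialogues below are illustrative.

    # Predict the next DA as the one most often following the current DA.
    from collections import Counter, defaultdict

    train_dialogues = [
        ["wh-question", "statement", "backchannel acknowledge"],
        ["statement", "wh-question", "yes answers"],
    ]

    following = defaultdict(Counter)
    for dialogue in train_dialogues:
        for prev_da, next_da in zip(dialogue, dialogue[1:]):
            following[prev_da][next_da] += 1

    def predict_next(current_da, fallback="statement"):
        # back off to the majority class when the context DA was never seen
        if current_da not in following:
            return fallback
        return following[current_da].most_common(1)[0][0]

    print(predict_next("wh-question"))  # -> statement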

5.2.2 Model with High-Level Input Features

As can be seen in Section 4.1.2 Figure 15, the unweighted model without sentence embeddings was so biased towards the majority classes that it never predicted some minority classes. This was to be expected for an unweighted model that trains on a greatly unbalanced data set, but it also meant the model was not fit for later comparisons to other models regarding minority class predictions. Therefore, only the weighted LSTMs were considered for the inter-model comparison.

The confusion matrix for the weighted model without sentence embeddings shows that it was no longer biased towards the majority classes, see Section 4.1.2 Figure 15. Surprisingly, the model rarely predicted ‘general other question’ even though it is the second most frequent class in the data set. Due to the weighting, the model is less biased towards the majority classes, so this says something about the predictability of ‘general other question’. The confusion matrix seems to imply that other types of questions, like ‘wh-question’ and ‘backchannel question’, are easier to predict than ‘general other question’. This could be because other questions are asked in more specific situations. Another class that is seldom predicted is ‘yes/no question’. This is another example of how some questions appear to be harder to predict than others.

Interestingly, the model often mistakes an answer DA for another answer DA, an acknowledgement DA for another acknowledgement DA and a question DA for another question DA. This means that the model does learn some higher-level categories like ‘questions’, ‘acknowledgements’ and ‘answers’, although it might get the subcategory prediction wrong. Therefore, there might be more higher-level patterns in dialogue, for example, that answers are always followed by questions, even when it is hard to predict the specific kind of answer in that situation. This could also mean that the classes are too detailed to find a structure, but that with higher-level classes the structures in the dialogue might become more predictable.

This grouping of higher-level classes is also visible per level, see Section 4.1.2.1 Figure 16. Furthermore, the figure showed a difference in patterns between the lower levels 1 and 2 and the higher levels 3 and 4, while the levels within each group were very similar in their patterns. This could mean there is a change in dialogue happening between the second and third level, which might indicate a turning point in the gradual language ability of the student. For example, at level 3 ‘signal non-understanding’ is suddenly used most by the tutor, in contrast to its use by the student at lower levels, see Section 3.1.2 Table 4. This kind of change could indicate a change in the teaching strategies of the tutor from level 3 onward.

5.2.3 Model with High-Level Input Features and Sentence Embeddings

The confusion matrices for the models with sentence embeddings with different sequence lengths, and the average over all the sequence lengths, showed that the prediction performance was slightly less accurate for the model with sentence embeddings than for the model without sentence embeddings, see Section 4.1.3.1 Figure 17 and Section 4.1.2 Figure 15. The incorrect prediction of detailed classes belonging to the same higher-level category, which was visible in the model without sentence embeddings, was again visible in the model with sentence embeddings. This means that regardless of the sequence length used, the higher-level abstractions of the individual classes are learnt to some degree.

There were slight variations in the prediction performance of the model for different sequence lengths, with sequence lengths 2 and 10 showing the most accurate performance and sequence length 20 the least accurate. These observations are confirmed in Section 5.3. Furthermore, sequence length 20, the longest sequence length, made the model more biased towards the majority class ‘statement’ than models using smaller sequence lengths. One notable difference between the model without sentence embeddings and the model with sentence embeddings is that the latter mistakes majority classes for minority classes less often. This could mean that the textual information of the utterances helps the model recognise when the prediction of a minority class is very unlikely.

The average confusion matrices per language ability level over all the sequence lengths showed a great variation in prediction patterns, see Section 4.1.3.2 Figure 18. The patterns of levels 2 and 4 indicated the lowest prediction accuracy compared to the other levels. There was no clear difference between the lower and higher levels as there was for the model without sentence embeddings, see Section 5.2.2.

5.3 Results Analysis

5.3.1 The Performance of Neural Language Models Compared to the Baseline Performance

The average per-class f1-scores for all the models showed that the data set was too small and too imbalanced to obtain high accuracies. This was already to be expected, see Section 5.1. The results showed that there is no significant difference in f1-score performance between the baselines, except between the random class and trigram model baselines, see Section 4.2.1. Neither is there any significant difference between the model with and without sentence embeddings, nor between the model with sentence embeddings for different sequence lengths, see Section 4.2.1. The results of Section 4.2.1 show that the model without sentence embeddings is significantly better than all the simple baselines, but not than the n-gram model baselines. A comparison of the baselines with the model with sentence embeddings showed that the model was better than the simple baselines for most sequence lengths, but again not than the n-gram model baselines.
