
Identifying (dis)agreement in dialogue

Jerke Jonas van den Berg 11034106

Bachelor thesis
Credits: 18 EC

Bachelor's programme Artificial Intelligence
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisor
Dr. Julian Schlöder
Institute for Logic, Language and Computation
University of Amsterdam
Science Park 107
1098 XG Amsterdam


Contents

Abstract
1 Introduction
2 Related work
3 The original model
   3.1 Parsing the AMI corpus
   3.2 Features
   3.3 Logistic regression
4 Data preprocessing
   4.1 Downsampling and upsampling
   4.2 Principal Component Analysis
5 TextBlob as a sentiment feature
6 Results
   6.1 Feature analysis
   6.2 Feature reduction
7 Testing on the ABCD corpus
   7.1 Parsing
   7.2 Results
8 Discussion


Abstract

This paper replicates the model described in "The role of polarity in inferring acceptance and rejection in dialogue" by Schlöder and Fernández (2014) to explore the use of logistic regression, as well as to reduce the number of features needed to produce a similar F1 score on (dis)agreement detection in spoken dialogue. To achieve results similar to the original Bernoulli naive Bayes model, downsampling acceptances in an imbalanced data set like the AMI meeting corpus proves to be crucial for the effectiveness of logistic regression. This paper reduces the set of features from the original model by Schlöder and Fernández (2014) to positive-negative polarity, negative-positive parallelism, checking for "But" and "Yeah but" in the response, acceptance cues and response length, while barely losing performance measured in F1 score (60.77 compared to 60.86). When applied to the ABCD corpus, this new model still performs well compared to the work by Rosenthal and McKeown (2015), achieving an F1 score of 77 on disagreements and 27 on agreements. Sentiment features in the form of TextBlob did not prove to be a useful addition to the model.

1 Introduction

When having a discussion or debate, it can be hard to keep track of what the participants have established as 'common ground', especially when conversations are long or spread out over multiple sessions. During dialogue, a conversation can steer into different directions and capture multiple agreements and disagreements in the process. It would therefore be helpful to have a way of identifying these agreements and disagreements, in order to steer a discussion in the right direction and keep it relevant to what the participants want to talk about.

At first glance, the task of identifying agreement and disagreement in dialogue might seem quite simple: look for obvious cues such as "yes" or "no" in the response to a proposal. But as with many aspects of natural language, there are many irregularities that will not be captured by simply looking for words that signal (dis)agreement. For example, consider the next two proposal-response (P-R) pairs:

1. P) “I think that’s a bad idea.” R) “It’s good.”

2. P) “We should do that.” R) “We should.”

In the first pair, it is clear that the response disagrees with the proposal. This is not clear when looking at the response on its own; it does not contain a "yes" or "no" or anything similar, nor does the response talk about agreeing with anything. Only when we include the proposal does it become clear that the response disagrees, because the proposal calls something "bad" while the response says it is good. The second proposal-response pair is similar in that the agreement or disagreement is quite implicit. In this case, however, agreement is expressed by repeating a part of the proposal to acknowledge what was said.

From these two examples of rejection and acceptance of an utterance, respectively, two things can be concluded. Whether a response is an acceptance or a rejection is relative to the proposal, and there is no fixed list of words that identifies a response as agreeing or disagreeing, since responses can rely on implicit strategies of conveying (dis)agreement, as seen in pair 2.

This paper focuses on capturing these implicit indications of acceptance and rejection in computable features to use in a machine learning model, and determines which of these features are the most decisive for agreement and disagreement identification. The data used to test these features is annotated spoken dialogue extracted from the AMI meeting corpus, in which dialogue between people has been annotated to indicate acceptances and rejections of other participants' proposals (McCowan et al., 2005).

The model described in the paper "The role of polarity in inferring acceptance and rejection in dialogue" by Schlöder and Fernández (2014), explained in section 3, has been used as a starting point. Their model is expanded here with data preprocessing in the form of Principal Component Analysis (PCA), downsampling and upsampling, which are described in section 4. Due to a ratio of acceptances to rejections of over 10:1, balancing out the data set by downsampling acceptances significantly improves the effectiveness of logistic regression. In section 5, a new sentiment feature in the form of TextBlob is explored, which does not significantly improve the model when analysed using a paired t-test (p>0.05). Finally, this paper concludes that the same F1 score of 61 reported by Schlöder and Fernández (2014) can be achieved while reducing the number of required features from 15 to 6.

2 Related work

For the reasons mentioned in section 1 on why agreement and disagreement identification is hard to automate, recent work in this field focuses on machine learning for this task (Rosenthal & McKeown, 2015; Schlöder & Fernández, 2014; W. Wang, Yaman, Precoda, Richey, & Raymond, 2011). These articles focus on features extracted from a proposal (or utterance) and its response, and on a fitting classification algorithm to apply to those features. The approaches vary considerably in the data used, the features extracted, and the classification algorithms applied to them.

W. Wang, Yaman, Precoda, Richey, and Raymond (2011) experimented with conditional random fields for classification on a corpus of annotated broadcast conversations collected from the DARPA GALE program. Their model includes lexical, structural, durational, and prosodic features; the prosodic features consist of pauses, duration, and speech rate.


Since their data was highly imbalanced, they used downsampling to improve the F-score of the model. This approach resulted in an F1 score of 62 for agreement and 56 for disagreement. What is interesting is that they also compare automatic annotation of their data to manual annotation done by humans. Annotators were hired to provide manual annotations of (dis)agreement to use for supervised training of the models. This showed that manual annotation is significantly better for disagreement, giving an F1 score of 50 compared to 47. For agreement, the difference was quite small; both round out to an F1 score of 57. Their results show that prosodic features can improve the model slightly, resulting in an increase of 0.6 in F1 score on average between agreement and disagreement detection. The most significant increase to their results came from using downsampling to balance out the class distribution, which increased the F1 score of both agreement and disagreement detection by 5.

Schlöder and Fernández (2014) researched the effect of polarity on identifying agreement and disagreement in the Augmented Multi-party Interaction (AMI) meeting corpus, a corpus of spoken dialogue collected in meetings that contains annotated responses, of which 697 are rejections and 7405 are acceptances of an utterance (McCowan et al., 2005). This research indicates that relative polarity, whether an utterance is positive or negative, plays a significant role in distinguishing acceptances from rejections. For example, "Yes it is." as a response is not always an acceptance:

1. A: But it’s uh yeah it’s uh original idea. B: Yes it is.

2. A: the shape of a banana is not it’s not really handy. B: Yes it is.

(Schlöder & Fernández, 2014, p. 151)

These two examples are taken from the AMI corpus. It is clear that pair 1 shows agreement, while pair 2 shows disagreement. This is because response 2B replies with yes to a proposal with negative polarity, while response 1B responds to a proposal with positive polarity.

The model has been implemented using a naive Bayes algorithm, as it outperformed Random Forests and a Support Vector Machine approach. Their model used local features like length, certain acceptance and rejection indicator words like yeah and actually, and relative and local polarity. What is remarkable about the results of Schlöder and Fernández (2014) is that including local polarity, which refers to the individual polarity of the proposal and the response, reduces the F1 score, while relative polarity significantly improves it. Relative polarity consists of four features that represent all possible combinations of the proposal and response polarity: positive-positive, positive-negative, negative-positive and negative-negative. Their model achieved an F1 score of 61 on the task of separating disagreements from agreements.

Rosenthal and McKeown (2015) researched agreement and disagreement classification on corpora extracted from online forums such as the Agreement by Create Debaters (ABCD) corpus. This corpus differs from the AMI corpus in that it contains written dialogue, extracted from the Create Debate website, instead of spoken dialogue converted into text, and in its size of 38195 acceptances and 60991 rejections. This data set is also far less imbalanced than the AMI corpus. They included a third class called none, which covers "neutral" responses in which the respondent neither agrees nor disagrees. Rosenthal and McKeown (2015) explored the influence of various features, including sentence similarity between statements and responses. Contrary to the theoretical motivation of Schlöder and Fernández (2014) to use polarity as a feature, their results showed that including polarity as a feature in their model slightly reduced performance, measured in F1 score. It also showed that for their data, logistic regression outperformed naive Bayes. On the ABCD corpus, their model achieved an F1 score of 26 for agreement and 57 for disagreement. Their none class performed best with a score of 87.

A corpus used by both Rosenthal and McKeown (2015) and Misra and Walker (2017) is the Internet Argument Corpus (IAC). This corpus is similar to the ABCD corpus in that it is extracted from internet forums. Misra and Walker (2017) used this data to research the effect of using ngrams for both agreements and disagreements, as well as hedges, which are ways to be deliberately vague on a subject, such as "perhaps", "essentially", and "I mean". Misra and Walker (2017) explored both random forests and J48 trees to fit their model. Their best performing model used all theoretically motivated features, which consisted of ngrams indicative of both agreement and disagreement, cue words, hedges, duration, polarity and punctuation. Unfortunately, Misra and Walker (2017) did not report the F1 score achieved by the model that included all the features. J48 consistently outperformed random forests in this research, achieving an accuracy of 66.

Similarly to the research by Rosenthal and McKeown (2015) and Misra and Walker (2017), Allen, Carenini, and Ng (2014) use a corpus consisting of online forum posts, this time from Slashdot.org. Their paper focuses on detecting disagreement, using Rhetorical Relation and Fragment Quotation Graphs among other more common features such as sentence length and punctuation. Their best model achieved an F1 score of 76 on detecting disagreement, and included all their basic features like bias words and punctuation as well as structural and sentiment-based rhetorical features.

Another interesting paper on this topic using annotated data retrieved from forums is by L. Wang and Cardie (2016), who deal with features such as sentiment using a Conditional Random Field. Just like the paper by Rosenthal and McKeown (2015), their model included a third class, which they called neutral. Models that include a third class are difficult to compare to papers in which only agreements and disagreements are taken into account, as adding a third class makes it harder to achieve a high F1 score. The model created by L. Wang and Cardie (2016) included a measure called 'Soft F1', which scores more leniently: for instances whose (dis)agreement label was adopted from the turn-level annotation rather than annotated directly, a prediction of neutral is also treated as a true positive. In soft F1, the model scored 74 on agreement, 67 on disagreement and 91 on neutral.

Features that recur in the majority of these papers are punctuation (whether a response contains exclamation points or question marks) and the length of the response. Length in particular appears to be a useful feature, as there is a strong correlation between a long response and disagreement (Brown & Levinson, 1987). Unfortunately, punctuation is less present in data consisting of spoken dialogue; when respondents type out their response on online forums, they might emphasise their point with multiple exclamation points or words written in all capital letters. This is not the case in data like the AMI meeting corpus.

3 The original model

As mentioned in the introduction, this paper focuses on the AMI meeting corpus for model training and testing. Therefore, the model described in the paper by Schlöder and Fernández (2014) has been used as a starting point. It allows for testing the effectiveness of preprocessing the data and implementing the features in a logistic regression model, while offering a solid baseline to compare to. The F1 score achieved with this model in this paper is 60.83, close to the F1 score of 60.96 reported by Schlöder and Fernández (2014). The difference in score can be explained by a different seed used to randomise the order of the data set. The F1 scores reported both here and in the paper by Schlöder and Fernández (2014) are all on identifying rejections unless mentioned otherwise, as the data highly favours acceptances over rejections, with 6672 acceptances and 650 rejections. Because of this data imbalance, improving the F1 score for rejections is the more challenging task, which is the focus of this paper. The F1 score on acceptances is 95.84.

3.1 Parsing the AMI corpus

The parsing of the data has been kept the same as in the original model. Each entry in the AMI corpus is annotated with the speaker (speaker A, B, C or D), sentence tags that describe the type of sentence the entry contains, the timestamp of the sentence and the timestamp of the sentence it refers to (if it is a response). Sentences tagged as assessments are included in the data set, unless they are tagged as Elicit Inform or Elicit Assessment, because these are questions. Questions are not suitable for training or testing the model, as a response to a question does not indicate agreement or disagreement with a proposal. Responses to earlier utterances are annotated with POS to indicate agreement with the utterance they respond to, and with NEG for disagreement. The data has been randomly shuffled to create an even data representation between the training and test sets. Every test described in this paper using this model on the AMI corpus has been performed using 10-fold cross-validation, as the data set only consists of 7322 total entries.
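As a concrete illustration, the sketch below shows how such a 10-fold evaluation could be set up with Scikit-learn. It assumes X is a NumPy feature matrix built as described in section 3.2 and y holds 1 for rejections (NEG) and 0 for acceptances (POS); the helper name rejection_f1 is illustrative and not part of the original code.

import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import KFold
from sklearn.naive_bayes import BernoulliNB

def rejection_f1(X, y, n_splits=10, seed=42):
    # Shuffle the entries, then average the rejection-class F1 over 10 folds.
    scores = []
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        model = BernoulliNB().fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[test_idx])
        # pos_label=1 scores the rejection class, as reported in this paper.
        scores.append(f1_score(y[test_idx], predictions, pos_label=1))
    return float(np.mean(scores))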


Figure 1: All the features used from the original model (Schlöder & Fernández, 2014, pp. 154-155, 157)

3.2 Features

The features used in this model are the same as those described by Schlöder and Fernández (2014); see figure 1 for an overview. The local polarity features have been left out, as they drastically decrease the F1 score, from 60.83 to 52.73. All the features are Boolean in this initial model, as it is designed for a Bernoulli naive Bayes classifier, which only accepts true or false as feature values. For this reason, the feature that captures the length of the response has been divided into three features: one for a length greater than 2, one for a length greater than 12 and one for a length greater than 24. Furthermore, the model includes features that hold true if the response contains indicator words for acceptance or rejection like yeah or but, four relative polarity features that capture the combination of the proposal and response polarity (positive-positive, positive-negative, negative-positive and negative-negative) and a relative parallelism feature. Parallelism refers to a grammatical pattern shared between the proposal and the response. For example, if the proposal is "I think this is good." and the response is "I think it isn't.", this would indicate rejection when analysed through parallelism.
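To make this concrete, the following sketch shows how such Boolean features could be extracted. The cue-word lists and the crude polarity test are illustrative stand-ins, not the exact implementation of Schlöder and Fernández (2014).

# Illustrative subsets of the cue-word lists; the real lists are larger.
ACCEPT_CUES = {"yeah", "okay", "accept", "right"}
REJECT_CUES = {"actually", "but", "although", "no"}
NEGATIVE_WORDS = {"no", "not", "never", "bad"}

def polarity(utterance):
    # Crude stand-in: an utterance counts as negative if it contains a negative word.
    tokens = utterance.lower().split()
    return "neg" if any(t in NEGATIVE_WORDS for t in tokens) else "pos"

def extract_features(proposal, response):
    tokens = response.lower().split()
    rel = (polarity(proposal), polarity(response))
    return {
        "len>2": len(tokens) > 2,
        "len>12": len(tokens) > 12,
        "len>24": len(tokens) > 24,
        "accept_cue": any(t in ACCEPT_CUES for t in tokens),
        "reject_cue": any(t in REJECT_CUES for t in tokens),
        "rel_pos_pos": rel == ("pos", "pos"),
        "rel_pos_neg": rel == ("pos", "neg"),
        "rel_neg_pos": rel == ("neg", "pos"),
        "rel_neg_neg": rel == ("neg", "neg"),
    }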


3.3 Logistic regression

The naive Bayes training algorithm relies on conditional independence of the features used. This is not the case for the features described in this model; of the four features capturing relative polarity, for example, exactly one is true and the rest are false for any given entry of the data set. Ng and Jordan (2002) argue that as the size of the training set approaches infinity, logistic regression will consistently outperform naive Bayes. Though the AMI corpus is far from infinite, it is useful to investigate this training algorithm and whether it can outperform Bernoulli naive Bayes in this case. Furthermore, moving away from Bernoulli naive Bayes allows for more dynamic features that do not have to be Boolean. This enables the use of features on continuous scales and the use of PCA, which relies on data normalised to a mean of 0.
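A minimal sketch of the logistic regression variant is given below, again assuming the feature matrix X and labels y from section 3.1 (rejections labelled 1); scoring="f1" then reports F1 on the rejection class.

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def logistic_regression_f1(X, y):
    # Same 10-fold evaluation as before, with the classifier swapped out.
    model = LogisticRegression(max_iter=1000)
    return cross_val_score(model, X, y, cv=10, scoring="f1").mean()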

4 Data preprocessing

Running logistic regression on the model without any preprocessing results in drastically lower performance than Bernoulli naive Bayes: an F1 score of 48.93 compared to 60.83. However, as earlier work on this topic by W. Wang, Yaman, Precoda, Richey, and Raymond (2011) shows, downsampling or upsampling the data to correct for class imbalance can have a significant impact on performance. In the case of the AMI corpus, the imbalance is considerable, as the proportion of acceptances to rejections is over 10:1. In the next section, both removing acceptances (downsampling) and duplicating rejections (upsampling) are documented.

4.1 Downsampling and upsampling

Figure 2 shows that logistic regression benefits greatly from balancing out the training set through downsampling. After removing 65% of the acceptance entries, the score for logistic regression goes up from 48.93 to 61.25 (tested with 10-fold cross-validation). The naive Bayes model improves slightly over the first 25%, but only from 60.82 to 61.06. A paired t-test shows this improvement is insignificant (p>0.05). The same holds for the 65% downsampled logistic regression model compared to the original naive Bayes model (p>0.05).

Upsampling the rejections leads to almost the same score, resulting in an F1 score of 60.83 for the logistic regression model. This score is achieved at a duplication of 180% of the rejection entries, which means that on average a rejection entry is trained 2.8 times. Both upsampling and downsampling indicate that the optimal ratio of acceptances to rejections is around 3:1. This ratio keeps the accuracy achieved relatively high while drastically increasing the recall of rejections during testing.
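The sketch below illustrates how acceptances could be removed from the training data; the 65% removal rate is the value reported above, and the entry format (a dictionary with a "label" key) is purely illustrative.

import random

def downsample_acceptances(train_entries, fraction_removed=0.65, seed=42):
    # Keep all rejections, keep only a fraction of the acceptances.
    rng = random.Random(seed)
    rejections = [e for e in train_entries if e["label"] == "NEG"]
    acceptances = [e for e in train_entries if e["label"] == "POS"]
    n_keep = int(len(acceptances) * (1 - fraction_removed))
    balanced = rejections + rng.sample(acceptances, n_keep)
    rng.shuffle(balanced)
    return balanced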


Figure 2: The F1 scores per percentage of upsampling and downsampling the training sets, tested in 10-fold cross-validation

4.2 Principal Component Analysis

Initially, the goal of using PCA was to inspect the covariance structure it exposes to gain insight into which features are the most important for training the model. This proved difficult to decipher, however, as each dimension in a PCA-converted data set depends on multiple features at once. Instead, the built-in attribute from the Scikit-learn library that exposes the coefficient of each feature has been used for this; for these results, refer to section 6.

While experimenting with applying PCA to the model, another use for PCA was found. Because it reduces the dimensionality of the feature set while combining multiple features into one dimension, it allows correlated features to merge. Using this to reduce noise before the data is actually used for training could be useful.


Figure 3: The F1 scores per number of PCA dimensions used, tested in 10-fold cross-validation

Applying PCA to the 65% downsampled logistic regression model results in the graph shown in figure 3, where the x-axis represents the number of dimensions used in the PCA-converted data. Only logistic regression is shown in this case, as PCA calls for normalised data, which does not work well with a Bernoulli naive Bayes algorithm: it interprets any value greater than 0 as 1 and any value at or below 0 as 0. The figure shows that reducing the number of dimensions by 2 still results in roughly the same F1 score. In fact, at 13 dimensions, the average F1 score over 10-fold cross-validation goes up from 61.25 to 61.51. Unfortunately, this increase again proves to be insignificant (p>0.05). However, the graph does suggest that the model could work with fewer features, and that some of the features used earlier might be either insignificant or combinable with other features in the model.
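A minimal sketch of this set-up is shown below, assuming X and y are the (downsampled) feature matrix and labels; in the actual experiments the downsampling is applied to the training data only, which this simplified sketch glosses over.

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def pca_logistic_f1(X, y, n_components=13):
    # StandardScaler centres the features, which PCA requires; 13 components
    # is the setting that scored best in figure 3.
    model = make_pipeline(StandardScaler(),
                          PCA(n_components=n_components),
                          LogisticRegression(max_iter=1000))
    return cross_val_score(model, X, y, cv=10, scoring="f1").mean()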

5 TextBlob as a sentiment feature

A feature that could be a valuable addition to a polarity-based model like this one is sentiment. Sentiment expresses the view or opinion an individual conveys with their utterance; words like "good", "amazing" and "cool" carry positive sentiment, while words like "bad", "terrible" and "gross" carry negative sentiment.

TextBlob is a Python library that allows the user to calculate the sentiment of a sentence. Its polarity value is a float between -1 and 1, where -1 is completely negative sentiment and 1 completely positive. This is useful for agreement and disagreement detection, as it allows the sentiment of the proposal and the response to be compared.

>>> TextBlob("I believe that’s a bad idea.").sentiment.polarity -0.70

>>> TextBlob("I think that’s actually great.").sentiment.polarity 0.8

These two calls show how TextBlob can contribute to agreement and disagreement detection: the proposal, "I believe that's a bad idea.", returns a negative sentiment of -0.7, while the response, "I think that's actually great.", returns 0.8. This contrast between the proposal and the response shows that they do not agree, which turns out to be true.

This example shows why sentiment matters for (dis)agreement detection in much the same way as polarity: a very positive response, like "I think that's actually great.", can still be part of a disagreeing pair because the proposal was negative, in this case "I believe that's a bad idea."

To test this in the existing model, three features have been added. The first two are the sentiment of the proposal and the sentiment of the response; if these values lie between -0.1 and 0.1, they are set to zero to filter out near-neutral utterances. The third aims to capture the relative sentiment between the two, as follows:

1. If the sentiment of both the response and the proposal is not equal to zero, multiply them by each other.

2. If this is not the case, take the sentiment value that is not zero.

3. If both are zero, the feature takes on the value of zero.

Applied to the TextBlob example mentioned earlier, this feature would take on a value of -0.56, which points to disagreement. The same value would come out if the sentiment values were reversed, for example in a proposal-response pair with a positive proposal and a negative response. If either the proposal or the response returns a sentiment of zero, this indicates a more neutral sentence; multiplying in that case would always yield zero, so it is better to take the utterance that does return a sentiment.
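A minimal sketch of these three features is given below; the neutral band of (-0.1, 0.1) and the combination rules follow the description above, while the function names are illustrative.

from textblob import TextBlob

def sentiment(utterance, neutral_band=0.1):
    # Values inside the neutral band are zeroed out, as described above.
    score = TextBlob(utterance).sentiment.polarity
    return 0.0 if abs(score) < neutral_band else score

def textblob_features(proposal, response):
    p, r = sentiment(proposal), sentiment(response)
    if p != 0.0 and r != 0.0:
        relative = p * r            # e.g. -0.7 * 0.8 = -0.56, a disagreement cue
    elif p != 0.0 or r != 0.0:
        relative = p if p != 0.0 else r
    else:
        relative = 0.0
    return {"proposal_sentiment": p,
            "response_sentiment": r,
            "relative_sentiment": relative}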

Testing this did not result in any significant improvement (p>0.05): any combination of the proposal, response and relative sentiment features extracted with TextBlob caused the F1 score to fluctuate around 61. Using PCA brought the score back up to 61.5, which suggests that these features do not add value to the model, since the score increases when the dimensionality is reduced again.

The reason for this might be that the existing relative polarity features already capture the correlation between polarity and (dis)agreement quite well. Adding to this conjecture, the original polarity features not only look at the general sentiment of an utterance, but also at words that indicate agreement or disagreement without changing the sentiment of the sentence. For example, running TextBlob on a word like "never" returns a sentiment of 0, while the original relative polarity features treat "never" as a negative utterance.


Feature                                              Coefficient
Negative-positive pattern (using parallelism)        2.02
Positive-negative relative polarity                  1.68
Acceptance cues (yeah, okay, accept, ...)            1.66
Response contains the bigram "Yeah but"              1.58
Response sentiment (TextBlob)                        1.47
Rejection cues (actually, but, although, ...)        1.42
Length >24 (of the response)                         1.41
Response contains "Yes"                              1.03
Response contains the bigram "No no"                 0.98
Proposal sentiment * response sentiment (TextBlob)   0.96
Length >12 (of the response)                         0.94
Response contains "But"                              0.92
Positive-positive relative polarity                  0.92
Length >2 (of the response)                          0.82
Negative-negative relative polarity                  0.60
Proposal sentiment (TextBlob)                        0.28
Negative-positive relative polarity                  0.15
Hedges                                               0.15

Figure 4: The coefficients of every feature in the model, trained with logistic regression

6 Results

6.1 Feature analysis

To analyse how each feature is used in the model, including the TextBlob features described in the previous section, the Scikit-learn attribute that exposes the coefficients of the trained model has been used to produce figure 4. This ranks every feature by how strongly the model relies on it to determine agreement or disagreement, from most to least. An interesting observation is that the positive-negative relative polarity feature scores very well at 1.68, while the negative-positive relative polarity feature has one of the lowest scores. Negative-negative polarity patterns also receive a low coefficient, which suggests that entries containing negative proposals are harder to classify as agreement or disagreement regardless of the response: a response can agree with a negative proposal by being positive, or it can repeat what the proposal said and be negative. To illustrate this, here are two examples:

1. P) "That is a horrible idea." R) "I agree."

2. P) "That is a horrible idea." R) "It is actually terrible."

Both pairs in this example express agreement between the proposal and the response. However, in the current model, pair 1 would result in a negative-positive pattern in the relative polarity feature, while pair 2 would result in a negative-negative pattern. For pairs that contain a positive proposal, it is much less likely to find a positive-negative pattern that agrees or a positive-positive pattern that disagrees. Another remarkable coefficient is that of the response sentiment extracted using TextBlob; although this feature did not improve the model (p>0.05), it does receive a high coefficient when included.
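As an illustration, a ranking like the one in figure 4 can be produced roughly as in the sketch below, assuming logreg is the fitted Scikit-learn LogisticRegression model and feature_names lists the feature columns in training order; taking the absolute value of the weights is an assumption about how the table was compiled.

import numpy as np

def rank_features(logreg, feature_names):
    # Magnitude of each learned weight, sorted from most to least influential.
    weights = np.abs(logreg.coef_[0])
    return sorted(zip(feature_names, weights), key=lambda pair: pair[1], reverse=True)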

6.2 Feature reduction

To test the use of the best scoring features, only the six highest scoring features based on the coefficients in figure 4 have been used (excluding TextBlob) to retrain the model and see whether an F1 score similar to that of the original logistic regression model (61) can still be achieved. The features included in this subset are parallelism, positive-negative relative polarity, acceptance cues, the "Yeah but" bigram, whether the response contains "But", and the three length features combined into one feature (1 if length >2, 4 if length >12, 7 if length >24). This last feature was tested with multiple scaling steps, and this scale performed best in F1 score. Though the "But" feature scores far below the top six features based on coefficients, it has been included in this set because it significantly improves the F1 score (p<0.05).
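The combined length feature can be computed as in the sketch below; the step values 1, 4 and 7 are the ones reported above, and counting tokens by whitespace splitting is an assumption.

def length_feature(response):
    # Map the three Boolean length thresholds onto a single scaled value.
    n_tokens = len(response.split())
    if n_tokens > 24:
        return 7
    if n_tokens > 12:
        return 4
    if n_tokens > 2:
        return 1
    return 0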

This model results in an F1 score of 60.77. Compared to the original (65% downsampled) logistic regression model, this slight decrease is insignificant (p>0.05). What is remarkable about the acceptance and rejection cues is that the acceptance cues feature significantly increases the F1 score of the model (p<0.05) while the rejection cues feature does not (p>0.05), even though both are among the six most relevant features based on the coefficients shown in figure 4. Out of the four relative polarity combination features, positive-negative drastically increases the F1 score, from 38.85 to 60.77. Adding any other relative polarity feature to it results in an insignificant increase (p>0.05); in fact, adding all four relative polarity features yields an insignificant increase compared to the positive-negative polarity feature alone (p>0.05). Removing either the "But" or "Yeah but" feature results in a significant decrease (p<0.05).

An interesting result for the length feature is that it does not significantly increase the performance of the model (p>0.05), even though the average F1 score goes up from 59.94 to 60.77 when it is included. Even adding the original three separate features for responses longer than 2, 12, and 24 tokens does not significantly increase performance compared to leaving all length features out (p>0.05). In previous work (see section 2) this feature is usually included, because longer responses tend to be negative: respondents often feel the need to argue why they disagree with a proposal.

Analysing this new model with PCA shows that reducing the dimensions of this feature set only significantly decreases the model's F1 score at 2 dimensions, which indicates that there is still room for feature reduction, at least from these 6 features to 3. There would most likely be a way of combining the "But" feature with the "Yeah but" bigram, at least when using logistic regression, as it allows for more values than just the 0 or 1 required by the Bernoulli naive Bayes model.

Feature                                          Coefficient
Positive-negative relative polarity              4.97
Negative-positive pattern (using parallelism)    4.94
Response contains the bigram "Yeah but"          4.14
Response contains "But"                          3.90
Acceptance cues (yeah, okay, accept, ...)        3.33
Response length                                  0.89

Figure 5: The coefficients of every feature in the subset of most important features (tested on the AMI corpus)

As expected, the new subset has much higher coefficients per feature than the original model (shown in figure 5), as it achieves a very similar F1 score with significantly fewer features. The earlier observations about the length feature are also in line with these coefficients, as its coefficient is drastically lower than those of all the other features in this subset. This model still performs well in identifying acceptances, with an F1 score of 96.11.

7 Testing on the ABCD corpus

To test whether this new subset of features is not simply tailored to the AMI corpus, the model has been tested on the Agreement by Create Debaters (ABCD) corpus using the same set of features described in figure 5. This corpus is quite different from the AMI corpus in that it contains written online dialogue between users of a forum, where a user poses a question and other users debate it in the comments, agreeing and disagreeing with each other.

7.1 Parsing

As each forum thread starts with a question, the original post is not used as a proposal. Instead, each comment that has received a reply choosing a side in the debate is used, together with that reply, as an entry in the data set. For example, if a user asks the question "Is pizza healthy?", the thread could contain a comment annotated with the side "no", to which another user replies picking the side "yes"; this pair of comments is then interpreted as a proposal-response pair. This results in 96154 rejections and 51402 acceptances. As the data set is quite large in this case, 10-fold cross-validation is not used to measure performance. Instead, the data is randomly shuffled and divided into a training set of 80% and a test set of 20%. As the ratio of disagreements to agreements is roughly 2:1, no down- or upsampling is applied.
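A minimal sketch of this evaluation set-up is shown below, assuming X_abcd and y_abcd are the feature matrix and labels built from the parsed proposal-response pairs; the variable names are illustrative.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

def evaluate_abcd(X_abcd, y_abcd, seed=42):
    # Shuffle and split into 80% training and 20% test data.
    X_train, X_test, y_train, y_test = train_test_split(
        X_abcd, y_abcd, test_size=0.2, shuffle=True, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    # Per-class precision, recall and F1 for agreements and disagreements.
    return classification_report(y_test, model.predict(X_test))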


Feature                                          Coefficient
Negative-positive pattern (using parallelism)    4.99
Positive-negative relative polarity              4.98
Response contains the bigram "Yeah but"          4.15
Response contains "But"                          3.91
Acceptance cues (yeah, okay, accept, ...)        3.37
Response length                                  0.91

Figure 6: The coefficients of every feature in the subset of most important features (tested on the ABCD corpus)

7.2 Results

The coefficients of the features, shown in figure 6, are almost identical to those obtained on the AMI corpus. Again, the response length feature contributes fairly little to the model. The F1 score on identifying disagreements is quite high at 76.82. Agreement detection performs worse, however, with an F1 score of 35.74, mostly due to the low recall of acceptances at 27.17. The reason for this might lie in the observation that in written online discussions, the way a response is formulated can vary greatly: when disagreeing, users often just give a counterargument without any explicit indication of disagreement. This entry taken from the ABCD corpus is a good example:

1. P) "So you reject any logic that goes against God that is logically sound. You don't care about logic, you care about God only. The logic you support is only God logic, not complete logic."

R) "God is complete logic. Anything else is wholly inadequate."

Here, the respondent disagrees, but implicitly; instead of mentioning anything about agreeing or disagreeing, the respondent simply formulates a different line of reasoning. There is a slight negative-positive pattern to be detected here, consisting of "not complete logic" and "is complete logic". Unfortunately, the parallelism feature does not pick this up in this case.

When comparing these results to the related work by Rosenthal and McKeown (2015) on the ABCD corpus, this model performs better in terms of F1 score, with 77 compared to 56 for disagreements and 27 compared to 26 for agreements. However, their model included a third class called none, which captures proposal-response pairs where the respondent neither agrees nor disagrees; this may make it harder for their model to achieve a higher F1 score.

8 Discussion

This paper presents an in-depth analysis of the significance of each feature used in the paper by Schlöder and Fernández (2014), and manages to keep the F1 score close to the original while reducing the feature set from 15 to 6 features. The six features that proved most useful from the original model were positive-negative polarity, negative-positive parallelism, checking for "But" and "Yeah but" in the response, acceptance cues and response length. Contrary to other related work, length contributes less in this model than would be expected, as it improves the F1 score by less than one point and is insignificant in a paired t-test (p>0.05). Using downsampling to balance out imbalanced data sets such as the AMI corpus proves to be crucial for making a logistic regression approach viable, as it drastically improves the F1 score in these cases, from 49 to 61.

When testing features in this model, combining PCA, to analyse the number of dimensions needed in the feature set, with an inspection of the coefficient of each feature in the model is a useful tool. These two techniques make it possible to form conjectures about the usefulness of each feature, which can then be tested on a new subset of features. However, testing each feature individually for its contribution to the performance of the model is still needed to draw conclusions. When testing the new subset of features described in section 6.2, not every feature with a high coefficient in the original model turned out to be relevant for the reduced model. This shows that the relevance of any added feature can depend greatly on the features already present in the model. For example, the TextBlob sentiment features did not improve the performance of the model when added, even though response sentiment scores high in the list of coefficients.

When testing the final model on written online discussions in the form of the ABCD corpus, which is quite different from the spoken dialogue it is designed for, it still performs well compared to other related work (Rosenthal & McKeown, 2015). This indicates that the six most important features found in this model are truly general, as they perform well for the same task on completely different data sets.

There are certain drawbacks to analysing spoken dialogue converted into a written data set. Some information that is quite relevant to identifying agreement and disagreement gets lost in the process, such as tone of voice. This makes utterances like "Why not" hard to interpret: it can be a genuine question, which would indicate disagreement, or a rhetorical question, in which case the respondent agrees with the proposal. Other phenomena, such as sarcasm, also become impossible to identify in some cases.

The final model also has some issues with identifying double negatives and interpreting them as a positive. These are two examples from the list of false negatives, which are acceptances that the model identified as rejections:

1. P) "uh you got some pretty cheap labour that can do this case for one euro"
R) "that's not bad"

2. P) "i just lost my microphone"
R) "no problem"

Both responses in these examples negate something negative with no or not. As a result, the positive-negative relative polarity feature becomes true in both instances, making the model classify them as rejections. To fix this issue, sentiment could be combined with the polarity feature(s): checking the sentiment of the word that follows a negation like no or not would reveal whether a positive or a negative word is being negated.
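The sketch below illustrates this suggested fix using TextBlob; it is only an illustration of the idea and not part of the final model, and the simple whitespace tokenisation is an assumption.

from textblob import TextBlob

NEGATIONS = {"no", "not"}

def negated_word_sentiment(response):
    # Return the sentiment of the word that directly follows a negation.
    tokens = response.lower().split()
    for i, token in enumerate(tokens[:-1]):
        if token in NEGATIONS:
            return TextBlob(tokens[i + 1]).sentiment.polarity
    return 0.0

# For "that's not bad", the negated word "bad" has negative sentiment,
# so the double negative can be read as a (weakly) positive response.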

The use of PCA improves the F1 score in some cases, but in all tests performed for this paper the increase proves to be insignificant (p>0.05). However, PCA might still be useful in certain instances to reduce noise in a data set, or to combine strongly correlated features into one. Furthermore, cases where the proposal is negative remain hard to capture in this model; relying on sentiment in these cases is not consistent, as shown by the examples in section 6.1. This research is left for future work.

References

Allen, K., Carenini, G., & Ng, R. (2014, October). Detecting disagreement in conversations using pseudo-monologic rhetorical structure. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1169-1180). Doha, Qatar: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/D14-1124 doi: 10.3115/v1/D14-1124

Brown, P., & Levinson, S. C. (1987). Politeness: Some universals in language usage (Vol. 4). Cambridge University Press.

McCowan, I., Carletta, J., Kraaij, W., Ashby, S., Bourban, S., Flynn, M., . . . Wellner, P. (2005). The AMI meeting corpus. In L. Noldus, F. Grieco, L. Loijens, & P. Zimmerman (Eds.), Proceedings of measuring behavior 2005, 5th international conference on methods and techniques in behavioral research (pp. 137-140). Noldus Information Technology.

Misra, A., & Walker, M. A. (2017). Topic independent identification of agreement and disagreement in social media dialogue. CoRR, abs/1709.00661. Retrieved from http://arxiv.org/abs/1709.00661

Ng, A. Y., & Jordan, M. I. (2002). On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. In Advances in neural information processing systems (pp. 841-848).

Rosenthal, S., & McKeown, K. (2015). I couldn’t agree more: The role of conversational structure in agreement and disagreement detection in online discussions. In Proceedings of the 16th annual meeting of the special interest group on discourse and dialogue (pp. 168–177). Prague, Czech Republic: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W15-4625 doi: 10.18653/v1/W15-4625

Schlöder, J., & Fernández, R. (2014, June). The role of polarity in inferring acceptance and rejection in dialogue. In Proceedings of the 15th annual meeting of the special interest group on discourse and dialogue (SIGDIAL) (pp. 151-160). Philadelphia, PA, U.S.A.: Association for Computational Linguistics. Retrieved from https://www.aclweb.org/anthology/W14-4321 doi: 10.3115/v1/W14-4321


Wang, L., & Cardie, C. (2016). Improving agreement and disagreement identification in online discussions with a socially-tuned sentiment lexicon. CoRR, abs/1606.05706. Retrieved from http://arxiv.org/abs/1606.05706

Wang, W., Yaman, S., Precoda, K., Richey, C., & Raymond, G. (2011). Detection of agreement and disagreement in broadcast conversations. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies: Short papers - volume 2 (pp. 374–378). Stroudsburg, PA, USA: Association for Computational Linguistics. Retrieved from http://dl.acm.org/citation.cfm?id=2002736.2002813
