
5.2 Discussion Utterance Classification

5.2.2 Results

Every utterance in the RfC-Predecessor discussion pairs is labeled with argumentative attributes along the three dimensions. In this section, we analyze each dimension separately to identify trends that distinguish the paired discussions. For a reminder of the concepts behind each class of the three-dimensional argumentative attributes, we refer the reader to table 1. Table 12 shows examples of how utterances in the RfC-Predecessor discussions look after the discussion utterance classification step.

Table 12: Labeled utterance examples in the RfC-Predecessor corpus

Utterance                                           Act            Relation   Frame

Hi, I would please like some outside eyes...        social-act     neutral    dialogue
Should the lead of this article mention that...?    questioning    attack     verifiability
There is a proposal to move from Syro-Palest...     understanding  support    writing
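
For reproducibility, the labeling step itself can be sketched as follows. This is a minimal sketch rather than the exact implementation: it assumes three Hugging Face text-classification pipelines loaded from hypothetical checkpoint paths (models/discourse-act, models/argumentative-relation, models/frame) standing in for the classifiers fine-tuned in the previous step.

```python
from transformers import pipeline

# Hypothetical checkpoint paths; substitute the actual classifiers
# fine-tuned on the Webis-WikiDebate-18 corpus.
act_clf = pipeline("text-classification", model="models/discourse-act")
relation_clf = pipeline("text-classification", model="models/argumentative-relation")
frame_clf = pipeline("text-classification", model="models/frame")

def label_utterance(text: str) -> dict:
    """Attach one label per argumentative dimension to a single utterance."""
    return {
        "utterance": text,
        "act": act_clf(text)[0]["label"],
        "relation": relation_clf(text)[0]["label"],
        "frame": frame_clf(text)[0]["label"],
    }

print(label_utterance("Hi, I would please like some outside eyes on this article."))
# e.g. {'utterance': '...', 'act': 'social-act', 'relation': 'neutral', 'frame': 'dialogue'}
```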

Figure 8 compares successful (RfC) and unsuccessful (Predecessor) discussions with respect to the average distribution of discourse act classes per discussion. The most frequent class in both RfC and Predecessor discussions is understanding, and the least frequent one is social-act. In each RfC discussion (considered successful), on average 43% of utterances are labeled as understanding, whereas in the Predecessor discussions 51% of utterances fall into this class. The figure also shows that RfC discussions contain more recommendation and finalization actions than their corresponding Predecessors: in RfC conversations, 15% of utterances are labeled as recommendation and 20% as finalization, while in the corresponding Predecessor discussions these percentages are 9% and 15%, respectively. Overall, the chart reveals that in successful discussions, recommendation and finalization actions occur more frequently, whereas unsuccessful discussions contain more utterances labeled as understanding, questioning, and evidence.
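
For reference, the per-discussion averages behind Figure 8 (and, analogously, Figures 9 and 10) can be computed along the following lines. This is a sketch under the assumption that the labeled utterances are stored in a table with columns discussion_id, discussion_type, and act; the column names and toy rows are illustrative, not the actual data layout.

```python
import pandas as pd

# Toy rows standing in for the labeled corpus: one row per utterance, with the
# discussion it belongs to, whether that discussion is an RfC or its Predecessor,
# and the predicted discourse act class.
df = pd.DataFrame([
    ("d1", "RfC", "understanding"), ("d1", "RfC", "recommendation"),
    ("d1", "RfC", "finalization"),  ("d2", "Predecessor", "understanding"),
    ("d2", "Predecessor", "questioning"), ("d2", "Predecessor", "understanding"),
], columns=["discussion_id", "discussion_type", "act"])

# Share of each class within every single discussion (in percent)...
per_discussion = (
    df.groupby(["discussion_type", "discussion_id"])["act"]
      .value_counts(normalize=True)
      .mul(100)
      .rename("pct")
      .reset_index()
)

# ...then averaged over all discussions of each type: the quantity plotted in Figure 8.
avg_distribution = (
    per_discussion
      .groupby(["discussion_type", "act"])["pct"]
      .mean()
      .unstack(fill_value=0)
)
print(avg_distribution.round(1))
```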

Figure 9 compares the successful and failed discussions with respect to the argumentative relation attributes. It shows that over 55 percent of utterances in both RfC and Predecessor discussions are classified as neutral. Interestingly, successful discussions contain more support and fewer attack actions than failed discussions: while in Predecessor discussions 21 percent of the utterances in each conversation are labeled as attack, only 14 percent of the utterances in RfC discussions fall into this class. On the other hand, RfC conversations contain more utterances with a support label, at over 30 percent, whereas the percentage of support utterances in the Predecessor discussions is 22%.

The distribution of frame attribute classes in the RfC-Predecessor discussions is illustrated in Figure 10. It shows that, on average, 37 percent of utterances in RfC discussions are classified as dialogue, while the corresponding percentage for Predecessor discussions is 28. In contrast, the verifiability class accounts for 20 percent of the utterances in RfC conversations and 28 percent in Predecessor discussions. The average distributions of the neutral and writing classes are roughly similar, ranging between 32 and 35 percent.


Figure 8: Average distribution of discourse act classes in RfC-Predecessor pairs


Figure 9: Average distribution of argumentative relation classes in RfC-Predecessor pairs


Figure 10: Average distribution of frame classes in RfC-Predecessor pairs


Error analysis

In section 5.1.2, we performed a thorough error analysis using the labeled data provided in the Webis-WikiDebate-18 corpus to identify common errors made by the classifiers. In this section, we examine how well the models classify utterances in the RfC-Predecessor discussions. To this end, we randomly select a fraction of the utterances of the corpus and manually assess the performance of the three classifiers.

Discourse act We select 100 utterances from the labeled RfC-Predecessor pairs corpus to analyze how the model classifies texts in the discourse act dimension. While manually exploring the randomly selected samples, we noticed that 26% of them are too ambiguous even for us as humans to assign to a discourse act category. Many of these ambiguous texts are very short texts and titles, such as "SAT Score request for comment", "Romanians = Vlachs", "requested move", "I removed it", "this article is load of crap", or "dispute resolution"; we put these aside in our analysis.

Table 13 shows the confusion matrix of the remaining 74% of the samples. Based on this table, we calculate the model's performance using precision, recall, and macro-averaged F1-Score, presented in table 16. It shows that the model classified the samples with an F1-Score of 0.77 in the discourse act dimension. According to the table, the model performed best in classifying the social-act, understanding, and evidence classes, with F1-Scores of 0.86, 0.84, and 0.83, respectively. The table also shows that the model encounters some difficulties in predicting the questioning, recommendation, and finalization classes, with F1-Scores of 0.62, 0.72, and 0.77, respectively. At the same time, we noticed that 11% of the mistakes are rooted in the fact that some utterances could be labeled with more than one class. This is mainly the case for the questioning, social-act, and recommendation classes: for instance, a text that contains a question, a social act, or a recommendation may also contain sentences that can be categorized as evidence, understanding, or the like. Overall, the model performs in line with what we observed during training in the argumentation attribute classification step.

Table 13: Discourse act confusion matrix of the manual error analysis of the RfC-Predecessor pairs corpus after the discussion utterance classification step.

Actual \ Predicted   Evidence  Finalization  Questioning  Recommendation  Social-act  Understanding

Evidence                 10         1             0              1             0             1
Finalization              0         5             0              0             0             1
Questioning               0         0             4              2             0             2
Recommendation            1         0             1              9             0             1
Social-act                0         0             0              0             3             0
Understanding             1         2             0              1             1            27
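
The per-class scores reported in table 16 are standard precision, recall, and F1 values derived from these confusion matrices. The following is a minimal sketch of that computation, using the counts of Table 13 (class order as in the table header):

```python
import numpy as np

# Confusion matrix from Table 13: rows = actual classes, columns = predicted classes.
classes = ["evidence", "finalization", "questioning",
           "recommendation", "social-act", "understanding"]
cm = np.array([
    [10, 1, 0, 1, 0, 1],
    [ 0, 5, 0, 0, 0, 1],
    [ 0, 0, 4, 2, 0, 2],
    [ 1, 0, 1, 9, 0, 1],
    [ 0, 0, 0, 0, 3, 0],
    [ 1, 2, 0, 1, 1, 27],
])

tp = np.diag(cm).astype(float)
precision = tp / cm.sum(axis=0)   # correct predictions / all predictions of that class
recall    = tp / cm.sum(axis=1)   # correct predictions / all actual samples of that class
f1        = 2 * precision * recall / (precision + recall)

for name, p, r, f in zip(classes, precision, recall, f1):
    print(f"{name:15s} P={p:.2f} R={r:.2f} F1={f:.2f}")
print(f"{'macro avg':15s} P={precision.mean():.2f} "
      f"R={recall.mean():.2f} F1={f1.mean():.2f}")
```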

Argumentative relation We also examine the argumentative relation model to see how it classified the utterances of the RfC-Predecessor pairs. Again, we select 100 random utterances from the corpus and assess the model's performance manually. Similar to the discourse act analysis, we come across some texts that are difficult to assign to a particular class (12% are very short texts and 3% need more context). Table 14 is a confusion matrix that shows the model's performance in determining the argumentative relation class of the remaining 88% of the samples. We calculate the precision, recall, and F1-Score shown in table 16. The model performs best in classifying neutral utterances, with a 0.84 F1-Score; it is successful in associating neutral words like perhaps, unsure, please, reference, and questions (?) with the neutral class. However, the performance drops to 0.63 and 0.66 for the attack and support classes, respectively. As shown in table 16, the model identifies attack with a precision of 0.83, but the number of false negatives, i.e. attack utterances predicted as neutral, lowers the overall performance. The situation is reversed for the support class, with a recall of 0.75 and a precision of 0.63. Overall, the model classified the argumentative relation classes with a macro-averaged F1-Score of 0.72, after the short texts were removed from the corpus.

Table 14: Argumentative relation confusion matrix of the manual error analysis of the RfC-Predecessor pairs corpus after the discussion utterance classification step.

Actual \ Predicted   Attack  Neutral  Support

Attack                  10       7        3
Neutral                  2      46        4
Support                  0       4       12

Frame In the frame dimension, we conducted the error analysis in the same way as for the other two dimensions. Table 15 shows the confusion matrix of our manual analysis. We found that classifying dialogue actions is the most difficult part of the task; the model often confuses the dialogue, verifiability, and neutral classes. This is primarily due to the lack of multi-labeling capability, as some texts can be categorized into multiple classes. Table 16 shows the class-level performance of the model: it performed best in classifying neutral utterances, with an F1-Score of 0.78, while classifying dialogue actions, with an F1-Score of 0.72, was the most challenging task. The model's performance is lower than in the previous step: on the Webis-WikiDebate-18 corpus it classified utterances with an F1-Score of 0.77 (see table 9), compared to 0.74 here, a 3% decrease in performance on the RfC-Predecessor corpus.

Table 15: Frame confusion matrix of the manual error analysis of the RfC-Predecessor pairs corpus after the discussion utterance classification step.

Actual \ Predicted   Dialogue  Neutral  Verifiability  Writing

Dialogue                 13        3          2            1
Neutral                   1       26          4            0
Verifiability             3        3         24            3
Writing                   0        4          1           12

Table 16: Error analysis results of the manual evaluation of the models' performance in classifying the RfC-Predecessor pairs corpus along the three dimensions

Dimension/Class Precision Recall F1-Score

Discourse act

evidence 0.83 0.83 0.83

finalization 0.71 0.83 0.77

questioning 0.80 0.50 0.62

recommendation 0.69 0.75 0.72

social-act 0.75 1.00 0.86

understanding 0.84 0.84 0.84

macro avg 0.77 0.79 0.77

Relation

attack 0.83 0.50 0.63

neutral 0.81 0.88 0.84

support 0.63 0.75 0.66

macro avg 0.76 0.71 0.72

Frame

dialogue 0.76 0.68 0.72

neutral 0.72 0.84 0.78

verifiability 0.77 0.73 0.75

writing 0.75 0.71 0.73

macro avg 0.75 0.74 0.74