5.2 Discussion Utterance Classification
5.2.2 Results
Every utterance in the RfC-Predecessor discussion pairs is labeled with argumentative attributes in the three dimensions. In this section, we analyze each dimension separately to find prospective trends in the pairs of discussions. As a reminder of the concepts behind each class of the three-dimensional argumentative attributes, we refer the reader to Table 1. Table 12 demonstrates how utterances in the RfC-Predecessor discussions look after the discussion utterance classification step.
Table 12: Labeled utterance examples in the RfC-Predecessor corpus

Utterance                                            Act            Relation  Frame
"Hi, I would please like some outside eyes..."       social-act     neutral   dialogue
"Should the lead of this article mention that...?"   questioning    attack    verifiability
"There is a proposal to move from Syro-Palest..."    understanding  support   writing
Figure 8 compares successful (RfC) and unsuccessful (Predecessor) discussions with respect to the average distribution of discourse act classes. The most frequent class in both RfC and Predecessor discussions is understanding, and the least frequent is social-act. In each RfC discussion (considered successful), on average 43% of utterances are labeled as understanding, whereas in the Predecessor discussions 51% of utterances are classified as understanding. The figure also shows that RfC discussions contain more recommendation and finalization actions than their corresponding Predecessors: in RfC conversations, 15% of utterances are labeled as recommendation and 20% as finalization, while in the corresponding Predecessor discussions these percentages are 9% and 15%, respectively. Overall, the chart reveals that recommendation and finalization actions occur more frequently in successful discussions, whereas unsuccessful discussions contain more utterances labeled as understanding, questioning, and evidence.
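The per-discussion averages discussed above can be reproduced with a few lines of code. The sketch below uses a hypothetical `discussions` structure (one list of act labels per discussion; toy data, not the real corpus) and averages the per-discussion percentage of each class:

```python
from collections import Counter

def avg_class_distribution(discussions, classes):
    """Average, over discussions, of the per-discussion percentage of
    utterances labeled with each class."""
    totals = Counter()
    for labels in discussions:  # labels: one act label per utterance
        counts = Counter(labels)
        n = len(labels)
        for c in classes:
            totals[c] += 100.0 * counts[c] / n
    return {c: totals[c] / len(discussions) for c in classes}

# Toy example (hypothetical labels, not real corpus data):
discussions = [
    ["understanding", "understanding", "recommendation", "finalization"],
    ["questioning", "understanding", "social-act", "finalization"],
]
dist = avg_class_distribution(
    discussions,
    ["understanding", "recommendation", "finalization",
     "questioning", "social-act", "evidence"],
)
```

Averaging per-discussion percentages, rather than pooling all utterances, weights short and long discussions equally, which matters when discussion lengths vary widely.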
Figure 9 compares the successful and failed discussions with respect to the argumentative relation attributes. It shows that over 55 percent of both RfC and Predecessor utterances are classified as neutral. Interestingly, successful discussions contain more support and fewer attack actions than failed discussions: while 21 percent of the utterances in each Predecessor discussion are labeled as attack, only 14 percent of the utterances in RfC discussions fall into this class. Conversely, RfC conversations contain more utterances with a support label, at over 30 percent, compared with 22% in the Predecessor discussions.
The distribution of frame attribute classes in RfC-Predecessor discussions is illustrated in Figure 10. It shows that, on average, 37 percent of utterances in RfC discussions are classified as dialogue.
Figure 8: Average distribution of discourse act classes in RfC-Predecessor pairs
Figure 9: Average distribution of argumentative relation classes in RfC-Predecessor pairs
Figure 10: Average distribution of frame classes in RfC-Predecessor pairs
In the Predecessor discussions, this percentage is 28. In contrast, the verifiability class accounts for 20 percent of utterances in RfC conversations and 28 percent in Predecessor discussions. The average distributions of the neutral and writing classes are roughly similar, ranging between 32 and 35 percent.
Error analysis
In Section 5.1.2, we performed a thorough error analysis using the labeled data provided in the Webis-WikiDebate-18 corpus to identify common errors made by the classifiers. In this section, we examine how well the models classified the utterances in the RfC-Predecessor discussions. To this end, we randomly select a fraction of the utterances of the corpus and manually assess the performance of the three classifiers.
Discourse act We select 100 utterances from the labeled RfC-Predecessor pairs corpus to analyze how the model classified the texts in the discourse act dimension. In manually exploring the randomly selected samples, we noticed that 26% of them are too ambiguous for us as humans to assign to a single discourse act category. Many of these ambiguous texts are very short texts or titles, such as "SAT Score request for comment", "Romanians = Vlachs", "requested move", "I removed it", "this article is load of crap", and "dispute resolution"; we put these aside in our analysis.
Table 13 shows the confusion matrix of the remaining 74% of the samples. Based on this table, we calculate the model's precision, recall, and macro-averaged F1-Score, presented in Table 16. The model classified the samples with a macro F1-Score of 0.77 in the discourse act dimension. According to the table, the model performed best on the social-act, understanding, and evidence classes, with F1-Scores of 0.86, 0.84, and 0.83, respectively. The table also shows that the model has some difficulty predicting the questioning, recommendation, and finalization classes, with F1-Scores of 0.62, 0.72, and 0.77, respectively. At the same time, we noticed that 11% of the mistakes are rooted in the fact that some utterances could be labeled with more than one class. This is mainly the case for the questioning, social-act, and recommendation classes: for instance, a text that contains a question, a social act, or a recommendation may also contain other sentences that can be categorized as evidence, understanding, or the like. Overall, the model performs as expected from the training phase of the argumentative attribute classification step.
Table 13: Discourse act confusion matrix of the manual error analysis of the RfC-Predecessor pairs corpus after the discussion utterance classification step.
Actual \ Predicted  Evidence  Finalization  Questioning  Recommendation  Social-act  Understanding
Evidence                10         1             0              1             0             1
Finalization             0         5             0              0             0             1
Questioning              0         0             4              2             0             2
Recommendation           1         0             1              9             0             1
Social-act               0         0             0              0             3             0
Understanding            1         2             0              1             1            27
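The per-class scores can be derived mechanically from the confusion matrix in Table 13. The sketch below uses the standard one-vs-rest definitions (column sums give predicted counts, row sums give actual counts); minor rounding differences from the reported figures are possible:

```python
# Rows are actual classes, columns are predicted classes (layout of Table 13).
labels = ["evidence", "finalization", "questioning",
          "recommendation", "social-act", "understanding"]
cm = [
    [10, 1, 0, 1, 0, 1],
    [0, 5, 0, 0, 0, 1],
    [0, 0, 4, 2, 0, 2],
    [1, 0, 1, 9, 0, 1],
    [0, 0, 0, 0, 3, 0],
    [1, 2, 0, 1, 1, 27],
]

def per_class_metrics(cm, i):
    tp = cm[i][i]
    fp = sum(cm[r][i] for r in range(len(cm))) - tp  # column sum minus TP
    fn = sum(cm[i]) - tp                             # row sum minus TP
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for i, label in enumerate(labels):
    p, r, f1 = per_class_metrics(cm, i)
    print(f"{label:15s} P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

For example, the questioning row reproduces the reported precision of 0.80 and recall of 0.50, and the understanding row reproduces 0.84 for both.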
Argumentative relation We also examine the argumentative relation model to see how it classified the utterances of the RfC-Predecessor pairs. Again, we select 100 random utterances from the corpus and assess the model's performance manually. As in the discourse act analysis, we come across some texts that are difficult to assign to a particular class (12% are very short and 3% need more context). Table 14 is a confusion matrix showing the model's performance in determining the argumentative relation class of the remaining 88% of the samples. We calculate the precision, recall, and F1-Score shown in Table 16. The model performs best on neutral utterances, with a 0.84 F1-Score; it successfully associates neutral cues such as "perhaps", "unsure", "please", "reference", and question marks with the neutral class. However, performance drops to 0.63 and 0.66 for the attack and support classes, respectively. As shown in Table 16, the model identifies attack with a high precision of 0.83, but the number of false negatives (attack utterances predicted as neutral) lowers its recall to 0.50 and thus the overall score. The situation is reversed for the support class, with 0.75 recall and 0.63 precision. Overall, the model classified the argumentative relation classes with a macro-averaged F1-Score of 0.72, after removing short texts from the sample.
Table 14: Argumentative relation confusion matrix of the manual error analysis of the RfC-Predecessor pairs corpus after the discussion utterance classification step.
Actual \ Predicted  Attack  Neutral  Support
Attack                 10       7        3
Neutral                 2      46        4
Support                 0       4       12
Frame In the frame dimension, we conducted the error analysis in the same way as for the other two dimensions. Table 15 shows the confusion matrix of our manual analysis. We found that classifying dialogue actions is the most difficult part of the task: the model often confuses the dialogue, verifiability, and neutral classes. This is primarily due to the lack of multi-labeling capability, since some texts can be categorized into multiple classes. Table 16 shows the class-level performance of the model: it performed best on neutral utterances, with an F1-Score of 0.78, while dialogue actions were the most challenging, with an F1-Score of 0.72. The model's performance is lower than in the previous step: on the Webis-WikiDebate-18 corpus it classified utterances with an F1-Score of 0.77 (see Table 9), a three-point decrease compared with its performance on the RfC-Predecessor corpus.
Table 15: Frame confusion matrix of the manual error analysis of the RfC-Predecessor pairs corpus after the discussion utterance classification step.
Actual \ Predicted  Dialogue  Neutral  Verifiability  Writing
Dialogue                13        3          2            1
Neutral                  1       26          4            0
Verifiability            3        3         24            3
Writing                  0        4          1           12
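As a sanity check, the frame rows of Table 16 can be reproduced from Table 15 alone. The sketch below expands the confusion matrix back into (actual, predicted) index pairs and macro-averages the per-class F1-Scores:

```python
# Rows are actual classes, columns are predicted classes (layout of Table 15).
labels = ["dialogue", "neutral", "verifiability", "writing"]
cm = [
    [13, 3, 2, 1],
    [1, 26, 4, 0],
    [3, 3, 24, 3],
    [0, 4, 1, 12],
]

# Expand the matrix into one (actual, predicted) pair per counted utterance.
pairs = [(a, p) for a in range(4) for p in range(4) for _ in range(cm[a][p])]

def f1(i):
    tp = sum(1 for a, p in pairs if a == i and p == i)
    predicted = sum(1 for _, p in pairs if p == i)
    actual = sum(1 for a, _ in pairs if a == i)
    prec, rec = tp / predicted, tp / actual
    return 2 * prec * rec / (prec + rec)

macro_f1 = sum(f1(i) for i in range(4)) / 4
```

Rounded to two decimals, this reproduces the frame F1-Scores reported in Table 16: dialogue 0.72, neutral 0.78, verifiability 0.75, writing 0.73, and a macro average of 0.74.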
Table 16: Error analysis results of the manual evaluation of the models' performance in classifying the RfC-Predecessor pairs corpus across the three dimensions

Dimension/Class  Precision  Recall  F1-Score
Discourse act
evidence 0.83 0.83 0.83
finalization 0.71 0.83 0.77
questioning 0.80 0.50 0.62
recommendation 0.69 0.75 0.72
social-act 0.75 1.00 0.86
understanding 0.84 0.84 0.84
macro avg 0.77 0.79 0.77
Relation
attack 0.83 0.50 0.63
neutral 0.81 0.88 0.84
support 0.63 0.75 0.66
macro avg 0.76 0.71 0.72
Frame
dialogue 0.76 0.68 0.72
neutral 0.72 0.84 0.78
verifiability 0.77 0.73 0.75
writing 0.75 0.71 0.73
macro avg 0.75 0.74 0.74