


["understanding"]])

In the fifth step, we use the PLM, template, and verbalizer created in the previous steps to build a prompt model with the PromptForClassification class:

prompt_model = PromptForClassification(plm=plm, template=mytemplate, verbalizer=myverbalizer)

In the next step, we define a data loader using the PromptDataLoader class, with a maximum sequence length of 512 and a batch size of 4. It is important to note that although we use the T5 PLM for classification, we do not use its decoder for text generation; the decoder only needs to produce the <pad>, <extra_id_0>, and <eos> tokens, so we set the decoder's maximum length to 3 to save memory. Finally, we train the model in a standard way, using cross-entropy loss as the loss function and the Adam optimizer with a learning rate of 1e-4, for 4 epochs on the Webis-WikiDebate-18 corpus.
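A minimal sketch of this step and the training loop, assuming the train_dataset, tokenizer, and WrapperClass objects produced in the earlier steps (for example by OpenPrompt's load_plm) and the prompt_model defined above; these names are placeholders, not taken from the thesis code:

from openprompt import PromptDataLoader
import torch

# Wrap the training split; decoder_max_length=3 because the T5 decoder only
# has to emit <pad>, <extra_id_0>, and <eos>.
train_dataloader = PromptDataLoader(dataset=train_dataset, template=mytemplate,
    tokenizer=tokenizer, tokenizer_wrapper_class=WrapperClass,
    max_seq_length=512, decoder_max_length=3, batch_size=4, shuffle=True)

loss_func = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(prompt_model.parameters(), lr=1e-4)

for epoch in range(4):
    for inputs in train_dataloader:
        logits = prompt_model(inputs)             # class scores from the verbalizer
        loss = loss_func(logits, inputs["label"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()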

5.1.2 Results

Evaluation metrics

As shown in Table 6, the training data in all three dimensions is imbalanced. We therefore evaluate our classifiers using macro-averaged F1-score, precision, and recall.
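These scores can be obtained, for instance, with scikit-learn's classification_report; in the sketch below, y_true, y_pred, and class_names are placeholders for the gold labels, model predictions, and class names of a test split:

from sklearn.metrics import classification_report

# Per-class precision, recall, and F1, plus the "macro avg" row that corresponds
# to the macro-averaged scores reported in Tables 7-9.
print(classification_report(y_true, y_pred, target_names=class_names, digits=2))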

Table 7: The results of discourse act attribute classifiers, using BERT, RoBERTa, and T5 in prompt-based learning

Model          Class            Precision  Recall  F1-Score
Baseline       evidence         0.75       0.42    0.54
               finalization     0.52       0.86    0.64
               questioning      0.00       0.00    0.00
               recommendation   0.00       0.00    0.00
               social-act       1.00       0.13    0.24
               understanding    0.56       0.70    0.62
               macro avg        0.47       0.35    0.34
BERT           evidence         0.92       0.71    0.80
               finalization     0.81       0.81    0.81
               questioning      0.59       0.84    0.70
               recommendation   0.46       0.50    0.48
               social-act       0.70       0.47    0.56
               understanding    0.66       0.76    0.71
               macro avg        0.69       0.68    0.68
RoBERTa        evidence         0.80       0.66    0.73
               finalization     0.76       0.80    0.78
               questioning      0.55       0.89    0.68
               recommendation   0.80       0.17    0.28
               social-act       0.67       0.40    0.50
               understanding    0.61       0.75    0.68
               macro avg        0.70       0.61    0.61
Prompting+T5   evidence         0.82       0.62    0.71
               finalization     0.74       0.79    0.77
               questioning      0.88       0.79    0.83
               recommendation   0.50       0.38    0.43
               social-act       0.50       0.27    0.35
               understanding    0.59       0.76    0.66
               macro avg        0.67       0.60    0.63

Argumentative relation classifier

Table 8 presents the detailed results of the classifiers developed for the "argumentative relation" attribute. The fine-tuned BERT model exhibits the best performance with a macro-averaged F1-score of 0.64. However, the performance of the other two models is relatively close, with F1-scores of 0.63. For all three models, the most difficult class to classify is "attack". The prompting approach performs similarly to BERT in classifying "attack" with an F1-score of 0.58, while the fine-tuned RoBERTa model achieves an F1-score of 0.57. In contrast, BERT and RoBERTa classify the "neutral" class with an F1-score of 0.65, outperforming the prompting approach's 0.64. All three models achieve the same performance in classifying "support" utterances with an F1-score of 0.67. Overall, all three transformer-based models demonstrate similar performance in argumentative relation classification, surpassing the baseline by over 10%.

Frame classifier

The frame classifiers exhibit better performance due to the availability of a large amount of labeled data. Table 9 shows that all three models outperform the baseline by at least 11%. The fine-tuned RoBERTa model performs best with a macro-averaged F1-score of 0.79, and the BERT model demonstrates the second-best performance with an F1-score of 0.77. Our findings confirm that the prompting approach is not as effective when applied to large datasets, as previously reported in [85]; specifically, our prompting model achieves an F1-score of 0.74. At the class level, the RoBERTa model achieves the best results in all categories. Comparing BERT and RoBERTa at the class level reveals that both models perform similarly on certain classes. For example, RoBERTa classifies the "dialogue" and "neutral" classes with F1-scores of 0.82 and 0.74, respectively, while the BERT model achieves 0.80 and 0.73 on the same classes. In contrast, when classifying the "verifiability" and "writing" classes, RoBERTa performs better than BERT.

Specifically, RoBERTa achieves F1-scores of 0.81 and 0.80 on these classes, while BERT achieves 0.77 for both. Additionally, our prompting model classifies the "dialogue", "verifiability", "writing", and "neutral" classes with F1-scores of 0.78, 0.78, 0.72, and 0.71, respectively.

Error analysis

In this section, we investigate the misclassified data from our best-performing models to uncover the reasons behind common mistakes made in the argumentative attribute classification stage.

Discourse act We analyze the discourse act classifier errors and categorize them into four different groups. To this end, we randomly selected nearly 50% of the misclassified records in the data. Our analysis revealed that 40% of the examined data could be assigned to more than one class. In these cases, the predicted class may seem reasonable, but the model fails to predict the gold class because the training data is labeled with only one class, even when a text could belong to multiple classes (see Table 13 for examples). On the other hand, 28% of the samples are mislabeled. For instance, the text "thank you. I really appreciate your input on this matter." was labeled as evidence, but the model predicted social-act, which is the more logical classification. Another mislabeled record is the text "studio chief of walt disney pictures or the animation studios ?", which was labeled as understanding but is correctly classified as questioning by the model. The same holds for the text "is there a reference for the first recorded use of the name clydesdale?", which has an understanding label but is classified as evidence, mainly due to the presence of the reference token. In 12% of the incorrectly classified samples, the model classified the texts based on individual words without considering the context. Furthermore, punctuation can aid in classifying a text as a question, but it can also confuse the model: another 12% of the samples are misclassified due to misleading punctuation. However, according to our experiments (see Appendix A), punctuation removal does not improve the model's performance. Lastly, we place the remaining 8% of errors in the unclear-text category, where the texts are difficult to classify even for a human.

Table 8: The results of argumentative relation attribute classifiers, using BERT, RoBERTa, and T5 in prompt-based learning

Model          Class      Precision  Recall  F1-Score
Baseline       attack     0.52       0.44    0.48
               neutral    0.56       0.49    0.52
               support    0.55       0.67    0.60
               macro avg  0.54       0.53    0.53
BERT           attack     0.59       0.57    0.58
               neutral    0.68       0.63    0.65
               support    0.65       0.71    0.67
               macro avg  0.64       0.63    0.64
RoBERTa        attack     0.63       0.52    0.57
               neutral    0.72       0.59    0.65
               support    0.60       0.77    0.67
               macro avg  0.65       0.63    0.63
Prompting+T5   attack     0.62       0.55    0.58
               neutral    0.65       0.64    0.64
               support    0.63       0.71    0.67
               macro avg  0.63       0.63    0.63

Table 9: The results of frame attribute classifiers, using BERT, RoBERTa, and T5 in prompt-based learning

Model          Class          Precision  Recall  F1-Score
Baseline       dialogue       0.62       0.60    0.61
               neutral        0.62       0.60    0.61
               verifiability  0.66       0.73    0.70
               writing        0.72       0.54    0.62
               macro avg      0.66       0.62    0.63
BERT           dialogue       0.81       0.80    0.80
               neutral        0.67       0.79    0.73
               verifiability  0.82       0.72    0.77
               writing        0.80       0.75    0.77
               macro avg      0.78       0.77    0.77
RoBERTa        dialogue       0.84       0.81    0.82
               neutral        0.78       0.69    0.74
               verifiability  0.77       0.85    0.81
               writing        0.78       0.81    0.80
               macro avg      0.80       0.79    0.79
Prompting+T5   dialogue       0.80       0.76    0.78
               neutral        0.72       0.69    0.71
               verifiability  0.74       0.82    0.78
               writing        0.79       0.66    0.72
               macro avg      0.76       0.73    0.74

Table 10: Error categories in the fine-tuned BERT discourse act classifier. Dark red highlighted words are more associated with the predicted class.

Error category      Label            Prediction      Text
Mislabeled          understanding    questioning     -
                    evidence         social-act      -
                    understanding    evidence        -
                    understanding    evidence        -
Multi-label         recommendation   understanding   -
                    recommendation   understanding   -
                    evidence         understanding   -
Punctuation         recommendation   understanding   -
                    recommendation   understanding   -
Contextless words   recommendation   understanding   -

Argumentative relation We also take a closer look at the misclassified records in the argumentative relation test data to find the reasons behind the results. We randomly choose 10% of the misclassified texts from the test data and analyze the main categories of common errors; Table 11 shows examples of each category. We found that 27% of the explored samples discuss both sides' points of view, separated by conjunctions such as although, whereas, however, or but. This confuses the model when deciding between Support and Attack. Our analysis shows that 19% of the cases are mislabeled and the model actually predicted the class correctly. For example, as shown in Table 11, the text "there is not nearly enough information specific to each model of phone, yet adding more information to this page would be too much" is predicted as Attack, but the label is Support. It turned out that in 12% of the cases, the model made decisions based on individual words regardless of context. For instance, the neutral utterance "I have reverted the move given the obvious controversy it would generate ..." is predicted as Attack based on the presence of the word "controversy". Additionally, 12% of the explored samples are short texts, and classifying them requires more context from previous utterances in the discussion. For instance, for the short text "as per above arguments.", the model predicted Support, whereas the label is Attack. However, after we removed short texts with fewer than 10 words, the overall performance decreased by 6% (see Appendix A). Furthermore, in 8% of the examples the model made mistakes on very long texts in which different topics are discussed. Lastly, 4% of the errors are attributed to close confidence scores for the predicted classes. For example, for the utterance "needs the name changing to bring it in line with the naming convention!! Done", the computed score for Attack is -0.0215 and the score for Support is 0.0117. We also encountered certain cases that were difficult to categorize even for human annotators.

Table 11: Error categories in the fine-tuned BERT argumentative relation classifier. Highlighted words in dark red are more associated with the predicted class.

Error category      Label    Prediction  Text
Mislabeled          Support  Attack      "there is not nearly enough information specific to each model of phone, yet adding more information to this page would be too much"
                    Support  Attack      -
Comparison words    Support  Attack      -
                    Attack   Support     -
                    Support  Attack      -
Short texts         Attack   Support     -
                    Support  Attack      -
Contextless words   Neutral  Attack      -

Frame Due to the large amount of labeled data, the frame classifier performs better than the other two classifiers (see Table 9). Nevertheless, we investigate the incorrectly classified records to find the reasons for common errors in this dimension. Analyzing text data with respect to framing theory is much more sophisticated and time-consuming than for discourse acts and argumentative relations. During the analysis, we struggle with the different aspects of long texts and the overlapping concepts of the frame attribute classes. By randomly examining 5% of the misclassified utterances, we discover that 45% of the errors stem from the subjective nature of the texts: these texts are either lengthy and cover multiple contexts, or contain multiple sentences and could be considered multi-label texts. In 24% of the examined samples, the model classified texts based solely on the presence of certain discriminating words, without considering the context in which they were used. For example, texts that provide a resource, reference, or citation are labeled as "verifiability". However, in some cases editors use words such as "source", "reference", or "citation" in different contexts without actually providing a source or reference; the model nevertheless classified these texts as "verifiability" because of those words. In some other cases, the model predicts close confidence scores, which suggests the subjectivity of the texts. Our analysis indicates that 14% of the examined texts are mislabeled; in these cases, the model correctly identifies the discriminating words associated with a class, but the texts are labeled as another class. For the remaining misclassified samples, the model struggles primarily due to a lack of context: these samples are either short texts (17%) or not clear enough to be assigned to a class (14%).