

5.1.1 Experimental Setup

In this section, we describe the experimental setup of the models we developed for the argumentative attribute classification step. We explain the data pre-processing step as well as the experimental settings of our baseline and of the three classifiers developed with the fine-tuning and prompting approaches.

Data splitting and preprocessing

As discussed in chapter 4, we use the Webis-WikiDebate-18 corpus [3] to develop and train the classifiers. Table 6 shows the amount of labeled data available after data cleaning. The table shows that the discourse act and relation subsets contain considerably less labeled data than the frame subset. Given the limited amount of labeled data for the discourse act and argumentative relation attributes, we divide each dataset into three subsets in order to properly evaluate these attributes. Specifically, we use 60% of the data for training, 20% for validation, and the remaining 20% for testing. This allows us to train the classifiers on a representative sample of the data and to use the validation data to tune the models' parameters, while reserving the test data for an unbiased evaluation of model performance. Since the data is imbalanced, we always stratify the splits by label, preserving the class distribution across the training, validation, and test sets.

We lowercase all texts and remove some noisy texts from the data. As detailed in Section 5.1.2, an error analysis was conducted to identify common errors made by the models during the experiments. The analysis revealed that some errors were caused by punctuation and by extremely short utterances (fewer than 10 words). To address this, we removed these elements and repeated the experiments. However, this modification decreased the performance of the models, as shown by the results of the repeated experiments (see Appendix A).
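
The split described above can be reproduced with a two-step stratified split, for example in Scikit-learn. The sketch below is illustrative only; `texts` and `labels` are hypothetical placeholders for the cleaned utterances and their labels.

from sklearn.model_selection import train_test_split

# Illustrative sketch of the 60/20/20 stratified split described above.
X_train, X_rest, y_train, y_rest = train_test_split(
    texts, labels, test_size=0.4, stratify=labels, random_state=42)

# Split the remaining 40% evenly into validation and test sets,
# again stratifying so the class distribution is preserved.
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, stratify=y_rest, random_state=42)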

Table 6: Data splits for all datasets

Dimension       Label            Train   Validation    Test

Discourse Act   evidence           293           98      98
                finalization       336          112     112
                questioning         56           19      19
                recommendation      71           24      24
                social-act          45           14      14
                understanding      307          102     102
                Total             1108          369     370

Relation        attack            1562          521     521
                neutral           1162          387     387
                support           1737          579     580
                Total             4461         1487    1488

Frame           dialogue         29832         5966    3978
                neutral          60772        12155    8103
                verifiability    75220        15044   10029
                writing          17479         3496    2331
                Total           183303        36661   24441

Baseline

For the baseline, we use TF-IDF vectorization to extract features from the data and build a Support Vector Machine (SVM) classifier on top of them. We use the TF-IDF vectorizer with a vocabulary limit of 10,000 features, capturing uni-grams and bi-grams. We implement the model with the Scikit-learn library, keeping the regularization parameter at its default value (C=1) and setting kernel='linear' and gamma='auto'.
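
A minimal sketch of this baseline pipeline in Scikit-learn follows; the split variables (X_train, y_train, X_test) are assumed to come from the data-splitting step above.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline

# Sketch of the TF-IDF + SVM baseline described above.
baseline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=10000, ngram_range=(1, 2))),
    ("svm", SVC(C=1.0, kernel="linear", gamma="auto")),
])

baseline.fit(X_train, y_train)
predictions = baseline.predict(X_test)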

Fine-tuning PLMs

We need to build three different models to classify three argumentative attributes of discussion texts.

We fine-tune the pre-trained "bert-base-uncased"13 model from the transformers library [82], hosted on the Hugging Face model hub. We use the BertForSequenceClassification class, which adds a sequence classification head on top of the encoder. Using the PyTorch14 integration, we build three classifiers to classify discourse act (6 labels), argumentative relation (3 labels), and frame (4 labels). We experimented with various hyperparameter settings and found that the following configuration yielded the best results: we train the models on the Webis-WikiDebate-18 corpus [3] for 4 epochs, with a learning rate of 5e-5, a batch size of 4, and a weight decay of 0.01.

13https://huggingface.co/bert-base-uncased

14https://pytorch.org/

In addition, we fine-tune the RoBERTa model with similar settings to compare its performance with BERT. The RobertaForSequenceClassification class from the transformers library is used to download and fine-tune the "roberta-base"15 model. The learning rate is set to 2e-2, and the remaining hyperparameters are the same as for the fine-tuned BERT model.
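
One way to realize this fine-tuning configuration is sketched below using the transformers Trainer API; this is an illustration, not necessarily the exact training loop used here. The objects train_dataset and val_dataset are assumed to be tokenized datasets built from the splits above, and swapping the model name to "roberta-base" gives the RoBERTa variant.

from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Sketch for the discourse act classifier (6 labels); the tokenizer is used to
# build train_dataset and val_dataset (construction omitted).
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)

training_args = TrainingArguments(
    output_dir="discourse-act-bert",
    num_train_epochs=4,
    learning_rate=5e-5,
    per_device_train_batch_size=4,
    weight_decay=0.01,
)

trainer = Trainer(model=model, args=training_args,
                  train_dataset=train_dataset, eval_dataset=val_dataset)
trainer.train()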

Prompt-based learning

In this section, we explain how we create our three classifiers using prompt-based learning techniques and the T5 PLM [75].

T5 T5 is an encoder-decoder PLM that treats all NLP tasks in a text-to-text format. Its implementation configuration closely follows the original Transformer [83]. Training follows the teacher-forcing style: a sequence of input tokens is passed to the encoder via input_ids, while the target sequence, prepended with a start-of-sequence token, is passed to the decoder via decoder_input_ids. Some special tokens are used during training: the "pad" token serves as the start-of-sequence token, and the "eos" token is appended to the target sequence so that it matches the labels. In an unsupervised training setting, sentinel tokens are used to mask some tokens of the input sequence. Sentinel tokens range from "extra_id_0", "extra_id_1", ... up to "extra_id_99"16.
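
The sketch below illustrates this text-to-text setup with the Hugging Face transformers implementation of T5; the example sentence and masked span are hypothetical.

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

# Encoder input: a span of the original text is masked with the first sentinel token.
input_ids = tokenizer(
    "The discussion provides <extra_id_0> for this claim.", return_tensors="pt"
).input_ids

# Target: the sentinel token followed by the masked span; the tokenizer appends "eos".
labels = tokenizer("<extra_id_0> strong evidence", return_tensors="pt").input_ids

# Teacher forcing: the model internally shifts `labels` to the right and prepends
# the "pad" token to build decoder_input_ids.
loss = model(input_ids=input_ids, labels=labels).loss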

OpenPrompt We implement our prompt-based models with OpenPrompt17 [84], an easy-to-use and research-friendly toolkit that lets users integrate several PLMs into an efficient, modular, and extensible framework for building prompt-based learning pipelines18. It supports loading transformer-based models directly from the Hugging Face hub. Creating and training a model with OpenPrompt is relatively straightforward thanks to the extensive documentation19 and coding tutorials20 provided by the developers. The first step is to define the task by determining the classes and converting the input text data into the InputExample format. The second step is to obtain a PLM from the Hugging Face hub using get_model_class; we load the "T5-base" model from the transformers library. Next, a crucial step in prompt-based learning is to define a "Template" for the prompt. Below is the template we defined for the discourse act classifier. The variable "text_a" will be replaced by the input text of an InputExample, and the "mask" will be replaced by the predicted label after training:

template text = {"placeholder":"text a"}; what is the discourse act of the text?

{"mask"}.

Defining a verbalizer is the fourth step. This step is especially important for classification tasks: it maps the logits over the vocabulary to the final label probabilities. For the discourse act classifier, we map the encoded dataset labels (digits 0-5) to their corresponding actual labels, as follows:

myverbalizer = ManualVerbalizer(tokenizer, num_classes=6,
    label_words=[["evidence"], ["finalization"], ["questioning"],
                 ["recommendation"], ["social-act"], ["understanding"]])

15https://huggingface.co/roberta-base

16T5 documentation: https://huggingface.co/docs/transformers/model doc/t5

17https://github.com/thunlp/OpenPrompt

18OpenPrompt won the ACL 2022 Best Demo Paper Award.

19https://thunlp.github.io/OpenPrompt/index.html

20https://github.com/thunlp/OpenPrompt/tree/main/tutorial

["understanding"]])

In the fifth step, we combine the PLM, the template, and the verbalizer created in the previous steps into a prompt model, using the PromptForClassification class:

prompt_model = PromptForClassification(plm=plm, template=mytemplate, verbalizer=myverbalizer)

In the next step, we define a data loader using the PromptDataLoader class. We set the maximum sequence length to 512 and the batch size to 4. It is important to note that although we use the T5 PLM for classification, we do not use its decoder to generate full output sequences; we only pass the "pad", "extra_id_0", and "eos" tokens to the decoder, so we set the decoder's maximum length to 3 to save memory. Finally, we train the model in the standard way, using cross-entropy loss as the loss function and the Adam optimizer with a learning rate of 1e-4, training over the Webis-WikiDebate-18 corpus for 4 epochs.
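
The data loader and training loop can be sketched as follows; this is an illustrative outline under the assumption that train_examples, mytemplate, tokenizer, WrapperClass, and prompt_model come from the earlier steps.

import torch
from openprompt import PromptDataLoader

train_dataloader = PromptDataLoader(
    dataset=train_examples, template=mytemplate, tokenizer=tokenizer,
    tokenizer_wrapper_class=WrapperClass,
    max_seq_length=512, decoder_max_length=3, batch_size=4, shuffle=True)

loss_func = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(prompt_model.parameters(), lr=1e-4)

prompt_model.train()
for epoch in range(4):
    for batch in train_dataloader:
        logits = prompt_model(batch)           # label logits via the verbalizer
        loss = loss_func(logits, batch["label"])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()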

Evaluation metrics

As shown in Table 6, the training data for all three dimensions is imbalanced. We therefore evaluate our classifiers using macro-averaged F1-score, precision, and recall as evaluation metrics.
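
For reference, the macro-averaged metrics can be computed with Scikit-learn; in this sketch, y_test and predictions are assumed to come from one of the classifiers described above.

from sklearn.metrics import precision_recall_fscore_support

# Macro averaging gives each class equal weight, which is appropriate for the
# imbalanced label distributions in Table 6.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, predictions, average="macro", zero_division=0)

print(f"Macro precision: {precision:.3f}, recall: {recall:.3f}, F1: {f1:.3f}")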