
Identifying Effective Deliberation Strategies in Wikipedia Talk Pages


Master’s Thesis

To fulfill the requirements for the degree of Master of Science in Information Science at the University of Groningen under the supervision of

Prof. dr. K.I.J. Al Khatib and

Prof. dr. Malvina Nissim

Behrooz Nikandish (s5035112)

January 27, 2023


Contents

Page

Acknowledgements 5

Abstract 6

1 Introduction 7

2 Background Literature 9

2.1 Background . . . 9

2.1.1 Deliberation Dialogues . . . 9

2.1.2 Deliberation in Wikipedia . . . 9

2.1.3 Wikipedia Talk Pages . . . 10

2.1.4 Computational Argumentation . . . 11

2.2 Related Works . . . 12

2.2.1 Computational Argumentation . . . 13

2.2.2 Wikipedia Talk Pages Analysis . . . 14

2.2.3 Conversational Forecasting . . . 16

3 Data 17

3.1 Argumentative Attribute Classification Data . . . 17

3.2 Discussion Utterance Analysis Data . . . 17

3.3 Data Comparison . . . 21

4 Methods 26

4.1 Argumentative Attribute Classification . . . 26

4.2 Discussion Utterance Classification . . . 30

4.3 Deliberative Strategies Analysis . . . 30

4.4 Binary Classification . . . 30

5 Experiments and Evaluation 32

5.1 Argumentative Attribute Classification . . . 32

5.1.1 Experimental Setup . . . 32

5.1.2 Results . . . 35

5.2 Discussion Utterance Classification . . . 41

5.2.1 Experimental Setup . . . 42

5.2.2 Results . . . 42

5.3 Deliberative Strategies Analysis . . . 47

5.3.1 Experimental Setup . . . 47

5.3.2 Results . . . 48

5.4 Binary Classification . . . 57

5.4.1 Experimental Setup . . . 57

5.4.2 Results . . . 58

6 Conclusion 60

Bibliography 62


Appendices 69


Acknowledgments

I extend my gratitude to my thesis advisor, Dr. Khalid Al-Khatib from the Faculty of Arts at the University of Groningen. His support and guidance were invaluable, and his office door was always open for any questions or concerns I had about my research and writing. He provided direction when needed, while still allowing me to take ownership of the work.

I also want to express my gratitude to Prof. Malvina Nissim of the Faculty of Arts at the University of Groningen for her role as my second supervisor. I am deeply thankful for her insightful and valuable feedback on this thesis.

Last but not least, my deepest appreciation goes to my mother, for her life-long support; my wife, for her unfailing support and encouragement throughout my studies and the process of researching and writing this thesis; and my lovely 4-year-old daughter, Janan, for her patience and understanding during the times I was away from home. This achievement would not have been possible without their love and support. Thank you.


Abstract

This thesis investigates how to distinguish, in an interpretable fashion, successful from unsuccessful strategies in online deliberative discussions. We particularly examine the role of three argumentative attributes, discourse act, argumentative relation, and frame, in the success of online discussions. In particular, we train three classifiers, one for each of the three attributes, using the Webis-WikiDebate-18 corpus, which is annotated with the three argumentative attributes. The classifiers employ fine-tuning and prompting paradigms, and use the BERT, RoBERTa, and T5 pre-trained language models. We achieve an F1-score of 0.68 on the discourse act, 0.64 on the argumentative relation, and 0.79 on the frame attribute. To analyze successful deliberation strategies, the three classifiers are applied to an analysis dataset, called "RfC-Predecessor", that contains topic-controlled pairs of successful and unsuccessful discussions. In this context, discussions that end in a consensus are considered successful, while those where participants are unable to reach an agreement are deemed unsuccessful. Both the classification and analysis corpora come from Wikipedia talk pages and have similar themes and tone. The main difference is that the classification corpus is made up of individual utterances, while the analysis corpus is made up of entire discussions. Our experiments demonstrate several effective strategies and show that recognizing the argumentative attributes of a discussion can aid in forecasting its success.


1 Introduction

Information and communication technologies are evolving and enabling citizens to connect for various purposes. These technologies offer new opportunities for people to participate in decision-making debates, also known as deliberative discussions. In argumentation theory, deliberation is a form of collaborative dialogue in which participants work together to direct group members' moves toward a shared goal [1] by agreeing on a solution that can address an issue affecting all parties while taking into account their interests. In other words, deliberation is an effective method for combining and utilizing the knowledge of various stakeholders who are crucial to solving a specific problem where making a decision and taking action is essential. Deliberations can lead to a range of beneficial outcomes, including resolving disputes, reaching agreements, addressing issues, and reinforcing shared beliefs [2].

People participate in online deliberative discussions to communicate their ideas and opinions about societal, political, or scientific topics. Analyzing the success and failure of deliberations can assist in developing strategies that lead to more successful deliberations. A deliberative argumentation strategy is a sequence of moves that participants take to push the discussion forward [3]. Traditionally, researchers used manual content coding techniques to evaluate and analyze deliberative discussions [4, 5]. Although discussion analysis facilitates the understanding of substantial amounts of data, manually extracting the structure of a deliberative discussion requires expertise and a lot of time. Various efforts have been made to speed up these manual approaches. Still, it is impossible to keep pace with the speed at which data is being produced across the expanding social media and debate platforms.

Consequently, the automatic analysis of deliberative discussions, which involves automatically identifying the structure of conversations, is receiving increasing attention from researchers [6, 7].

Many online discussion platforms are expanding to provide environments for people to exchange arguments and achieve consensus. Wikipedia is a free online encyclopedia that provides a collaborative environment for writing and editing articles on many topics together. It is not surprising that different opinions on a particular subject conflict in such environments. Thousands of editors try to resolve all prospective disputes through online deliberative settings called talk pages, where the community can debate different opinions, defend their positions, and finally achieve consensus. Studies have shown that this kind of interaction on Wikipedia overcomes several problems of traditional deliberations [8].

In this thesis, we aim to investigate how to analyze the argumentation strategies of Wikipedia discussions. The analysis can be used to enhance ongoing discussions: for example, moves that are likely to lead to failure can be avoided, and steps that are known to be beneficial can be advocated. More precisely, by analyzing successful and unsuccessful argumentation strategies, we want to shed light on the following research questions:

Q1. How can we identify deliberative strategies in discussions using argumentative attributes?

Q2. How can we analyze the identified strategies, distinguishing successful from unsuccessful ones?

Q3. How can we predict the success of a discussion based on the strategies used?

To investigate the research questions, we need a dataset of successful and unsuccessful discussion pairs, in which the paired conversations discuss the same subject. In this context, deliberations that result in consensus are considered successful, while those where participants cannot reach agreement are seen as unsuccessful. Wikipedia talk pages offer a unique insight into deliberative discussions, where thousands of participants with different points of view discuss a particular issue. Many previous efforts have utilized talk page data [9, 10, 11, 12, 13, 14, 15] to investigate different conversational phenomena, including conflicts and disputes [16, 17, 2]. However, many of the existing corpora capture only the final snapshot of the conversations and skip those utterances that have been modified or removed. In 2018, Hua et al. [9] presented the WikiConv corpus, which contains the entire history of communications between editors in Wikipedia talk pages. Many studies have utilized this wealth of deliberative discussions to study conversational properties [18, 19, 20, 21] or to develop a toolkit [22] for studying such conversations. More recently, Schmidt (2022) [23] compiled a topic-controlled dataset called 'RfC-Predecessor pairs' that contains 10,750 utterances in 421 pairs of successful and unsuccessful discussions extracted from the WikiConv corpus [9].

Taking into account three well-known theories, speech act theory [24], argumentation theory [25], and framing theory [26], Al-Khatib et al. (2018) argue that each utterance of a discussion can ideally be categorized along each of these dimensions [3]. They created a new large-scale corpus, Webis-WikiDebate-18, from many discussions by looking at the various forms of metadata of Wikipedia talk pages. They believe that by considering these three crucial argumentative attributes, it is possible to forecast the best course of action in an ongoing discussion [3]. In this study, we aim to investigate the role of these argumentative attributes in identifying effective deliberative strategies in Wikipedia discussions.

We conduct this study in four steps. First, in the Argumentative attribute classification step, we employ pre-trained language models (PLMs) in fine-tuning and prompting paradigms, based on the Webis-WikiDebate-18 corpus, to build three classifiers for the argumentative attributes. In the second step, Discussion utterance classification, we apply the classifiers developed in the previous step to classify the discussion utterances in the RfC-Predecessor pairs corpus [23]. The next phase is Deliberative strategies analysis, where we analyze frequent sequential patterns of argumentative attributes to identify effective deliberation strategies. Finally, in the Binary classification step, we train a binary classifier over the classified discussion pairs to predict the success of discussions, given their strategies.

We analyze the deliberative strategies in three dimensions: discourse act, argumentative relation, and frame. Our analysis of the discourse act patterns shows that a conversation is more likely to fail if an utterance classified as questioning is not followed by another utterance classified as recommendation or finalization. As for the argumentative relation, our research suggests that every utterance with the attack class should be followed by an utterance classified as support or neutral, since a second attacking utterance is never observed in the most frequent successful patterns. In the frame dimension, the dialogue and verifiability classes play the core role in successful and unsuccessful patterns, respectively.

The rest of this thesis is structured as follows. In chapter 2, we first discuss the background of the work, and then give a thorough overview of prior work in the fields of computational argumentation, Wikipedia talk page analysis, and conversational forecasting. Chapter 3 provides an overview and a comparison of the Webis-WikiDebate-18 [3] and RfC-Predecessor pairs [23] corpora, which form the foundation of this study. In chapter 4, we discuss the method of the research, explaining the details of the four steps of this study. Chapter 5 provides the experimental setup, results, and analysis of the findings for each step. Finally, chapter 6 formulates responses to the research questions asked in this chapter; additionally, we address the limitations of the work and make suggestions for further research directions in this area.


2 Background Literature

Understanding argumentation structure is essential to comprehending why individuals hold particular opinions, offering insightful information in fields such as financial market forecasting and public relations [27]. Interesting applications of computational argumentation, such as argument search [28], fact checking [29], argument summarization [30], and writing support [31], have drawn the attention of the computational linguistics community to this area of research. Concurrently, online platforms for argumentation have seen rapid growth in recent years, which necessitates the use of automatic methods to handle such large and complex data. The field of argument analysis addresses this problem by converting unstructured text into organized argumentation data and uncovering the connections between them in support of the primary idea.

In this chapter, we first delve into the concepts of deliberation dialogues, online deliberation on Wikipedia talk pages, and computational argumentation. In the subsequent section, a thorough review of previous research in the areas of computational argumentation, analysis of Wikipedia talk pages, and conversational forecasting is presented.

2.1 Background

2.1.1 Deliberation Dialogues

Argumentation theorists categorise human dialogues using an important typology based on the discussion's goals, the participants' goals, which sometimes conflict, and the information that each participant possessed at the outset of the debate [32]. This classification results in six main categories of dialogue: Information-seeking Dialogues, Inquiry Dialogues, Persuasion Dialogues, Negotiation Dialogues, Deliberation Dialogues, and Eristic Dialogues [32]. Deliberation is a kind of dialogue that aims to decide the best accessible course of action with all participants' support. Researchers attribute several characteristics to deliberation dialogues [33]. First, the focus of deliberation distinguishes deliberations from inquiry and information-seeking dialogues, where a participant or all participants seek the true answer to their questions. The second characteristic of deliberation dialogues is the lack of a predetermined stance towards the main topic of discussion. Participants may enter the discussion with their own positions, but the ultimate objective is to arrive at a collective decision. This differs from persuasion dialogues, where one participant aims to convince the others to adopt their viewpoint. Mutual focus is the third characteristic of deliberations, which is attributed to the fact that the actions proposed by participants are based on their different standards and criteria, not on personal interests that impact the final decision. This characteristic distinguishes deliberations from negotiation dialogues, where agents are unwilling to share information or preferences.

In social science, deliberation is viewed as a genuine option for making policies and may fulfill the public need for democratic innovation. A particular sort of involvement known as "deliberative democracy" is characterized by individuals conversing intelligently about problems that matter to them [34]. Deliberation is viewed as a crucial part of democratic decision-making and is thought to result in improved outcomes by enabling individuals to weigh different perspectives, acquire knowledge from a varied group of individuals, and enhance their own analytical and critical thinking abilities [35].

2.1.2 Deliberation in Wikipedia

The rise of social media has greatly impacted the digitalization of various aspects of modern society, including argumentation and deliberation. The use of social media platforms has become prevalent for discussing and debating various topics, leading to an increase in online discourse and decision-making. There are many online platforms where users can argue about a specific topic and express their ideas and opinions to achieve consensus and make a decision. The most well-known example of a successful, extensive wiki-based website is Wikipedia. Wikipedia is a multilingual, free online encyclopedia that supports the idea that any user should be allowed to create, edit, and modify the site's content. In this virtual environment, information is produced and revised by thousands of people worldwide. This wealth of information is written and maintained by a community of roughly 44.4 million users and nearly 123 thousand active editors1 in an open collaborative editing system. As of now, the English Wikipedia alone comprises over 6.5 million articles, and when combined with the Wikipedias for all other languages, the number exceeds 55 million articles in 309 languages2, totaling more than 29 billion words. This highlights the vastness of Wikipedia and the wealth of knowledge it contains in multiple languages. According to the Wikimedia Statistics3, it attracts around 2 billion unique device visits and over 14 million edits per month. The vast amount of text data available in Wikipedia has garnered attention from the natural language processing (NLP) community, as it provides an opportunity to process and construct large corpora for various downstream tasks. This includes, but is not limited to, question answering [36], hate-speech detection [37], information retrieval [38], named entity recognition [14], and other monolingual and multilingual corpora [11, 13, 39, 40].

Klemp and Forcehimes (2010) argue that the way people communicate on Wikipedia eliminates many of the issues that affect in-person discussions and, to some extent, fulfills the cognitive and procedural goals of deliberation [8]. However, they believe there are some differences between ideal deliberations and Wikipedia discussions. Collaborative editing on Wikipedia aligns with certain aspects of the deliberative ideal: it is geared towards a shared goal and operates in a similar manner, and Wikipedia's decision-making process is guided by the norm of consensus. However, Wikipedia deviates from the ideal deliberation in two ways. Firstly, it shifts from face-to-face communication to more anonymous forms of online engagement. Secondly, it prioritizes collaborative editing over reason-giving as the primary means of information dissemination [8].

2.1.3 Wikipedia Talk Pages

While collaborative editing predominates on Wikipedia, other, more deliberative forms of reason-giving exist on Wikipedia talk pages. Many wikis provide prospective participants with a discussion forum called talk pages, where they can interact with other users and discuss the current writing process. The ultimate objective of these discussions is to agree on improving the content of a particular page. Talk pages on Wikipedia serve the same purpose. Every page in Wikipedia can have a corresponding talk page, which is accessible from a separate tab named Talk. On talk pages, editors discuss changes they have made or proposed, suggest new content, debate disagreements over edits, or challenge cited sources, in order to reach a consensus on the accuracy of an article. Users can start a new thread or reply to an existing thread to participate in such discussions. Editors are encouraged to use indentation when responding to a current conversation, and each subsequent reply has a deeper level of indentation.

Depending on the level of the reply, the wiki markup language requires one or more colons at the beginning of the line. However, participants can directly change any section of the talk page's markup without explicitly utilizing the features designed to organize the discussions. Figure 1 shows the markup code and the rendered results for indentation at different depth levels4.

1https://en.wikipedia.org/wiki/Wikipedia:Wikipedians

2https://en.wikipedia.org/wiki/Wikipedia:Size_comparisons#Footnote_on_Wikipedia_statistics

3https://stats.wikimedia.org/#/all-wikipedia-projects

Figure 1: Indentation in Wikipedia talk pages; markup code (left) and the rendered results (right)

Request for comments While the ultimate goal of all discussions is to reach a consensus that is supported by all parties involved, it is common for such conversations to fail to achieve this objective. Discussions on Wikipedia talk pages are no exception. To handle such disputes, there is a mechanism called Request for Comments (RfC), one of many procedures offered by Wikipedia's dispute resolution system. It employs a system of central noticeboards and random invitations sent by a bot to publicize discussions to editors who are not involved. In this mechanism, when editors cannot reach a consensus, they may request assistance from the larger Wikipedia community by publicizing their deliberation. Any editor who participated in the discussion that appears to have failed can establish the RfC. They can wrap up the conversation and initiate a new one using a predefined template, which consists of a {{rfc}} tag (see figure 2). This process makes the conversation public among all other ongoing RfCs, all of which are listed on Wikipedia5. Editors can frequently visit these pages if interested in responding to RfCs. Thus, neutral editors with no role in the previous failed discussion can enter the new one and address the issue from their point of view. When initiating an RfC, the creator can add categories to the tag to target editors in a specific field. Figure 3 shows an example of establishing an RfC that targets the History community by incorporating hist in the tag.

Figure 2: Screenshot of an RfC established using the {{rfc}} template tag

Figure 3: An example of creating an RfC discussion that targets the History community of editors.

2.1.4 Computational Argumentation

Computational argumentation is a relatively new area of study in natural language processing, which focuses on identifying and analyzing arguments in texts written in natural language. The field has experienced substantial growth and interest within the NLP community, with novel developments that enable creative applications in decision assistance, information retrieval, assisted reasoning, debate technologies, writing support, and argument search. Argument mining is the main focus of current computational argumentation methods. The goal of argument mining is to find arguments in natural language texts, as well as their components and relations. Computational approaches in argument mining are categorized into identification, classification, and relation sub-tasks. While the identification sub-task focuses on finding and separating argumentative documents and identifying components of arguments, the classification sub-task analyzes the function of those components. In other words, the classification task attempts to identify the different types of argument components that are present in the argumentative setting. In contrast, the relation sub-task, as the name suggests, concentrates on finding argumentative relations [41].

4The image is from https://en.wikipedia.org/wiki/Help:Talk_pages

5https://en.wikipedia.org/wiki/Wikipedia:Requests_for_comment/All

2.2 Related Works

The proliferation of social media, digital forums, and other online platforms has led to a significant increase in the amount of argumentative text available for study by researchers. Previous efforts have been made to collect or analyze such large amounts of argumentative text data from different social media such as Reddit [42, 43, 44], Twitter [45, 46], and Wikipedia talk pages [9, 10, 11, 12, 13, 14, 15]. In this section, we review prior studies in the fields of computational argumentation, analysis of Wikipedia talk pages, and conversational forecasting.


2.2.1 Computational Argumentation

As discussed in the previous section, computational argument mining tasks are categorised into three sub-tasks, namely identification, classification, and relation. In this study, we mainly focus on the classification and relation tasks to identify different components of arguments and their relations. Previous efforts have been made to classify such components in various contexts.

Kwon et al. (2007) identified concluding phrases that reveal the author's viewpoint on the major idea of online comments and categorized them into support, oppose, and propose classes in two steps. The first step is sentence-level claim identification, defined as a binary classification task over all sentences in a document. They used many syntactic and semantic features, including unigrams, bigrams, positive and negative words from the General Inquirer, and sentence or paragraph position. They applied a boosting algorithm [47] as a supervised machine learning method and achieved an F1 score of 0.55. In the second step, they classified the texts into the support, oppose, or propose classes using the same features, with an F1 score of 0.67.

Park and Cardie (2014) [48] developed a framework for the automatic classification of propositions into three categories: Unverifiable, Verifiable Non-Experiential, or Verifiable Experiential, with proper support types of reason, evidence, and optional, respectively. Identifying the sort of support is beneficial when evaluating the argument's persuasiveness and the supporting evidence. They constructed a gold-standard dataset of 9,476 phrases and clauses from the eRulemaking platform. They found that, in comparison to a unigram baseline, Support Vector Machine (SVM) classifiers trained using n-grams and extra features representing verifiability and experientiality show substantial improvement, obtaining a macro-averaged F1-score of 68.99%.

In 2018, Park and Cardie [49] presented the Cornell eRulemaking Corpus (CDCP), annotated using the five types of elementary units and the support relations defined in the model of [50]. The elementary unit types are fact, testimony, value, policy, and reference, with the associated support relations of reason and evidence. The dataset has 1,221 support relation annotations and 4,931 elementary unit annotations. Niculae (2018) [51] used the dataset to train an SVM classifier for proposition classification and achieved an F1 score of 0.74.

Egawa et al. (2019) proposed a unique annotation strategy to capture the semantic function of arguments in the well-known ChangeMyView online discussion board. They used the same elementary units as the model proposed in [50], with a slight change, replacing reference with rhetorical statement. They also replaced the relations with attack and support. They used this annotation scheme to annotate ChangeMyView, resulting in 4,612 elementary units and 2,713 relations in 345 posts. The annotated data was used to analyze the semantic role of persuasive arguments.

Al-Khatib et al. (2018) [3] proposed a model to automatically generate a large-scale corpus from a large number of discussions by looking at the various forms of metadata of Wikipedia talk pages. They extracted and parsed around six million discussions from English Wikipedia talk pages to identify the structural components, such as utterances, users, and timestamps. They clustered the types' instances based on their semantic similarity; then, each cluster was mapped to a concept (e.g. source) and related concepts into a set of categories (e.g. evidence) [3]. The corpus contains three different categories of utterances: a discourse act is labeled on 2,400 utterances, an argumentative relation is identified on 7,437 utterances, and a frame is tagged on 182,321 utterances. They used the corpus to train three supervised classifiers for discourse acts, relations, and frames, utilizing support vector machines, logistic regression, naive Bayes, and random forests as well-known machine learning models.

The support vector machine model produced the best results, with F1-scores of 0.42, 0.53, and 0.60 for discourse act, argumentative relation, and frame, respectively. It is proposed that the identification of these three crucial argumentative attributes can enable the prediction of the optimal strategy in an ongoing discussion [3]. The corpus analyzes conversations at the level of individual utterances, providing the opportunity to examine and classify our data at this granular level. In this research, we utilize this capability to develop and train classifiers using the labeled data within the corpus to identify these attributes in the analysis data.

2.2.2 Wikipedia Talk Pages Analysis

Many academics exploit the extensive deliberative discussions on Wikipedia talk pages, where thousands of contributors discuss a wide range of issues to modify and produce articles for the largest online encyclopedia. A significant challenge in analyzing and characterizing conversations on Wikipedia talk pages is that the actions taken within a conversation can be deleted or altered by the author or other editors, so the final static snapshot of the conversation is not the original version as it progressed.

Bender et al. (2011) [52] presented the Authority and Alignment in Wikipedia Discussions (AAWD) corpus, which contains 365 annotated discussions from Wikipedia talk pages for two different social acts. The AAWD work, however, fails to provide data on the deleted and modified actions in the existing utterances. Some other attempts to build Wikipedia talk page corpora concentrated on the final state of the conversation threads [53, 54, 55]. The issue here is that, if we base our analysis on the final static dialogue, we might overlook, for instance, some aggressive or violent activities that lead to the conversation's failure but are eliminated in the course of the discussion's development. Thus, the primary cause of the discussion's failure is overlooked.

Wulczyn et al. (2017) [56] presented a technique for gathering longitudinal, large-scale data on personal attacks in Wikipedia discussions. They created a corpus of over 100k human-labeled and 63M machine-labeled comments, and employed logistic regression (LR) and multi-layer perceptrons (MLP) to design their models. One flaw in their approach is that, when analyzing text fragments from the talk page history, they overlooked the conversational structure of turns and replies in the discussions.

Zhang et al. (2018) [19] developed a framework to predict conversational derailment considering the politeness of the actions in Wikipedia talk page discussions. They built a corpus of 50 million discussions across 16 million talk pages. To this end, they translated each talk page's sequence of edits into structured discussions in order to recreate a full picture of the conversational process across English Wikipedia's edit history. The objective of this study is to examine the relationship between the beginning of a civil conversation and its likelihood of turning into personal attacks later on. In our work, we pay more attention to argumentative attributes beyond the toxicity of the utterances, as opposed to this corpus, which primarily focused on toxic behaviors and the level of politeness.

In 2018, Hua et al. presented WikiConv [9], a corpus containing the complete history of conversations among editors on Wikipedia talk pages up to July 2018. The corpus includes not only all the comments and replies, but also captures deletions, restorations, and modifications as conversations develop. The reconstruction pipeline is applied to five languages, English, German, Russian, Chinese, and Greek, with similarly high reconstruction accuracy. The corpus comprises 212 million conversational actions across 90,930,244 conversations in 24 million talk pages. It is publicly available in the format of the Cornell ConvoKit6 and also on Google Cloud Storage7, where it is split into 500 distinct files with an overall size of 750 GB. The pipeline reconstructs conversations by identifying all actions that made changes to discussions up to their final state. There are five different sorts of actions in the conversation reconstruction process:

• Creation: the beginning of a discussion thread, based on the addition of a markup section heading.

• Addition: adding a new utterance to a thread.

• Modification: modifying an existing utterance. In this case, the id of the original utterance is recorded as the parent-id.

• Deletion: the deletion of a heading or utterance, where the parent-id indicates the most recent action taken on the utterance or heading.

• Restoration: the restoration of a deleted action, where the deleted action is designated as the parent-id.

A conversation's text can be recreated for every point in time by carrying out each edit action in chronological order. Our analysis dataset, which was extracted from the WikiConv corpus and is used in this study, is discussed in detail in chapter 3.

6https://convokit.cornell.edu/documentation/wikiconv.html

7https://console.cloud.google.com/storage/browser/wikidetox-wikiconv-public-dataset/dataset/English
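To make the replay step concrete, the following sketch reconstructs the visible utterances of a conversation at a given cut-off time from a list of such actions. It is our illustration rather than part of the WikiConv tooling, and it assumes each action is a dict carrying the type, id, parent_id, timestamp, and content fields described above.

# A sketch of the replay step, not part of the official WikiConv tooling.
# Each action is assumed to be a dict with "type", "id", "parent_id",
# "timestamp", and "content" keys, mirroring the action types above.

def reconstruct(actions, until=None):
    visible = {}   # utterance id -> currently visible text
    history = {}   # action id -> action, so a restoration can find its deletion
    for action in sorted(actions, key=lambda a: a["timestamp"]):
        if until is not None and action["timestamp"] > until:
            break
        history[action["id"]] = action
        kind = action["type"]
        if kind in ("CREATION", "ADDITION"):
            visible[action["id"]] = action["content"]
        elif kind == "MODIFICATION":
            # the parent id points at the original utterance being modified
            visible[action["parent_id"]] = action["content"]
        elif kind == "DELETION":
            visible.pop(action["parent_id"], None)
        elif kind == "RESTORATION":
            # the parent id points at the deletion being undone; we assume
            # the deletion action carries the text that was removed
            deletion = history[action["parent_id"]]
            visible[deletion["parent_id"]] = deletion.get("content", "")
    return list(visible.values())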

Im et al. (2018) [16] compiled an innovative, extensive dataset of 7,316 RfCs from the English Wikipedia, spanning the years 2011 to 2017, and analyzed them to identify closing statements, authors, and reply structure.8 They performed quantitative and qualitative analyses by conducting a manual inspection of the data and interviewing RfC closers to capture their motivations behind closing decisions. They also looked at the main RfC process components that led to resolution failure. According to their research, 57.65% of RfCs are resolved by adding a summary statement that resolves the conflict. On the other hand, one-third of all RfCs in the dataset were left stale without any resolution. They developed a model with 75.3% accuracy to predict whether an RfC will go stale. They found that the size and shape of the debate, along with factors relating to participant interest and level of proficiency, are the most significant indicators of whether an RfC will become stale.

More recently, De Kock and Vlachos (2021) [17] introduced WikiDisputes, a corpus of 99,907 utterances in 7,425 disputed conversations extracted from Wikipedia talk pages to investigate what makes disagreements constructive. A disagreement on a talk page, an edit summary, and an escalation label are the three components of the dataset. Regarding disagreements, using regular expressions, they discovered the insertion of dispute templates in the article history. Then, considering criteria such as the number of participants and the length of conversations and utterances, they extracted and filtered 7,425 conversations from the WikiConv corpus [9]. All edit summaries that were written by participants in the conversation and that were timestamped between the conversation's first and last comments are then included. Finally, the escalation label indicates whether a dispute was escalated to a mediation process. To draw conclusions about linear correlations between language features and dispute outcomes, they created feature-based models using linguistic features like politeness, collaboration, toxicity, and sentiment. They acknowledged that representing dialogue structure with feature-based models is challenging, and developed neural models, including GloVe embeddings [57], a bidirectional LSTM [58], and HAN [59], to capture conversation structure. This study examines disagreements on Wikipedia talk pages and has an ecosystem comparable to ours. The key difference, however, is that they use conventional neural networks, while we utilize state-of-the-art pre-trained language models to understand the structure of conversations.

8https://figshare.com/articles/dataset/rfc_sql/7038575


2.2.3 Conversational Forecasting

Prior research on conversational text has mainly focused on identifying various sorts of anti-social behavior, such as hate speech [60, 61], more specifically aggression and misogyny [62, 63], and personal attacks [56]. Predicting a discussion's outcome by understanding its structure is one area of research in argumentation. Conversational forecasting is a direction in which researchers aim to develop models that capture conversational patterns and forecast the future success or derailment of a discussion as it develops.

The first attempt to apply the pre-training-then-fine-tuning paradigm to conversational forecasting is CRAFT (Conversational Recurrent Architecture for ForecasTing), introduced by Chang and Danescu-Niculescu-Mizil in 2019 [18]. They introduced a forecasting model that processes comments and their relations as the discussion develops, in an online fashion. They employed a pre-training approach like that of BERT [64], but at the conversation level instead of the document level, and used a classification head on top of an encoder to make a binary prediction. They developed and distributed two distinct datasets to assess the performance of their model: the first is an expanded version of 'Conversations Gone Awry' [19], and the second is created from the subreddit ChangeMyView. By selecting pairs of positive and negative instances, they carried out a kind of topic-controlled pairing similar to the RfC-Predecessor dataset [23] that we use in this study (see section 3.2). Their model consists of two components: (a) an unsupervised generative dialogue model that learns to capture conversational dynamics and (b) a supervised component that refines this model to predict future behavior. The model showed state-of-the-art performance on the task of conversational forecasting. The training process was done statically, meaning that the model accepts all utterances before the personal attack or abusive language as input and predicts the next turn. They mainly concentrated on how personal attacks affect the outcome of conversations.

In 2021, Kementchedjhieva et al. [65] applied the BERT pre-trained language encoder to the task with a dynamic training method. Dynamic training maps a conversation of N turns into several training samples, each of which represents a distinct stage of the conversation as it develops but carries the same label. They evaluated their model using the same datasets as the CRAFT model [19]. They discovered that dynamic training leads to earlier predictions; however, it can decrease accuracy and F1 scores depending on the data quality.
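To illustrate the mapping, a conversation of N turns yields N prefix samples that all share the conversation's outcome label. The function below is our sketch of this expansion, not code from [65]; the separator token is an assumption.

# Sketch of dynamic training: a conversation of N turns becomes N samples,
# one per growing prefix, all sharing the conversation-level label.

def expand_dynamic(utterances, label, sep=" [SEP] "):
    return [(sep.join(utterances[:i]), label)
            for i in range(1, len(utterances) + 1)]

# A 3-turn conversation yields 3 training samples with the same label.
samples = expand_dynamic(["I propose X.", "I disagree.", "Why is that?"], 1)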

The previously mentioned models aim to predict if conversations devolve into personal attacks due to impolite or hostile behavior. In contrast, our study examines conversations from a 3-dimensional perspective to understand how they succeed or fail.


3 Data

In this study, we use two corpora, for classification and analysis purposes. We use the Webis-WikiDebate-18 corpus [3] as the training data to create three models that identify three argumentative attributes: discourse act, argumentative relation, and frame. Then we apply the trained models to the analysis data, the RfC-Predecessor dataset [23], to identify the structure of conversations in terms of argumentative attributes. In this chapter, we describe and compare these datasets to gain an in-depth understanding of the data. First, we explore the Webis-WikiDebate-18 corpus [3], our classification data. Then, we describe the RfC-Predecessor corpus [23], which we utilize to identify deliberative strategies. Finally, we compare the two corpora to find their differences and similarities.

3.1 Argumentative Attribute Classification Data

As discussed in section 2.2.1, Al-Khatib et al. (2018) [3] used metadata from Wikipedia talk pages to cluster instances and map each cluster to a specific concept. Finally, they categorized related concepts into a set of categories. Table 1 shows the details of these concepts and categories. Considering three well-known theories, speech act theory [24], argumentation theory [25], and framing theory [26], they analyzed the distribution of the categories and argue that each utterance of a discussion should ideally have a discourse act category, an argumentative relation category, and a frame category. Based on this model, they created the new large-scale Webis-WikiDebate-18 corpus by automatically leveraging the metadata. The corpus consists of around 200,000 labeled utterances across 13 different categories, including 182,321 utterances labeled with a frame, 7,437 utterances labeled with a relation, and 2,400 utterances labeled with a discourse act. Table 2 shows the distribution of the classes in each dimension.

One drawback of this corpus is that each utterance has just one category per dimension; multi-labeling is not taken into account. For instance, an utterance classified as "recommendation" in the "discourse act" dimension may also qualify as "finalization".

3.2 Discussion Utterance Analysis Data

Each utterance in a discussion is an argumentative turn, and a sequence of these turns can be considered an argumentation strategy in a deliberation [3]. A controlled dataset that is focused on specific topics is essential to study and analyze successful and unsuccessful strategies in deliberative discussions. Schmidt (2022) [23] compiled a dataset containing 421 pairs of successful and failed discussions from Wikipedia talk pages. To collect these pairs of discussions, which address the same topic, they used the Request for Comments process in Wikipedia. The established method for creating the corpus involves gathering all RfC conversations from the English Wikipedia as successful discussions and pairing each of them with its unsuccessful predecessor discussion, thereby controlling the topics of successful and failed deliberations. A discussion started on the RfC noticeboard deals with a problem that was not resolved in a prior discussion. A significant portion of discussions that are submitted on the RfC noticeboard end with a formal resolution, and these can be considered successful discussions. Furthermore, it is possible to identify unsuccessful discussions using the context that the RfC process provides. The RfC process is meant to resolve conflicts among editors who were unable to reach a consensus in their initial discussion. In such cases, an editor initiates a new discussion by outlining the unresolved issue and inviting other neutral editors to participate. The failure to achieve agreement in the initial discussion leads to the initiation of the RfC process. Thus, we can distinguish between successful RfC discussions and their corresponding failed initial discussions, both addressing the same topic. This allows for an unbiased analysis of the differences between successful and failed discussions in this corpus.


Table 1: The concepts that fit under each of the three main categories of the three dimensions in the Webis-WikiDebate-18 corpus [3]

Discourse Act Concepts

Providing evidence: (1) Provide a quote, (2) Reference, (3) Source, (4) Give an example, (5) State a fact, (6) Explain a rationale

Enhancing the understanding: (1) Provide background info, (2) Info on the history of similar discussions, (3) Introduce the topic of discussion, (4) Clarify a misunderstanding, (5) Correct previous own or other's turn, (6) Write a discussion summary, (7) Conduct a survey on participants, (8) Request info

Finalizing the discussion: (1) Report the decision, (2) Commit the decision, (3) Close the discussion

Recommending an act: (1) Propose alternative action on the article, (2) Suggest a new process of discussion, (3) Propose asking a third party

Asking a question: (1) Ask a general question about the topic, (2) Question a proposal or arguments in a turn

Socializing: (1) Thank a user, (2) Apologize to a user, (3) Welcome a user, (4) Express anger

Relation Concepts

Attack: (1) Disagree, (2) Attack, (3) Counter-attack

Neutral: (1) Be neutral

Support: (1) Agree, (2) Support

Frame Concepts

Verifiability and factual accuracy: (1) Reliable sources, (2) Proper citation, (3) Good argument

Neutral point of view: (1) Neutral point of view

Dialogue management: (1) Be bold, (2) Be civil, (3) Don't game the system

Writing quality: (1) Naming articles, (2) Writing content, (3) Formatting, (4) Images, (5) Layout and list


Although it is intriguing to observe how these pairs of successful and unsuccessful discussions approach the same issue differently, it is important to note that there are two biases in the corpus that must be acknowledged and taken into account when drawing conclusions from any study of the data. The first is the awareness of disagreement in the predecessor discussion. The editors who take part in RfC discussions already know that the problem at hand is the result of a previous discussion that ended in a dispute, and they are engaged in resolving this specific issue. This awareness may cause the editors to put more effort into resolving the problem. However, as discussed in section 2.2.2, only 57.65% of RfCs result in a formal resolution [16], so this awareness does not always lead to the success of an RfC discussion. The second bias stems from the diversity and the number of participants in the two groups of discussions. When a small group of editors cannot come to a consensus on the topic of the discussion, they invite a larger group of editors to take part and solve the issue. As a result, the likelihood of success rises because the conversation is enriched by the additional knowledge and skill of the relevant community. Nonetheless, since our research on identifying deliberative strategies is based on the linguistic argumentative attributes of the actions in the conversations, we believe that these biases do not significantly affect the outcomes of our analyses.


Table 2: Class distribution in the three dimensions of the Webis-WikiDebate-18 corpus

Dimension Class Quantity Distribution

Discourse act

evidence 780 32.58%

understanding 670 27.99%

finalization 621 25.94%

recommendation 136 5.68%

questioning 105 4.39%

social-act 82 3.43%

Relation

attack 2604 35.02%

neutral 1936 26.04%

support 2896 38.95%

Frame

verifiability 100395 41.02%

neutral 81129 33.15%

dialogue 39872 16.29%

writing 23341 9.53%


The corpus creation process is detailed in [23] and is briefly summarized here. It is created in two stages: 1) Collecting RfC deliberations, and 2) Compiling RfC-Predecessor pairs.

Collecting RfC deliberations First, all 500 edit action files are downloaded from Google Cloud Storage one by one. For each of them, all actions are checked for the presence of the RfC template using regex patterns, and the ids of talk pages containing the RfC template are recorded. Next, each table is downloaded once more, and using the previously saved list of talk page ids, all of these talk pages' actions are retrieved and stored. Each table must be downloaded twice, which increases the processing time but reduces the amount of memory required for filtering. After this step, 8,606 talk pages containing 17,542 distinct discussions with a total size of 37 GB are filtered as RfC deliberations. Finally, having the conversation ids of those filtered RfC discussions, the latest snapshot of their wiki markup can be reconstructed by chronologically ordering their edit actions. Due to final Deletion actions, this leaves blank text for many of the discussions. Therefore, the number of RfCs drops to 11,554 after sorting out all discussions that are reconstructed to blank text.
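The exact patterns used in [23] are not reproduced here, but the template check could look roughly like the following sketch, where the regex and the action fields are our assumptions.

import re

# Assumed pattern for the {{rfc}} template, optionally with category
# parameters such as {{rfc|hist}}; the regex actually used in [23] may differ.
RFC_TEMPLATE = re.compile(r"\{\{\s*rfc\s*(\|[^}]*)?\}\}", re.IGNORECASE)

def rfc_page_ids(actions):
    # Record the ids of talk pages whose actions contain the RfC template.
    return {action["page_id"]
            for action in actions
            if RFC_TEMPLATE.search(action.get("content") or "")}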

Compiling RfC-Predecessor pairs There is no direct link between an RfC and its predecessor on Wikipedia. To find the predecessor of each remaining RfC, the following three steps have been taken:

• The conversations that took place after the RfC discussion began are dropped from the candidates.

• Then, discussions that do not share an editor with the RfC are sorted out because, by the established RfC method, a participant who becomes stuck in a discussion starts the RfC discussion and is thus present in both the predecessor and the RfC deliberation.

• Finally, a 7-day window between the conclusion of the candidate and the launch of the RfC is considered (a sketch of these filters is given below).
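Assuming each discussion is summarized by its start and end timestamps and its set of participating editors, the three filters above could be sketched as follows (all names here are illustrative, not from [23]).

from datetime import timedelta

# Illustrative filters; each discussion is assumed to expose .start and .end
# timestamps and an .editors set (these names are ours, not from [23]).

def predecessor_candidates(rfc, discussions, window_days=7):
    candidates = []
    for d in discussions:
        if d.start >= rfc.start:                  # took place after the RfC began
            continue
        if not (d.editors & rfc.editors):         # no shared editor with the RfC
            continue
        if rfc.start - d.end > timedelta(days=window_days):
            continue                              # ended too long before the RfC
        candidates.append(d)
    return candidates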

After these steps, the average number of candidates per RfC was reduced to 159. A closer look at the distribution of RfCs and their corresponding candidates indicated that 3,333 RfCs have between one and four candidates, while 1,468 RfCs have only one candidate. Due to several limitations of talk page discussions, such as the lack of a requirement for the conversation to progress chronologically, attempts to identify the ultimate predecessor in the former category (with one to four candidates) failed. The text of a candidate topic is recreated by applying all edits made to that discussion up until the commencement of the RfC. This might result in a blank conversation if the final action is a deletion. After eliminating blank conversations, pairs where the RfC or the candidate text contains fewer than two utterances are disregarded. This leaves 613 RfC-candidate pairs after preprocessing the 1,468 RfCs with precisely one candidate. For a particular pair of RfC and candidate, various features are evaluated for their usefulness in determining if each RfC is accompanied by its real predecessor.

The features are acquired from either metadata (e.g. time and participant overlap) or the conversation content (e.g. bag of words, link overlap, shortcut overlap, keyword overlap, average best matching sentence similarity, and single best matching sentence similarity). Following a deep feature analysis9, 421 discussion pairs are compiled as the final RfC-Predecessor corpus. The schema of the corpus looks like the following:

id:

conversation id:

page title:

indentation:

replyTo id:

content:

cleaned content:

user text:

rev id:

type:

user id:

page id:

timestamp:

parent id:

ancestor id:

Table 3 presents statistics for the final RfC-Predecessor corpus. The corpus includes 10,750 utterances, with 6,025 from RfC discussions and 4,725 from predecessor conversations. RfC discussions tend to be longer than their corresponding predecessor conversations; the average number of utterances in successful RfC discussions is 14.31, while it is 11.22 in predecessor conversations.

9Explained in detail in section 3.3.3 of [23]
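Given the schema above, the statistics in Table 3 can be recomputed along the following lines. This is a sketch: the file name and the underscored column names are our assumptions about how the corpus is stored, and the schema's spaced field names would need to be mapped accordingly.

import pandas as pd

# Sketch of recomputing Table 3 style statistics from the corpus records.
df = pd.read_json("rfc_predecessor.jsonl", lines=True)  # hypothetical file

df["n_tokens"] = df["cleaned_content"].str.split().str.len()
per_conversation = df.groupby("conversation_id").agg(
    utterances=("id", "count"), tokens=("n_tokens", "sum"))

print("Total utterances:", len(df))
print("Avg. utterances per conversation:", per_conversation["utterances"].mean())
print("Avg. tokens per utterance:", df["n_tokens"].mean())
print("Avg. tokens per conversation:", per_conversation["tokens"].mean())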


Table 3: RfC-Predecessor corpus statistics. The length of conversations shows the number of utterances, and the length of utterances indicates the number of words in each utterance.

Feature RfC Predecessor

Total utterances 6025 4725

Average number of utterances per conversation 14.31 11.22

Average number of tokens per utterance 78 88.4

Average number of tokens per conversation 1116.18 991.85

3.3 Data Comparison

To investigate the contextual and topical similarities and differences between the Webis-WikiDebate-18 and RfC-Predecessor corpora, we performed a thorough comparison. This involved data cleaning and pre-processing on both datasets to remove noise and redundancy. The texts were converted to lowercase and punctuation was removed. For the most-frequent-word comparison and topic modeling, we removed English stop words using the NLTK [66] stop word list, and to avoid redundancy we lemmatized the texts using the same toolkit [66]. These pre-processing steps were only applied for exploratory data analysis; we conduct different pre-processing when building our models.
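The pre-processing described above corresponds roughly to the following NLTK-based sketch; the exact pipeline used in the thesis may differ in details such as tokenization.

import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads of the required NLTK resources.
nltk.download("punkt"); nltk.download("stopwords"); nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess(text):
    # lowercase the text and strip punctuation
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # tokenize, drop stop words, and lemmatize the remaining tokens
    return [LEMMATIZER.lemmatize(token)
            for token in nltk.word_tokenize(text)
            if token not in STOP_WORDS]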

Sentiment analysis Given the nature of argumentative discourse, it is not surprising to encounter a range of opinions with varying sentiments. We use two distinct corpora for classification and analysis purposes, which discuss various themes and might have contrasting attitudes. It is therefore important to understand how the sentiment of the utterances differs between the corpora. Sentiment analysis can determine the editor's attitude in the written text; it provides a broad overview of the tone of diverse discussions and lets us compare them across both corpora. To this end, we use the TextBlob10 library to compute and analyze the sentiments of the utterances. Table 4 compares the average sentiment, average number of words per utterance, and average length (in characters) of utterances in the corpora. While the average sentiment of utterances in the Webis-WikiDebate-18 subsets ranges from 0.059 to 0.072, RfC-Predecessor editors wrote utterances with an average sentiment score of 0.064, which indicates a roughly similar attitude and tone. Figure 4 illustrates a comparison of the sentiment score distributions in both corpora and reveals similar patterns. Although Figure 4d presents a slightly different appearance, the pattern remains comparable; this is primarily due to the significant amount of data within this subcategory.

Utterance lengths Table 4 also includes a comparison of the number of words per utterance. RfC-Predecessor corpus utterances consist of 36 words on average, after removing stop words. In contrast, Webis-WikiDebate-18 overall contains 33 words per utterance on average (25, 19, and 56 words per utterance for discourse act, argumentative relation, and frame, respectively). The table also displays the average utterance length for each corpus: RfC-Predecessor utterances are typically 253 characters long, compared with the Webis-WikiDebate-18 sub-categories, which average 180, 136, and 403 characters for discourse act, relation, and frame, respectively.

10https://github.com/sloria/textblob


Figure 4: Sentiment distribution of utterances in the RfC-Predecessor and Webis-WikiDebate-18 corpora; panels: (a) RfC-Predecessor, (b) Webis-WikiDebate-18 (Act), (c) Webis-WikiDebate-18 (Relation), (d) Webis-WikiDebate-18 (Frame)

Table 4: General comparison of the Webis-WikiDebate-18 and RfC-Predecessor corpora

Feature Webis-WikiDebate-18 (Discourse act / Relation / Frame) RfC-Predecessor

Average sentiment 0.067 0.059 0.072 0.064

Average words per utterance 25 19 56 36

Average utterance length (character) 180 136 403 253

Term frequency analysis We extract the most frequent words from each dataset to further analyze the textual content of the utterances. In this manner, we can assess how comparable the corpora's themes are. We use the FreqDist class from NLTK [66] to count the most frequent words in each corpus. Figure 5 illustrates the 25 most frequent words of each corpus. The word "article" appears most frequently in both corpora. Given the nature of Wikipedia talk pages, it is no surprise that discussions in both corpora follow similar behavior. The frequent occurrence of verbs such as think, see, say, make, and give demonstrates that discussions are frequently based on similar acts, which follows from the way Wikipedia talk pages function. If we exclude certain determiners (e.g. one, two), it is evident that the remaining words, such as article, source, reference, information, title, and section, share a common theme in both corpora.


Figure 5: Most frequent words in the RfC-Predecessor and Webis-WikiDebate-18 (discourse act, argumentative relation, and frame) corpora; panels: (a) RfC-Predecessor, (b) Webis-WikiDebate-18 (Act), (c) Webis-WikiDebate-18 (Relation), (d) Webis-WikiDebate-18 (Frame)
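The counts behind Figure 5 can be obtained with NLTK's FreqDist over the pre-processed tokens, for example as follows (assuming a flat token list produced by a pre-processing step like the sketch in the previous subsection).

from nltk import FreqDist

def top_words(tokens, n=25):
    # tokens: a flat list of pre-processed tokens from one corpus,
    # e.g. the output of the preprocess() sketch concatenated over utterances
    return FreqDist(tokens).most_common(n)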

Topic modeling To fully understand the similarities and differences in the content of the corpora, it is crucial to analyze their themes and topics; this tells us whether a model trained on one corpus can reasonably be used to make predictions on the other. We perform topic modeling to gain a deeper understanding of the context of the corpora. To this end, we employ the Non-Negative Matrix Factorization (NMF) method. To vectorize the tokens we use TF-IDF (Term Frequency - Inverse Document Frequency) weighting, which mitigates the impact of high-frequency words (like "article"). When creating the vectorizer, we eliminate words that appear in more than 90% or fewer than 5% of the utterances in the corpora. We then extract ten topics for each corpus using the NMF module from the Scikit-learn library [67]. The topics and their distribution in the corpora are illustrated in Figure 6. According to Figure 6b, the majority of discussions in the discourse act data concern sources, references, and citations. "Sentences", "information", "modification", and "title, image, or date" are the next most common topics after "citation". The latter part of the graph shows that topics such as "providing feedback", "joining discussion", "type of citation", and "Wikipedia article information" were frequently discussed, but to a lesser degree than the previously mentioned topics. Figure 6c, on the other hand, shows that in over 60% of the argumentative relation discussions, people talk about "disambiguation" and "discussion consensus". Additionally, around 20% of the discussions contain the topics "discussion policies", "long articles", and "merging". "Proposal agreement", "support reasoning", and "Wikipedia official terms" are the least discussed topics in those conversations.
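The pipeline can be sketched as follows; the toy utterances are invented, and we extract only two topics here (ten in the actual study) so that the example runs on the tiny corpus.

```python
# TF-IDF vectorization (max_df=0.9, min_df=0.05, as described above) followed
# by NMF topic extraction with Scikit-learn. Topic labels in Figure 6 were
# assigned from the top terms of each topic.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

utterances = [
    "The article needs a reliable source.",
    "I disagree, the source is already cited.",
    "We should merge this section into the main article.",
    "The title should be changed to match the sources.",
]

vectorizer = TfidfVectorizer(max_df=0.9, min_df=0.05, stop_words="english")
tfidf = vectorizer.fit_transform(utterances)

nmf = NMF(n_components=2, random_state=42)   # n_components=10 in the study
doc_topics = nmf.fit_transform(tfidf)        # utterance-topic weights
terms = vectorizer.get_feature_names_out()

for i, weights in enumerate(nmf.components_):
    top = [terms[j] for j in weights.argsort()[-5:][::-1]]
    print(f"Topic {i}: {', '.join(top)}")
```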


When comparing the above-mentioned topics with those of the RfC-Predecessor corpus, it is evident that the themes are similar and in some instances identical. Nevertheless, the distribution of these themes differs. The topic that appears most frequently in the RfC-Predecessor corpus, accounting for approximately 45% of all utterances, is "article information". The second and third most prevalent topics in this corpus are "merging and separating articles" and "disambiguation", respectively. Interestingly, the themes of these topics are identical to some of the topics in the argumentative relation discussions. Other topics found in this corpus, with lower prevalence, include "editorial perspective", "images, maps, and galleries", "modifying discussions", "formal terms", "citations", and "requests for comment". With the exception of the first and last topics, all the others are quite similar to those related to discourse acts and argumentative relations. In conclusion, the themes and tone of the conversations in the Webis-WikiDebate-18 and RfC-Predecessor corpora are similar. As such, it is reasonable to use a model trained on the former corpus to predict the argumentative attributes of utterances in the latter.


Figure 6: Topic modeling results for the RfC-Predecessor and Webis-WikiDebate-18 (discourse act and argumentative relation) corpora. Panels: (a) RfC-Predecessor, (b) Webis-WikiDebate-18 (Act), (c) Webis-WikiDebate-18 (Relation).


4 Methods

In this section, we outline the procedure we implemented for conducting the research. Figure 7 depicts the four steps of our approach. The first step is Argumentative attribute classification, in which we build three classifiers to classify the argumentative attributes (discourse act, argumentative relation, and frame) of discussion utterances. We train these classifiers using the labeled data available in the Webis-WikiDebate-18 dataset [3]. The second step is Discussion utterance classification, where we use the classifiers trained in the previous step to classify the discussion utterances in the RfC-Predecessor pairs corpus [23]. At the end of this step, all of the utterances are labeled along three dimensions. Based on those labels, we can then analyze successful and unsuccessful discussions to distinguish effective strategies. This analysis takes place in step three, Deliberative strategies analysis, in which we investigate which patterns are most frequent in successful and unsuccessful discussions. To do so, we examine the most prevalent patterns in each of the three dimensions (discourse act, argumentative relation, and frame) separately, as well as their combinations, in successful and unsuccessful discussions. Finally, in the last step, Binary classification, we use the discussion labels predicted in step two to train a binary classifier that predicts the success of a given discussion. In the subsequent sections, we discuss these four steps in more detail.

4.1 Argumentative Attribute Classification

To classify the argumentative attributes, we use both the "pre-train and fine-tune" paradigm and "prompt-based learning" to build our classifiers. For fine-tuning, we use two popular pre-trained transformer-based language models: BERT (Bidirectional Encoder Representations from Transformers) [68] and RoBERTa (Robustly Optimized BERT Pre-training Approach) [69]. Due to the limited amount of labeled data in the Webis-WikiDebate-18 corpus, we also use the prompting paradigm to investigate whether it can improve the performance of our classifiers compared to the typical pre-training and fine-tuning paradigm.

Baseline

In this step, we employ the support vector machine (SVM) [70] algorithm as our baseline. SVM is a supervised learning approach that can be applied to classification or regression tasks. An SVM's objective is to identify the optimal boundary (the "decision boundary") between the classes in the data. The boundary is chosen to maximize the margin, i.e., the distance between the boundary and the nearest data points of each class (the "support vectors"). SVMs perform particularly well in high-dimensional spaces, even when the number of dimensions exceeds the number of samples. SVM has been demonstrated to be effective in text classification tasks [71], and, according to Al-Khatib et al. (2018) [3], it achieves the best performance among classical machine learning algorithms on the Webis-WikiDebate-18 corpus. As a result, this algorithm has been chosen as the baseline in the first and last steps of the current study.
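A minimal version of this baseline, assuming TF-IDF features (the feature set of the original study may differ), could look as follows; the training utterances and their labels are invented stand-ins.

```python
# Sketch of the SVM baseline: TF-IDF features + a linear-kernel SVM.
# Labels are illustrative stand-ins for Webis-WikiDebate-18 classes.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

train_texts = ["I propose renaming the article.", "Good point, I agree with you."]
train_labels = ["proposal", "agreement"]

baseline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])
baseline.fit(train_texts, train_labels)
print(baseline.predict(["Let us rename the section."]))
```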


Figure 7: General overview of the four steps of the study.


BERT

In 2018, Devlin et al. [68] introduced BERT, the first bidirectional PLM. It overcomes unidirectional limitations by adopting a "masked language model" (MLM) pre-training task, which enables pre-training deep bidirectional representations. They also employed a "next sentence prediction" task to pre-train sentence-pair representations simultaneously. The model was primarily pre-trained on the English Wikipedia (2,500M words), making it a natural choice for our study; BookCorpus (800M words) [72] was also used in the pre-training procedure. Since, to the best of our knowledge, no previous study has classified the argumentative attributes of discourse act, relation, and frame, we consider fine-tuning BERT the logical first choice for this study.

BERT's architecture is composed of a stack of Transformer encoders, each of which contains feed-forward neural networks and self-attention heads. The first encoder in the stack takes a sequence of tokens as input and applies self-attention; the self-attention mechanism yields a better encoding of each token by looking at the other tokens in the sequence. The results are then passed to the feed-forward network, and the output, represented as a vector, is transmitted to the next encoder as input. For classification, the output of the final encoder layer is fed to a classifier with a softmax activation function.

Devlin et al. (2019) [68] present the BERT model in two sizes:

• BERT-BASE, which consists of 12 Transformer blocks (encoders), a hidden size of 768, 12 self-attention heads, and 110M parameters in total.

• BERT-LARGE, which is composed of 24 Transformer blocks (encoders), a hidden size of 1024, 16 self-attention heads, and 340M parameters in total.
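The fine-tuning setup for both models can be sketched with the Hugging Face Transformers library; the toy dataset, label count, and hyperparameters below are placeholders rather than the exact configuration used in our experiments.

```python
# Fine-tuning sketch; swap in "roberta-base" for the RoBERTa runs.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=2)

# Toy data; the real runs use Webis-WikiDebate-18 utterances and labels.
data = Dataset.from_dict({
    "text": ["I agree with the proposal.", "This source is unreliable."],
    "labels": [0, 1],
})
data = data.map(lambda b: tokenizer(b["text"], truncation=True,
                                    padding="max_length", max_length=128),
                batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=2)
Trainer(model=model, args=args, train_dataset=data).train()
```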

RoBERTa

In 2019, Liu et al. [69] developed a replication of BERT to investigate the effects of tuning hyperparameters and expanding the training data on the model's performance. They trained the model longer and over more data, using only the MLM pre-training objective. One important advantage of RoBERTa that convinced us to use it in this study is the size of its pre-training data: while BERT was pre-trained on 16GB of text, RoBERTa was pre-trained on over 160GB [69], including English Wikipedia and BookCorpus (16GB) [72], CC-NEWS11 (76GB), OpenWebText12 (38GB), and Stories (31GB) [73]. In this study, we fine-tune RoBERTa to find out how this large difference in the size of the training data impacts our classifiers. Having been pre-trained on longer sequences, RoBERTa is also well-suited to handle the longer texts commonly found in discussions.

Prompt-based learning

The major problem with supervised learning is that it requires labeled data for the task at hand, and such data is frequently insufficient for many applications. The pre-train and fine-tune paradigm is a common method to adapt pre-trained language models to many NLP tasks, and fine-tuning PLMs has been proven to achieve superior results over the conventional approach of training a neural network model from scratch. However, there are several limitations to directly fine-tuning PLMs.

11http://web.archive.org/save/http://commoncrawl.org/2016/10/newsdataset-available

12http://web.archive.org/save/http://Skylion007.github.io/OpenWebTextCorpus.


Table 5: Prompting examples for different classification tasks.

Task        Input ([X])                       Template                         Answer ([Z])
Topics      The face of the 2022 World Cup    [X] The text is about [Z].       sports, science, ...
Sentiment   I like the movie.                 [X] The movie is [Z].            great, good, ...
Intention   Where is the pharmacy?            [X] The question is about [Z].   quantity, place, ...

First, fine-tuning a PLM requires a substantial amount of data and computational resources for each downstream task. Second, the prompting approach more closely mimics the way the human brain carries out natural language processing tasks, as humans often require additional, task-specific information at the start of a task. For example, to learn from a human whether a review is positive, negative, or neutral, we would pose a question like "Do you think the review is positive, negative, or neutral?" to prompt the human to accomplish the task.

Recently, prompt-based learning has experienced a resurgence in natural language processing, and it has been demonstrated to have a great deal of potential for overcoming the limitations of previous approaches. Brown et al. (2020) [74] showed that a very large PLM with 175 billion parameters can be prompted directly, alleviating the need for additional fine-tuning data; this enables zero-shot and few-shot learning for several NLP tasks. It has also been shown that in low-data circumstances PLMs can achieve significant performance using prompting techniques [75, 74]. In this paradigm, the input text is modified using a textual template and fed to a particular PLM to conduct the task.

Liu et al. (2021) [76] described the basic prompting process in three steps:

Prompt addition: In this step, a prompting function is defined to pre-process the input text. This consists of two processes:

1. Creating a template, which consists of some fixed extra tokens and two slots: an input slot [X] for the input text and an answer slot [Z] for the predicted output, which will be used in the answer mapping step.

2. Filling the input slot [X] with the input text.

The answer slot [Z] can be located either in the middle of the template (a cloze prompt) or at the end (a prefix prompt).

Answer search: In this step, the answer slot [Z] in the prompt is filled with the highest-scoring candidate answer. For generative tasks, z may range over the entire language, while for classification tasks it is typically a limited subset of the language's terms. For instance, an answer space for a sentiment analysis task could be z = {"excellent", "good", "OK", "bad", "horrible"}, each of which represents a class in y = {++, +, 0, -, --}. The slot [Z] in the prompt template defined in the previous step is filled with one of the answers in the answer space z. To this end, we search the answer space z using a trained LM to determine the probability that each answer matches its respective prompt. Table 5 shows some examples of utilizing prompting in text classification tasks.

Answer mapping: Each potential answer has a corresponding output to which it is mapped in the class set y. In some tasks, such as machine translation, the output is the answer itself, but in other cases, each answer from z must be mapped to its corresponding class in y.
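Putting the three steps together, a cloze-style classification with a masked LM can be sketched as below; the template, verbalizer words, and label mapping are illustrative, not the prompt design used in our experiments.

```python
# Cloze-prompt sketch: prompt addition, answer search, and answer mapping.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

utterance = "I support merging these two articles."
prompt = f"{utterance} This comment is {tokenizer.mask_token}."  # [X] ... [Z].

# Illustrative answer space z; each word maps to one class in y.
answer_space = {"supporting": "support", "attacking": "attack",
                "neutral": "neutral"}

inputs = tokenizer(prompt, return_tensors="pt")
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]

# Answer search: score the first subword of each verbalizer word.
word_ids = {w: tokenizer.encode(" " + w, add_special_tokens=False)[0]
            for w in answer_space}
best_word = max(word_ids, key=lambda w: logits[word_ids[w]].item())
print(answer_space[best_word])  # answer mapping: word -> class
```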


4.2 Discussion Utterance Classification

The next step is to predict the argumentative attributes of the RfC-Predecessor dataset using the classifiers built in the previous step. After analyzing the performance of the models trained on the Webis-WikiDebate-18 corpus, we choose the most suitable one for the task and apply it to the RfC-Predecessor dataset to predict the discourse act, argumentative relation, and frame attributes of the corpus. At the end of this step, we have a corpus of 10,750 utterances labeled along the three dimensions.

4.3 Deliberative Strategies Analysis

Once the three argumentative attributes have been assigned to each utterance of the discussion pairs in the RfC-Predecessor corpus, the next step is to investigate the most frequent patterns on both sides of the discussion pairs to find common conversational trajectories. Mining common patterns in productive conversations provides insights into advantageous pattern sequences; conversely, identifying the most prevalent patterns in failed dialogues helps recognize the traits of conversational derailment. Sequential pattern mining techniques are effective for identifying recurring patterns in our dataset, where each discussion contains a sequence of classified utterances. More precisely, sequential pattern mining involves finding all common subsequences in a set of sequences, where each sequence consists of a list of elements and each element is made up of a set of items.

The process involves setting a minimum support threshold, which determines which subsequences appear frequently enough in the set of sequences to be considered "common" [77]. There are two categories of algorithms in this domain: Apriori-based algorithms like GSP [78], which scan the sequence database multiple times and are therefore inefficient on huge datasets, and non-Apriori-based algorithms such as FreeSpan [79] and PrefixSpan [80]. Compared to GSP and FreeSpan, PrefixSpan efficiently mines the complete set of patterns and operates much more quickly [80].

In this study, treating the three classified attributes in our dataset as the sequence database, we choose the PrefixSpan algorithm [80] to identify patterns in the discussions, for several reasons. Firstly, the PrefixSpan algorithm only considers patterns that exist in the database; in other words, unlike some other algorithms such as GSP [78], it does not combine patterns to generate candidates that do not exist in the database, and it is consequently much faster than algorithms that do. Secondly, the algorithm uses database projection, which makes the database smaller at each step. Moreover, the algorithm is efficient (though not the most efficient), because, unlike Apriori-based algorithms, PrefixSpan does not scan the sequence database multiple times. Finally, the algorithm is simple and easy to use via accessible Python libraries.
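The mining step itself is short with the prefixspan package; the encoded sequences below are invented for illustration, with each inner list representing one discussion as the sequence of its utterances' predicted labels.

```python
# Frequent sequential patterns with PrefixSpan (pip install prefixspan).
from prefixspan import PrefixSpan

discussions = [
    ["proposal", "support", "agreement"],
    ["proposal", "attack", "support", "agreement"],
    ["proposal", "support", "disagreement"],
]

ps = PrefixSpan(discussions)
# All subsequences occurring in at least two discussions (minimum support = 2).
for support, pattern in ps.frequent(2):
    print(support, pattern)
```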

After identifying the most frequent patterns on both sides of the successful and unsuccessful discussions, we conduct a thorough analysis to identify effective and productive strategies.

4.4 Binary Classification

The final step is building a binary classifier to predict whether a discussion, labeled in the previous steps, is successful or not. To achieve this, we create a model that makes predictions based on the entire discussion text as well as the three-dimensional labels predicted in the previous step. One complication is that, because the self-attention operation used by most Transformer-based models scales quadratically with sequence length, they are unable to digest lengthy sequences; for instance, BERT can process at most 512 tokens per input.
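As a rough illustration of this step (not our exact model), one could flatten each discussion into a single string that interleaves the predicted attribute labels with the utterance texts and train a linear classifier on it; the encoding function, label names, and toy data below are all hypothetical.

```python
# Hypothetical encoding for the success classifier: label-augmented text + SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def encode(discussion):
    # discussion: list of (act, relation, frame, text) tuples, one per utterance.
    return " ".join(f"{a} {r} {f} {t}" for a, r, f, t in discussion)

train = [
    [("proposal", "support", "verifiability", "We need a source here."),
     ("agreement", "support", "verifiability", "Agreed, I added one.")],
    [("proposal", "attack", "neutrality", "This wording is biased."),
     ("disagreement", "attack", "neutrality", "No, it follows the source.")],
]
labels = [1, 0]  # 1 = consensus reached (successful), 0 = no consensus

clf = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
clf.fit([encode(d) for d in train], labels)
print(clf.predict([encode(train[0])]))
```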
