MSc Artificial Intelligence
Master Thesis

Conversation Modeling in Offensive Language Detection

by
Rishav Hada
12137529

August 20, 2020

48 ECTS, Nov 2019 – Aug 2020

Supervisor: Dr. Ekaterina Shutova
Assessor: Dr. Giovanni Colavizza
Co-supervisors:
Dr. Saif M. Mohammad (National Research Council Canada)
Dr. Helen Yannakoudakis (King's College London)
Mr. Pushkar Mishra (Facebook AI, London)


Abstract

The unprecedented prevalence of hate speech and offensive and abusive language on social media platforms has made it an important societal problem of the present time. Offensive language has several undesirable psychological effects on individuals. It can make social media platforms counter-productive and have a negative impact on the well-being of the community. In order to safeguard the interests of their users, social media platforms attempt to mitigate offensive content. Consequently, over the past few years, there has been a growing interest in the natural language processing research community in building automated offensive language detection systems.

In this thesis, we explore the domain of offensive language detection. Most existing datasets in this domain have coarse-grained labels for comments. However, offensive comments are highly contextual in nature and vary in their degree of offensiveness. We create a dataset of comments taken from Reddit. Using the Best–Worst Scaling setup for annotations, we provide each comment in our dataset with a score that represents its degree of offensiveness. For each comment in the dataset, we preserve the conversation it occurs in. We build computational models to predict the degree of offensiveness of a given comment and study how conversation as context can be helpful for modeling offensive language. In particular, we use an attention mechanism to model a part of the conversation along with the comment in consideration. Finally, we provide an analysis of the results achieved and examine the role of context in offensive language detection.


Acknowledgements

To say that working on this project over the past year was a challenge would be an understatement.

We all have been, and still are, living through trying times throughout the world, fighting battles that have challenged us on a personal level and have changed us, I believe, for the better.

Foremost, I would like to thank Katia for giving me such a wonderful opportunity. You have truly been a beacon of hope and encouragement for me personally. We faced many challenges during the course of 10 months, and you have been patient throughout. Your kind words would always go a long way. When stuck, I would eagerly wait for our meetings on Friday, as I knew you would have exactly the right words to say to cheer me up and be hopeful!

Saif, Helen, and Pushkar thank you for your time and patience. This project would not have been possible without your constant support and guidance.

I would like to thank all of you for being there by my side in all things - big and small, in both a personal and professional capacity.

I am thankful to Dr. Giovanni Colavizza for agreeing to assess my thesis.

My feelings of unending gratitude for my parents cannot be expressed in words. Thank you for showing me your kindness - more so over the last six months. I really appreciate every single thing that you have done for me till date. It truly makes me emotional to think about it. Thank you for believing in me and being supportive of all my endeavours. Your words of encouragement mean a lot to me.

I can't thank my elder brother enough. Thank you for making my dream of doing a Master's abroad come true. Thank you for being the protective elder brother you are. I do not have to worry about anything with you being there. If I need anything, all I need to do is ask you for it. You would go to great lengths to make it possible for me. Thank you for tolerating all my annoying habits!

Any acknowledgement is incomplete without the mention of friends. I have an amazing bunch of friends who have kept me sane over the years. I would like to thank each one of you individually. Amish, Dhiman, Revati, Samarth, Savvina, Sohi, and Tanvir – Thank you for being there and sticking around for the ups and downs in my life. This wouldn’t have been possible without you.

Lastly, a big shout out to my colleagues in the Masters program. Thank you for being helpful throughout.

My heartfelt gratitude to my grandparents for showering me with your blessings from wherever you are.


Contents

Acknowledgements

1 Introduction
  1.1 Research Questions and Motivation
  1.2 Contributions
  1.3 Thesis Structure

2 Related Work
  2.1 Existing Offensive Language Detection Datasets
  2.2 Computational models
  2.3 Conversation Modeling

3 The Offensive Language Dataset
  3.1 Theory
  3.2 Annotation Pilots
  3.3 Final Annotation Round

4 Computational Models
  4.1 Pre-trained Language Representation
  4.2 Model Architectures
  4.3 Experimental Setup
  4.4 Performance Metrics
  4.5 Significance Testing
  4.6 Results
  4.7 Qualitative Analysis
  4.8 Exploring Subsequent Conversation
  4.9 Conclusion

5 Conclusion
  5.1 Future Work


Chapter 1

Introduction

In recent years, social media platforms have established themselves as a part of our everyday lives. Social media has become not only a major means of communication between friends and family, but also a medium for people to present their views and opinions on topics ranging from politics to sports and others. Social media connects people from various backgrounds who have shared interests. With diversity of people come varied opinions. Sometimes, debating a certain topic results in a heated discussion due to differences in opinion, often leading to the use of slurs, hate speech, and abusive and offensive language.

Such offensive language makes threads and discussions counter-productive (Salminen et al., 2018). A strong connection has been found between hate speech and actual hate crimes (Waseem and Hovy, 2016); hate speech on social media often leads to hate crimes. Therefore, it is important for social media platforms to identify such posts and conversations early and either flag them or ban them completely, so that they do not lead to something grave. Platforms such as Facebook, Twitter, and Reddit have modified their community guidelines over the years to prevent the use of such offensive language, thus prohibiting the use of their platforms for attacks on people based on characteristics such as race, ethnicity, gender, sexual orientation, and others. Currently, such moderation requires human analysis of posts and comments, which makes the process extremely time consuming given the amount of data to be reviewed (Waseem and Hovy, 2016).

Modern-day natural language processing (NLP) computational models are built with the intent of understanding the intricacies of human languages, so that they can aid humans with such redundant yet complex tasks. There are several challenges in the domain of automatic detection of offensive language (Wiedemann et al., 2018). First, there is no formal definition of offensive language due to its subjective nature. We therefore adopt a definition that combines the various components of offensive language while remaining inclusive of its various aspects. Thus, we define offensive language as comments that include, but are not limited to, being hurtful (with or without the usage of abusive words), intentionally harmful, improper treatment, harming the "self-concept" of another person, aggressive outbursts, name calling, anger and hostility, bullying, or hurtful sarcasm. Second, the task cannot depend only on a certain list of abusive keywords (explicit abuse). Waseem (2016a) describes explicit abuse as language that is unambiguous in its potential to be abusive. Such language clearly exhibits the use of slur words, for example "go kill yourself" or "Youre a sad little f*ck" (Van Hee et al., 2015). Computational models should also be able to identify offensive language in sentences where no such words are used but which can still be offensive (implicit abuse). Waseem (2016a) describes implicit abuse as language that is ambiguous in its potential to be abusive. Such language lacks the use of slur words but is offensive due to sarcasm or other factors, for example "youre intelligence is so breathtaking!!!!!!" or "most of them come north and are good at just mowing lawns" (Dinakar, Reichart, and Lieberman, 2011). Such cases of implicit abuse, when presented in isolation, can be hard to judge for humans as well. This makes offensive language highly dependent on sentence and discourse level semantics. The conversation in which implicit abuse occurs can be helpful to judge whether a comment is offensive.

The representation of the offensive class in a dataset is often boosted using keyword-based sampling strategies. As a result, existing datasets in the domain of offensive language are rich in explicit abuse and lack cases of implicit abuse (Wiegand, Ruppenhofer, and Kleinbauer, 2019). Current computational models for offensive language detection are limited in their ability to detect implicit abuse, due to the drawbacks of the datasets they use. We create an offensive language dataset with fine-grained degree of offensiveness scores. The dataset is very different from the existing datasets in the domain: existing datasets classify comments into discrete classes, whereas we represent the degree of offensiveness with fine-grained scores. In our dataset, we facilitate the presence of implicit abuse in various ways. Some novel features of our dataset are:

• We do not use a list of abusive or hateful keywords for sampling comments. We instead sample comments which exhibit strong emotions. We sample from an online community page called ChangeMyView, which encompasses a multitude of viewpoints on various topics. Conversations on the ChangeMyView page are argumentative in nature, as members challenge the viewpoints of others.

• We provide the original context for each comment, which can be helpful for better interpretation, especially in cases of implicit abuse.

• Lastly, in our dataset, comments with varying degrees of offensiveness are not binned together. The fine-grained degree of offensiveness score should help computational models learn more nuanced features from each comment.

On social media platforms, offensive comments rarely occur in isolation. They are often preceded or succeeded by other offensive comments. Offensive comments develop within a conversation. Zhang et al. (2018) show that conversations exhibit early traits of turning impolite at a later stage. In other words, the conversation in which a comment occurs plays a vital role in determining its offensiveness. Therefore, it is important to model offensive comments in conversation rather than in isolation. This motivates our research.

1.1 Research Questions and Motivation

In this thesis, we aim to study different aspects of offensive language vital for computational models. Our main research questions are:

1. Given the subjective nature of offensive language, is it better to represent the offensiveness of a comment with a score in a continuous domain rather than with discrete classes?

Utterances can be offensive in varying degrees. The existing datasets in this domain indicate only whether a comment is offensive or not. Whether a comment is offensive is highly subjective to an individual's opinion. Therefore, knowing the degree of offensiveness of a comment is important. Using comparative annotations, we create a dataset of comments with a real-valued score in a continuous domain. We explore the dataset by analysing certain samples for which the offensiveness class is already known.

2. Does modeling conversation help the task of offensive language detection?

Offensive language is inherently contextual (Gao and Huang, 2017). Individual comments can be incorrectly classified without proper context. Therefore, it becomes extremely important to account for context in this task. Context is especially useful when dealing with implicit abuse. We experiment with conversation-modeling deep learning (DL) architectures to investigate the impact of context in offensive language detection.

3. What benefits the task of offensive language detection more: word-level attention or sentence-level attention?

The existing datasets show bias towards certain keywords due to their sampling strategy. Keyword-based sampling overlooks the presence of implicit abuse in online communities. As a result, offensive language detection models do not perform well on cases of implicit abuse. Identifying implicit abuse is more difficult even for humans, due to the use of sarcasm and the lack of hateful terms. Sentence-level attention over context should help capture cases of implicit abuse. On the other hand, word-level attention should capture explicit abuse in the immediate neighbourhood of the comment. We examine which of the two is more helpful for offensive language detection.

1.2 Contributions

In this thesis, we explore the domain of offensive language, build a dataset with a fine-grained degree of offensiveness score for each comment, and evaluate computational models on this dataset. The dataset we create overcomes the key limitations of the existing datasets in that it does not exhibit any strong bias and contains context for each comment. We discuss the iterative procedure of creating our dataset. We model conversation using different deep learning architectures. From our experiments, we learn that context helps the task of offensive language detection. We analyse the cause of the improvement in performance when using the conversation models. We perform an additional experiment to provide subsequent comments as context and study its effect on the offensive language detection task.

1.3 Thesis Structure

Chapter 2: This chapter provides background and related work. It explains the key concepts used throughout the thesis and surveys current research trends in the domain of offensive language.

Chapter 3: In this chapter, we describe the entire data annotation procedure over 5 successive pilots, followed by the main round of annotation. We discuss the challenges faced in each pilot. We describe the comment sampling procedure for our dataset and explain the comparative annotation setup used. Finally, we analyze the quality of the annotations received and the methods we adopt to improve it.

Chapter 4: Here, we discuss the computational models that we build and train on our dataset. We build variants of context-aware models and report how they perform in comparison to our baseline model. We discuss the various settings for the experiments, report performance values, and qualitatively analyse the results.

NOTE ON COLLABORATION: The dataset was created in collaboration with my colleague, Sohi Sudhir. Sohi is also working on her thesis under the supervision of Dr. Ekaterina Shutova.


Chapter 2

Related Work

Natural language processing (NLP) systems have found applications in various domains like machine translation, dialogue generation, and sentiment analysis, amongst others. There has been a shift in the methods used to tackle NLP problems, from shallow machine learning models with hand-crafted features to deep neural models using word embeddings. The success of word embeddings has led to the development of better performing computational models.

Word embeddings are vector representations of words in high-dimensional latent spaces. Using word embeddings, semantic relationships among words can be captured. There are several existing word embeddings like word2vec, ELMo, and GloVe. One of the first approaches to tackling NLP tasks with deep learning used convolutional neural networks (CNN) (Collobert and Weston, 2008). However, CNNs lacked the ability to capture long-term sentence dependencies. This shortcoming was addressed with the advent of recurrent neural networks (RNN). RNNs are very effective at processing sequential information by recursively applying a computation to an input sequence conditioned on previously computed results. Due to long chains of computations, RNNs suffer from the problem of vanishing/exploding gradients. Gated variants of the RNN, such as long short-term memory (LSTM) networks and gated recurrent unit (GRU) networks, mitigate this limitation. LSTMs have been widely used for an array of NLP tasks. A key improvement over plain LSTMs was the use of the attention mechanism, which allowed models to learn which specific parts of the input data to attend to. This led to the next generation of language models with the attention mechanism as their backbone. These models, known as transformers (Vaswani et al., 2017), unlike LSTMs and other RNNs, consider an entire sequence at the same time. They employ the attention mechanism to weigh the influence of each position in the input sequence. Transformer models were a huge success and gradually started replacing LSTMs in various tasks. Devlin et al. (2018) use transformers to build a language representation model called Bidirectional Encoder Representations from Transformers (BERT). BERT is a trained transformer encoder stack. When fine-tuned for various NLP tasks, BERT gave state-of-the-art performance. Even without fine-tuning, pre-trained BERT can be used to create contextualized word embeddings.
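As a concrete illustration of the last point, the sketch below extracts contextualized word embeddings from a pre-trained BERT model with the Hugging Face transformers library. The model name, example sentence, and mean pooling are illustrative assumptions, not the setup used later in this thesis.

```python
# A minimal sketch: contextualized embeddings from pre-trained BERT
# (Hugging Face transformers; "bert-base-uncased" is an illustrative choice).
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

comment = "your intelligence is so breathtaking"
inputs = tokenizer(comment, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per word piece: shape (batch, sequence_length, 768).
token_embeddings = outputs.last_hidden_state
# A single fixed-size comment representation, here obtained by mean pooling.
comment_embedding = token_embeddings.mean(dim=1)
print(token_embeddings.shape, comment_embedding.shape)
```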

2.1 Existing Offensive Language Detection Datasets

Despite the growing need for automatic moderation of online content and other practical applications, NLP research on offensive language has been limited primarily by the lack of a universal definition of offensive language (Waseem and Hovy, 2016). The subjective nature of offensive language makes it a very challenging task for research (Founta et al., 2018). There exist a few datasets annotated for offensive language. These datasets comprise data from various platforms like Reddit, Facebook, Twitter, ask.fm, and more. Datasets vary in their composition: they focus on particular aspects of offensiveness like hate speech, racism and sexism, or personal attacks (Mishra, Yannakoudakis, and Shutova, 2019). Datasets also differ in their process of sampling comments. Although highly prevalent, offensive comments form just a small fraction of the total number of comments, which makes it very hard to sample enough comments with offensive language for a dataset. Ideally, we want a balanced dataset with equal representation for all classes so that computational models can learn more effectively. However, this is unrealistic, given the distribution of offensive language. Therefore, researchers often try to boost the representation of the under-represented class in their dataset via certain techniques, which introduce bias. In this section we review a handful of existing datasets and discuss their annotation process, strengths, and shortcomings.

Waseem and Hovy (2016) created a dataset of 16K tweets annotated for hate speech. Each tweet is annotated for sexism, racism, or neither. They conduct various tests on the dataset to analyze the features that improve the detection of hate speech. They bootstrap their corpus collection with an initial manual search of common slurs and terms targeted against certain minority groups. In the resulting tweets that contained hate speech, frequently occurring and referenced terms (such as #MKR), along with a small number of prolific users, were identified. This sample was then used to collect the tweets which formed their dataset. As can be seen, this procedure of data collection introduced keyword and author bias into their dataset. Each tweet was annotated manually by the authors and an outside annotator. Inter-annotator agreement was reported at κ = 0.84. They noted that disagreements were often reliant on the context or its lack thereof, highlighting the importance of taking context into account for this task. The above dataset was extended by Waseem (2016b). This dataset consists of 6K tweets, with 2,876 tweets overlapping with the Waseem and Hovy (2016) dataset. Waseem (2016b) studies the importance of annotation of hate speech by experts in the domain. The annotators were divided into three annotation groups: expert, amateur majority, and amateur full. They note that the distribution of annotations by the amateur majority group comes closest to the Waseem and Hovy (2016) dataset and differs greatly from amateur full and expert. Another label, for tweets which are both sexist and racist, was added. The expert annotators were feminists and anti-racism activists. The agreement between all annotator groups on this dataset and the Waseem and Hovy (2016) annotations is extremely low, at κ = 0.14.

Davidson et al. (2017) created a dataset of 24K tweets, each annotated as hate speech, offensive but not hate speech, or neither offensive nor hate speech. Each tweet in the dataset was annotated by 3 annotators and the majority class was used. This dataset also has keyword bias, since it uses a list of hate words from www.hatebase.org to sample the tweets. The authors later found that only 5% of the tweets were actually hate speech, indicating the imprecision of the list they used. The authors note that future work should distinguish between the different uses of hate words and slurs and look more closely at the social contexts and conversations in which hate speech occurs.

The above works have an important gap regarding the principled selection of the most appropriate labels for annotating aggressive online behavior (Founta et al., 2018): past studies do not explain the selection of the types of inappropriate speech which they use for their annotations. Addressing these issues, Founta et al. (2018) created another dataset using an iterative methodology allowing for controlled statistical analysis of label selection by annotators. They use the final labels for crowd-sourced annotation of 80K tweets. In order to annotate the tweets with the most appropriate labels, the authors use preliminary annotation rounds to identify the nature of confusion, thereafter eliminating any ambiguity in the labels used for the main annotation and thus achieving higher accuracy and consistency. The tweets were annotated as abusive, spam, hateful, or normal. To overcome the problem of keyword bias prevalent in the datasets mentioned above, the authors propose a boosted random sampling technique wherein a large part of the dataset is randomly sampled, followed by boosting of tweets that likely belong to one or more of the minority classes. Text analysis and preliminary crowdsourcing rounds were used to design a model that can pre-select tweets for the boosted part of the dataset. Finally, both sets are mixed together for crowd-sourced annotation. To overcome the problem of the subjective perception of offensive language, the authors used up to 20 annotators in exploratory rounds of annotation; this established the general level of agreement to be expected. As a result of the methodologies adopted, the authors obtain a rich abusive language dataset with high inter-annotator agreement for the tweets.

Gradually, the NLP research community started moving towards more fine-grained annotation of offensive language data, like identifying the target of abuse or identifying the various parties involved and the specific type of offensive language used. Identifying the target of abuse can give some interesting insights about a user or a community page; it can help us identify the nature of some community pages. For example, on Reddit there were multiple pages with the explicit purpose of posting and sharing hateful content (r/CoonTown, r/FatPeopleHate, r/beatingwomen) (Saleem et al., 2017). Zampieri et al. (2019) were the first to create a dataset with the target of abuse identified. They study hate speech with respect to a specific target. For nuanced annotation they adopt a three-level hierarchical annotation schema where each tweet is tagged:

• as offensive or not

• whether the offence is targeted or not

• and whether it targets an individual, a group or otherwise.

The dataset consists of 14K tweets retrieved from the Twitter API using keywords and constructions that are often included in offensive messages, which again introduces keyword bias.

Van Hee et al. (2015) created a dataset from data collected on Ask.fm, a site where users can ask and answer each other's questions. The dataset consists of 85,485 Dutch posts. The authors wanted to provide the annotators with context, hence all posts were presented within their original conversation when possible. They follow a hierarchical annotation scheme. Each post is first marked with a number between 0 and 2, where 0 indicates the post does not contain indications of cyberbullying, 1 that the post contains indications of mild cyberbullying, and 2 that the post contains indications of serious cyberbullying. If the post is marked as cyberbullying, then the role of the comment's author is identified, i.e. whether the person is a harasser, victim, or bystander. Bystanders are further identified as supporters of the victim or of the harasser. Lastly, the annotators had to mark throughout the text span which sentence belonged to which cyberbullying category from the given list of categories: threat, insult, curse, defamation, sexual talk, defense, encouragement to the harasser. The authors thus provide a fine-grained annotation. Only two annotators were used for the task. The dataset, however, is heavily skewed, with only 6% of the total posts actually being a cyberbullying event (posts with score 1 or 2). It was noted that insults were the most common type of cyberbullying, followed by defense and curse. They note that for fine-grained classification, data sparsity can be one of the main challenges.

Salminen et al. (2018) released a dataset with a fine-grained taxonomy. They collected 137K comments made on YouTube and Facebook content of a major online news and media company with an international audience. In this sample, they identified the most frequent abusive words and prepared a list of 200 such words. Using this list they extracted approximately 22K hateful comments. Further, they ran an LDA-based topic model on the dataset and identified 10 topics, which were labelled by researchers. Using open coding the authors came up with a taxonomy of 13 categories and 16 subcategories. The categories include the type of language and the target. The complete taxonomy is shown in Figure 2.1.

FIGURE 2.1: Hate target taxonomy. Image taken from Salminen et al., 2018.

Biases like author bias, topic bias, and keyword bias are prevalent in the existing datasets as a result of the various focused sampling strategies applied by different authors in order to increase the abusive content in their datasets (Wiegand, Ruppenhofer, and Kleinbauer, 2019). The sampling procedure followed by Founta et al. (2018) comes closest to random sampling. Other datasets have topic bias, as they start by sampling from keywords or from certain online communities for particular topics where they expect abusive content to be present in abundance. For example, Waseem and Hovy (2016) and Davidson et al. (2017) use a list of hate keywords taken from a website. The technique of keyword-based sampling results only in explicit abuse, while the much more difficult implicit abuse remains undetected. Wiegand, Ruppenhofer, and Kleinbauer (2019) identify that the sampling procedure of the Waseem and Hovy (2016) dataset introduced a bias towards women in sports. A few of the words with the highest Pointwise Mutual Information (PMI) towards abusive tweets are: commentator, football, announcer, sport, and others. As a result, when a classifier trained on the dataset was tested on new tweets that talk about football, it predicted 70% of the data to be abusive when in reality only 5% was. Therefore, classifiers trained on the Waseem and Hovy (2016) dataset do not generalize well to abusive language. Additionally, on removing the high-PMI words which are not abusive in nature (e.g. commentator, sports) and the keywords used to create the dataset, it was noted that the performance of the classifier decreases considerably. Many datasets suffer from another bias, described as author bias. Author information can be extracted explicitly or be learned by classifiers implicitly from the data we provide. In cases of author bias, the classifier learns the particular authors from whom the abusive data is mainly coming. For example, it was identified that the Waseem and Hovy (2016) dataset is highly skewed towards 3 authors; a classifier would then classify tweets from these authors as abusive even if they are not. Removal of such biases from datasets is an important task for improving the robustness of models. Park, Shin, and Fung (2018) report several methodologies to remove gender bias from datasets, like using debiased word embeddings, swapping the gender, and finally bias fine-tuning. Badjatiya, Gupta, and Varma (2020) discuss some principled approaches for debiasing a dataset, specifically for stereotypical bias, an over-generalized belief about a word being hateful or neutral.

The datasets described above do not provide the context in which the abusive comment appeared. Context is very important for the offensive language detection task because of the inherently contextual nature of such comments (Gao and Huang, 2017). Individual comments can be wrongly classified without proper context. Gao and Huang (2017) created a dataset of 1,528 Fox News user comments with context. They include the entire thread, the original post of the comment, and the username of the person who posted it. They later use context-aware models and note an improvement on their dataset when using context. Qian et al. (2019) created another dataset with context. Their datasets consist of 5K conversations retrieved from Reddit and 12K conversations retrieved from Gab. The comments are annotated as hate speech or not hate speech. They also introduce human-written intervention responses to build generative models that automatically mitigate the spread of these types of conversations. They sample comments from toxic community pages, and comment threads with fewer than 20 comments were annotated. Due to their comment sampling strategy, 76.6% of the conversations from Reddit and 94.5% of the conversations from Gab contained hate speech.

2.2 Computational models

The existing datasets vary greatly in their composition. Over the course of their development, offensive language datasets have moved from a binary label to more fine-grained labels. Mishra, Yannakoudakis, and Shutova (2019) note that most datasets contain discrete labels only. Depending on the type of dataset, there are computational models with varying approaches to tackle the task. In this section we discuss a few approaches.

The earliest approaches to abusive language detection were feature-engineering based. On a linear support vector machine (SVM) classifier, Yin et al. (2009) used three types of features: local features, sentiment features, and contextual features. Local features were TF-IDF weights of words in the post; sentiment features were TF-IDF weights for pronouns and foul language; and the contextual feature was a similarity value of a post with its neighbouring posts. They use similarity because the vast majority of the dataset would be non-harassment posts, and therefore harassment posts should be very dissimilar to the cluster of posts they occur in. They conduct experiments on three different datasets, Kongregate, Slashdot, and MySpace, and report F1 values of 0.442, 0.298, and 0.313 respectively. Finally, the authors stress the benefit of using contextual features as opposed to using local features alone. Other works include lexicon-based abuse detection systems (Razavi et al., 2010; Njagi et al., 2015; Wiegand et al., 2018), Bag-of-Words (BoW) features (Sood, Antin, and Churchill, 2012; Warner and Hirschberg, 2012; Van Hee et al., 2015), and user profiling (Dadvar et al., 2013; Galán-García et al., 2014). In user profiling, the models are provided with certain information about the users, like their age, username, geo-location, sex, and others. However, due to the privacy policies of various platforms, gathering user information is not always possible.
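For illustration, the sketch below shows the general shape of such a feature-engineering pipeline: TF-IDF word features feeding a linear SVM, in the spirit of the early approaches described above. The toy data, n-gram range, and regularization constant are assumptions, and the sentiment and contextual-similarity features of Yin et al. (2009) are not reproduced here.

```python
# A minimal sketch of a feature-engineering baseline: TF-IDF word/bigram
# weights + linear SVM. Toy data and hyperparameters are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_comments = [
    "thanks for sharing, this was really helpful",
    "go away, nobody wants to read your garbage",
]
train_labels = [0, 1]  # 0 = not offensive, 1 = offensive

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),  # local TF-IDF features
    LinearSVC(C=1.0),
)
clf.fit(train_comments, train_labels)
print(clf.predict(["what a pathetic take"]))
```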

With the creation of word embeddings like GloVe, ELMo, and word2vec, several neural-network based deep learning approaches were proposed to achieve better performance on the offensive language detection task. Mishra et al. (2018) built a model which used GloVe, LSTMs, and author profiles. They report an F1 of 0.87 and describe the role of author profiles in achieving performance gains. They use author information by creating an undirected, unlabeled community graph wherein nodes are authors and edges are the connections between them. This graph is then turned into author embeddings using the node2vec framework. Other deep learning approaches include using CNNs (Park and Fung, 2017), gated recurrent units (GRU) (Wang, 2018; Zhang, Robinson, and Tepper, 2018), transfer learning (Wiegand, Siegel, and Ruppenhofer, 2018), and more.

2.3 Conversation Modeling

Conversation modeling is one of the major challenges in NLP. A conversation consists of structured and coherent groups of sentences instead of isolated and unrelated ones; these coherent groups of sentences are referred to as discourse. Via conversation modeling we aim to uncover linguistic structures from texts at several levels. We want NLP models to understand the conversation so that they can determine the connections between utterances. Conversation modeling has wide applications in NLP tasks like text generation, summarization, and dialogue modeling. It is also important in sentiment analysis tasks where we want to process large documents or conversations. Attention mechanisms have been very successful in modeling context in language; they have led to the development of the present-day state-of-the-art encoder-decoder models, the transformers (Vaswani et al., 2017). An attention mechanism looks at the input sequence at each time step and gives more weight to the important words in the sequence, just like how humans read a sentence and tend to retain the important words. The attention mechanism was proposed by Bahdanau, Cho, and Bengio (2014). The central idea behind attention is to use the intermediate encoder states to construct a context vector for the decoder. Several modifications and updates to the basic attention mechanism have been proposed which are better suited to certain tasks, like the hierarchical attention network (HAN) by Yang et al. (2016), the bidirectional HAN by Remy, Tixier, and Vazirgiannis (2019) for document classification, and multi-head attention by Cordonnier, Loukas, and Jaggi (2020) for machine translation.

Some researchers have highlighted the importance of modeling context for abusive language detection, as abusive language is inherently contextual (Gao and Huang, 2017; Waseem and Hovy, 2016; Qian et al., 2019). In the existing hate speech datasets and models, context information has been severely overlooked. Gao and Huang (2017) describe context as text, symbols, or any other kind of information related to the original text. Gao and Huang (2017) explored logistic regression and neural network models that incorporate contextual information. Their neural net models contain separate learning components that model the compositional meanings of context information. Their evaluation confirms that context-aware models indeed outperform the existing models for the task. For the logistic regression model, four kinds of features were extracted: word-level and character-level n-gram features, and two types of lexicon-derived features. The lexicon-derived features are the Linguistic Inquiry and Word Count feature and the NRC emotion lexicon feature, which capture emotion clues in text. By taking in these features, they note a limited improvement in performance. The neural network model had three different inputs: the target comment, its news title, and its username. An attention mechanism was used to model the comment text along with the context. They note that the attention mechanism significantly improves hate speech detection performance.

Chakrabarty and Gupta (2018) implemented a context-aware attention based model for understanding Twitter abuse. They show that context-aware attention helps in focusing on certain abusive keywords when used in specific contexts and improves the performance of abusive behavior detection models. Their model uses a BiLSTM unit, because of its ability to capture long-term dependencies, coupled with an attention mechanism which gives a weight to each word representation obtained from the BiLSTM layer. Experiments were conducted on three relevant datasets and they observed more robust and better performance. Chakrabarty, Gupta, and Muresan (2019) discuss how contextual attention works better than self-attention, particularly when it comes to modeling implicit abusive content. The key difference between contextual attention and self-attention is that the former uses a word-level context vector that is randomly initialized and jointly learned during the training process. They observed that contextual attention models outperform the self-attention models for both simple and stacked architectures. The authors justify this performance gap by the fact that the context vector can be treated as a global importance measure of words in text, because it takes into account which word to attend to based on how that word has been used in different contexts in the training set.
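To make the distinction concrete, the sketch below implements the kind of contextual attention described above: BiLSTM states are scored against a randomly initialized context vector that is learned jointly with the rest of the model. This is an illustrative PyTorch layer with assumed dimensions, not the exact architecture of the cited works.

```python
# Illustrative sketch of contextual (word-level) attention over BiLSTM states,
# following the general recipe described above; dimensions are assumptions.
import torch
import torch.nn as nn

class ContextualAttentionEncoder(nn.Module):
    def __init__(self, embed_dim=300, hidden_dim=128):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                              bidirectional=True)
        # Randomly initialized context vector, learned during training.
        self.context = nn.Parameter(torch.randn(2 * hidden_dim))
        self.proj = nn.Linear(2 * hidden_dim, 2 * hidden_dim)

    def forward(self, embedded):           # embedded: (batch, seq_len, embed_dim)
        states, _ = self.bilstm(embedded)  # (batch, seq_len, 2 * hidden_dim)
        keys = torch.tanh(self.proj(states))
        scores = keys @ self.context       # (batch, seq_len)
        weights = torch.softmax(scores, dim=1)
        # Weighted sum of word states = attended sentence representation.
        return (weights.unsqueeze(-1) * states).sum(dim=1)

encoder = ContextualAttentionEncoder()
sentence = torch.randn(4, 20, 300)   # a batch of 4 embedded comments
print(encoder(sentence).shape)       # torch.Size([4, 256])
```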


Chapter 3

The Offensive Language Dataset

Owing to the emerging importance of the task, there is a growing interest in the field of offensive language detection. Datasets like the ones proposed by Spertus (1997), Djuric et al. (2015), Waseem and Hovy (2016), Waseem (2016a), Davidson et al. (2017), Founta et al. (2018), Golbeck et al. (2017), Wulczyn, Thain, and Dixon (2016), and Salminen et al. (2018) are among the most frequently used. All the existing datasets provide discrete labels for the text instances ({Covertly, Overtly, Non}-aggressive; Abusive, Spam, Hateful, Normal; and so on). In other words, they provide coarse categorical annotations only. There are a few datasets which categorize the degree of offensiveness of a particular comment into discrete classes; in fact, many works have made a simplifying assumption to make this a binary labeling task. But comments can be offensive in continually varying degrees. In contrast, we create a dataset that captures the degree of offensiveness. Knowing the degree of offensiveness has practical implications, as it gives detailed insights about a comment.

We create a labeled dataset for offensive language by crowd-sourcing. As stated by Founta et al. (2018), human labeling poses certain challenges when dealing with offensive language. In our dataset each comment is annotated for its degree of offensiveness using a technique called Best–Worst Scaling (BWS). BWS is an annotation scheme that addresses the limitations of traditional rating scale methods, such as inter- and intra-annotator inconsistency, by employing comparative annotations (Louviere, 1991; Louviere, Flynn, and Marley, 2015; Kiritchenko and Mohammad, 2016; Kiritchenko and Mohammad, 2017). In the BWS annotation scheme, the annotators are provided with an n-tuple (where n > 1, and commonly n = 4) and asked which item is the best and which is the worst (best and worst correspond to the highest and lowest degree of the property of interest). Best–worst annotations are especially efficient when using 4-tuples, as each annotation results in inequalities for 5 of the 6 item pairs. For example, for a 4-tuple with items A, B, C, and D, if A is the best and D is the worst, then A > B, A > C, A > D, B > D, and C > D. We can calculate real-valued scores of association between the items and the property of interest from the best–worst annotations for a set of 4-tuples (Orme, 2009; Flynn and Marley, 2014). The scores can be used to rank items by their degree of association with the property of interest.

Much of the previous work in the domain of offensive language detection uses data from Twitter. However, since tweets are very short texts that do not provide a lot of context, computational models built on these datasets do not generalize well (Wiegand, Ruppenhofer, and Kleinbauer, 2019). Our dataset comprises comments from Reddit. Reddit is a popular social media platform with a wide user base. It is a social news aggregation, web content rating, and discussion website. Registered members submit content to the site such as links, text posts, and images, which are then voted up or down by other members [1]. Reddit, being a discussion forum, is rich in textual data. Users on Reddit discuss a multitude of topics including entertainment, personal interests, politics, global news, and others. Offensive and foul language on the platform would discourage participation from people and hurt their sentiments. An added advantage of using Reddit data is that it preserves conversation in the form of a thread, which can be helpful for identifying offensive language.

In this thesis, for the first time, we create an offensive language dataset of comments aggregated from Reddit with labels being offensiveness scores. For a given comment, the score ranges from -0.542 to 0.521. A score of 0.521 means that the comment conveys the highest degree of offensiveness; a score of -0.542 means that the comment conveys the lowest degree of offensiveness. Our dataset contains 5,135 comments from the ChangeMyView (CMV) [2] subreddit (community) on Reddit. CMV is a discussion forum on Reddit where users post and comment on controversial and debatable topics. This community garners attention from over a million users coming from diverse backgrounds. Discussion on challenging viewpoints often derails into members using offensive language. Therefore, we expect the data to be rich in implicit and explicit abuse.

[1] https://en.wikipedia.org/wiki/Reddit
[2] https://www.reddit.com/r/changemyview/

We adopt an iterative process to build the dataset. We conduct multiple pilot rounds on small numbers of samples. In each successive pilot round, we adopt certain measures to mitigate the drawbacks and the problems faced in the previous round. This chapter is divided into three parts. We first discuss the theory behind BWS, which we employ for the annotation of our dataset. This is followed by our pilot rounds, specifically the key lessons learnt in each and how they inform the process of our final dataset creation. Then we proceed to discuss the dataset itself and the methodology used.

3.1 Theory

When using a rating scale setup, a common problem is inconsistency in annotations among different annotators. One annotator might assign a score of 7.9 to a piece of text, whereas another annotator may assign a score of 6.2 to the same text. It is also common for the same annotator to assign different scores to the same text at different points in time. Further, annotators often have a bias towards different parts of the scale, known as scale region bias (Asaadi, Mohammad, and Kiritchenko, 2019). Therefore, we adopt a comparative annotation setup known as BWS. Using BWS, we provide a fine-grained degree of offensiveness score for each comment in a continuous range.

3.1.1 Best–Worst Scaling

Best–Worst Scaling (BWS) was proposed by Louviere (1991). Kiritchenko and Mohammad (2017) show through experiments that BWS produces more reliable fine-grained scores than scores acquired using rating scales. Within the NLP community, BWS has thus far been used only for creating datasets for relational similarity (Jurgens et al., 2012), word-sense disambiguation (Jurgens, 2013), word-sentiment intensity (Kiritchenko, Zhu, and Mohammad, 2014), phrase sentiment composition (Kiritchenko and Mohammad, 2016), and tweet-emotion intensity (Mohammad and Bravo-Marquez, 2017; Mohammad and Kiritchenko, 2018). Further, ours will be the first dataset with degree of offensiveness scores for comments.

3.1.2 Annotating with Best–Worst Scaling

We follow the procedure described in Kiritchenko and Mohammad (2016) to obtain BWS annotations. Annotators are presented with 4 comments (a 4-tuple) at a time and asked to select the comment which is most offensive (least supportive) and the comment which is least offensive (most supportive). 2N (where N is the number of comments in the dataset) distinct 4-tuples are randomly generated, such that each comment is seen in eight different 4-tuples and no two 4-tuples have more than 2 items in common. We use the script provided by Kiritchenko and Mohammad (2016) to obtain the 4-tuples to be annotated. Kiritchenko and Mohammad (2016) refer to the tuple generation process as random maximum-diversity selection (RMDS). Ideally, we want each tuple to contain items that cover the entire range of offensiveness. However, this is not possible since we do not know the degree of offensiveness of a comment beforehand (Mohammad and Kiritchenko, 2018). To address this issue, RMDS is used. RMDS ensures maximum diversity in a tuple by maximizing the number of unique items that each item co-occurs with. BWS annotations on the tuples generated by RMDS result in direct comparative rankings for the maximum number of item pairs (Mohammad and Kiritchenko, 2018).
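The released script handles the actual tuple generation; the sketch below only illustrates the bookkeeping involved, using a simplified scheme (eight shuffled passes over the comments, chunked into 4-tuples) that gives every comment exactly eight tuple memberships but does not enforce the RMDS maximum-diversity constraint.

```python
# Simplified illustration of generating 2N 4-tuples so that every comment
# appears in exactly eight tuples. This is NOT the RMDS script by
# Kiritchenko and Mohammad (2016); it omits the pairwise-overlap constraint
# and is meant only to show the bookkeeping.
import random

def generate_tuples(comment_ids, tuple_size=4, appearances=8, seed=0):
    assert len(comment_ids) % tuple_size == 0
    rng = random.Random(seed)
    tuples = []
    for _ in range(appearances):
        ids = list(comment_ids)
        rng.shuffle(ids)
        tuples.extend(tuple(ids[i:i + tuple_size])
                      for i in range(0, len(ids), tuple_size))
    return tuples  # 2N tuples when appearances=8 and tuple_size=4

tuples = generate_tuples(range(100))
print(len(tuples))  # 200 tuples for N=100 comments
```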

Kiritchenko and Mohammad (2016) show that on a word-level sentiment task, using just three annotations per 4-tuple produces highly reliable results. However, since we are working with long comments and since annotation of offensive language is a hard task, we have each tuple annotated by 6 annotators. Since each comment is seen in eight different 4-tuples, we obtain 8 × 6 = 48 judgements per comment.

Annotation Aggregation: The BWS responses are converted to scores using a simple counting procedure (Orme, 2009; Flynn and Marley, 2014). For each item, the score is the proportion of times the item was chosen as the most offensive minus the proportion of times the item was chosen as the least offensive.
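A minimal sketch of this counting procedure, assuming the annotations are available as (4-tuple, most-offensive choice, least-offensive choice) records (that input format is an assumption for illustration):

```python
# Best-Worst Scaling counting procedure: for each comment,
# score = (proportion of annotations where it was chosen most offensive)
#       - (proportion of annotations where it was chosen least offensive).
from collections import Counter

def bws_scores(annotations):
    """annotations: list of (tuple_of_4_ids, best_id, worst_id) records."""
    appearances, best, worst = Counter(), Counter(), Counter()
    for tuple_ids, best_id, worst_id in annotations:
        appearances.update(tuple_ids)
        best[best_id] += 1
        worst[worst_id] += 1
    return {c: (best[c] - worst[c]) / appearances[c] for c in appearances}

demo = [(("a", "b", "c", "d"), "a", "d"),
        (("a", "b", "c", "e"), "b", "e")]
print(bws_scores(demo))  # e.g. a: 0.5, d: -1.0; scores lie in [-1, 1]
```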

3.1.3 Reliability of Annotations

We cannot use standard inter-annotator agreement measures to ascertain the quality of comparative annotations. The disagreement that arises in tuples that have two items close together in their degree of offensiveness is a useful signal for BWS. If, for a particular 4-tuple, the annotators are not able to consistently identify the comment which is most or least offensive, then the disagreeing items in that tuple will have scores that are close to each other, which is the desired result.

Using the scripts made available by Kiritchenko and Mohammad (2016), we calculate average split-half reliability (SHR) values. The quality of annotations can be assessed by measuring the reproducibility of the end result: if repeated manual annotations from multiple annotators produce similar rankings and scores, then we can be confident about the quality of the annotations received. To assess this reproducibility, we calculate the average SHR over 100 trials. SHR is a commonly used approach to determine consistency in psychological studies. To compute SHR values, the annotations for each 4-tuple are randomly split into two halves. Using these two splits, two sets of rankings are determined, and we then calculate the correlation between these two sets. A high correlation value indicates that the annotations are of good quality. It should be noted that these SHR values are computed using only half the number of annotations on each side; correlation values are expected to be higher if the experiment is repeated with the full set of annotations. Therefore, the values obtained are a lower bound on the quality of the annotations obtained.
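A sketch of this split-half computation, reusing the bws_scores helper shown earlier and assuming six (best, worst) judgements per tuple; the data layout and the Spearman choice are illustrative, since the thesis uses the released scripts.

```python
# Average split-half reliability (SHR) over repeated random splits: the six
# annotations of every 4-tuple are split into two halves, scores are computed
# independently for each half, and the two rankings are correlated.
import random
from scipy.stats import spearmanr

def split_half_reliability(annotations_per_tuple, n_trials=100, seed=0):
    """annotations_per_tuple: dict mapping a 4-tuple of comment ids to a list
    of (best_id, worst_id) judgements for that tuple."""
    rng = random.Random(seed)
    correlations = []
    for _ in range(n_trials):
        half_a, half_b = [], []
        for tuple_ids, judgements in annotations_per_tuple.items():
            shuffled = judgements[:]
            rng.shuffle(shuffled)
            mid = len(shuffled) // 2
            half_a += [(tuple_ids, b, w) for b, w in shuffled[:mid]]
            half_b += [(tuple_ids, b, w) for b, w in shuffled[mid:]]
        # bws_scores is the counting helper defined in the sketch above.
        scores_a, scores_b = bws_scores(half_a), bws_scores(half_b)
        common = sorted(set(scores_a) & set(scores_b))
        rho, _ = spearmanr([scores_a[c] for c in common],
                           [scores_b[c] for c in common])
        correlations.append(rho)
    return sum(correlations) / len(correlations)
```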

3.1.4 Quality Control

To determine the quality of annotators we use CrowdTruth metrics, a set of metrics that capture and interpret inter-annotator disagreement in crowdsourcing annotation tasks (Dumitrache et al., 2018). We used a CrowdTruth metric called the Worker Quality Score (WQS). WQS measures the overall agreement of one crowd worker with the other workers. Given a worker i, WQS(i) is the product of two separate metrics, the worker-worker agreement WWA(i) and the worker-media unit agreement WUA(i):

    WQS(i) = WUA(i) × WWA(i)

WWA for a given worker i measures the average pairwise agreement between worker i and all other workers, over all media units they annotated in common. WUA measures the similarity between the annotations of a worker and the aggregated annotations of the rest of the workers (more details about these metrics can be found in the paper by Dumitrache et al. (2018)).
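The sketch below only illustrates how the two factors combine for BWS-style judgements; it is a simplified stand-in, not the CrowdTruth implementation, and the agreement definitions used here (match rate on the "most offensive" pick) are assumptions.

```python
# Simplified illustration in the spirit of WQS(i) = WUA(i) * WWA(i).
# Agreement here is the fraction of shared 4-tuples on which two annotation
# sets pick the same "most offensive" comment; this is an assumption, not
# the CrowdTruth metric definitions. Assumes at least two workers.
from collections import defaultdict

def agreement(picks_a, picks_b):
    shared = set(picks_a) & set(picks_b)
    if not shared:
        return 0.0
    return sum(picks_a[t] == picks_b[t] for t in shared) / len(shared)

def worker_quality(picks_by_worker):
    """picks_by_worker: {worker_id: {tuple_ids: most_offensive_id}}."""
    wqs = {}
    for w, picks in picks_by_worker.items():
        others = {o: p for o, p in picks_by_worker.items() if o != w}
        # WWA-like term: average pairwise agreement with every other worker.
        wwa = sum(agreement(picks, p) for p in others.values()) / len(others)
        # WUA-like term: agreement with the majority pick of the other workers.
        counts = {}
        for o_picks in others.values():
            for t, pick in o_picks.items():
                counts.setdefault(t, defaultdict(int))[pick] += 1
        majority_pick = {t: max(c, key=c.get) for t, c in counts.items()}
        wua = agreement(picks, majority_pick)
        wqs[w] = wua * wwa
    return wqs
```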

3.2 Annotation Pilots

We carried out all the annotation tasks on Amazon Mechanical Turk (AMT). Owing to the strong language, an adult content warning was given for the task. To maintain cultural uniformity, only annotators from the United States of America were allowed to participate. To maintain the quality of annotations, only annotators with a high approval rate were allowed to participate.

3.2.1 Pilot 1 & 2

Pilot 1 had a total of 3000 comments, a portion of which was sampled randomly, while the rest were taken from different pre-identified subreddits [4]. We selected subreddits which we expected to be controversial, e.g., on topics like religion or world politics where members might have conflicting viewpoints. This would naturally boost the amount of offensive language in the dataset. The first pilot annotation had a simple setup. We provided the annotators with the comment to be annotated (the target comment), along with the 2 previous comments and the original post – the context. Annotators were then asked to mark whether the target comment, considering the context, was Abusive or Not Abusive. Figure 3.1 shows a sample questionnaire for pilot 1.

[4] List of subreddits used: askmen, askwomen, celebhub, environment, science, sports, trans, vaxx, ...

Each of the 3000 comments was annotated by 8 different annotators, and we classified each comment according to the majority annotation. This gave us 3.6% abusive comments, 94.13% not abusive comments, and the rest were ties. The inter-annotator agreement, Fleiss' κ, was found to be 0.29, which is considerably low. In offensive language annotation, comments can be difficult to classify strictly into binary extremes (Salminen et al., 2018). In other words, for some comments, depending upon an individual's opinion, it might be hard to classify the comment as strictly abusive or strictly not abusive. To address this issue, we ran the second pilot. In pilot 2, everything else remaining the same, we changed our annotation setup to a Likert scale setup, in which annotators are asked to specify their level of agreement or disagreement on a symmetric scale for a given statement.

FIGURE 3.1: Sample questionnaire for pilot 1

We presented the annotators with 4 options:

• Definitely abusive
• Likely abusive
• Likely not abusive
• Definitely not abusive

This would allow the annotators more room for cases where a comment's classification is not clear-cut. Figure 3.2 shows a sample questionnaire for pilot 2. To classify the comments, we used a scoring format. Over its 8 annotations, each comment i accumulates a score score_i, where each annotation contributes:

    +10 if i was marked definitely abusive
     +5 if i was marked likely abusive
     −5 if i was marked likely not abusive
    −10 if i was marked definitely not abusive

We classified the comment as abusive if the net score_i > 0, neutral if score_i = 0, and not abusive if score_i < 0.
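A minimal sketch of this scoring rule (the label strings are placeholders):

```python
# Pilot 2 scoring rule: sum the weights of the 8 annotations and classify by
# the sign of the total. Label strings are illustrative.
WEIGHTS = {
    "definitely abusive": 10,
    "likely abusive": 5,
    "likely not abusive": -5,
    "definitely not abusive": -10,
}

def classify(annotations):
    score = sum(WEIGHTS[a] for a in annotations)
    if score > 0:
        return "abusive"
    if score < 0:
        return "not abusive"
    return "neutral"

print(classify(["likely abusive"] * 3 + ["definitely not abusive"] * 5))  # not abusive
```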

This setup gave us only 2.2% abusive comments, and Fleiss' κ = 0.0014. Hence, the setup did not give us any improvements and hinted at two big flaws:

• On such a task, annotators tend to become insensitive to potentially abusive comments after a few annotations.

• A Likert scale annotation setup is not very efficient for such skewed data, since the chance agreement will always be very high, given the dominance of the non-abusive class.

FIGURE 3.2: Sample questionnaire for pilot 2

3.2.2 Pilot 3 & 4

To overcome the flaws noticed in pilot 2, we made two major changes in pilot 3:

1. The annotation setup was changed to a comparative annotation setup, to overcome the issue of annotator desensitization.

2. Comments coming from the random subreddits were boosted using positive and negative sentiment keywords from the NRC Emotion Lexicon (EmoLex) (Mohammad and Turney, 2013).

Apart from these changes, we also made reading the context for the target comment optional. The context was now provided in a collapsible box, which could be expanded by the annotators when required.

EmoLex is a list of English words and their associations with eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) and two sentiments (negative and positive). EmoLex consists of 14,182 words labeled as positive or negative in the sentiment dimension. We use these keywords to score the comment threads obtained from the subreddits. For each thread t, the positive sentiment score and negative sentiment score are computed as follows:

    positive sentiment score(t) = (no. of positive sentiment keywords in the comment thread) / (no. of words in the comment thread that are also in EmoLex)

and similarly for negative sentiment score(t). The threads are then ranked according to their positive sentiment score and negative sentiment score, and the 5 top-ranking threads were taken from each ranking. Comments from the rest of the subreddits were the same as those used in the previous pilots, resulting in a total of 3000 comments.
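A sketch of this thread-scoring step, assuming the EmoLex positive and negative word lists have already been loaded; the tiny stand-in word sets and whitespace tokenization are illustrative assumptions.

```python
# Sentiment scoring of a comment thread with EmoLex keywords. The word sets
# below are stand-ins; the real lexicon has 14,182 words labeled for
# positive/negative sentiment.
POSITIVE = {"good", "great", "love", "helpful"}     # stand-in for EmoLex positives
NEGATIVE = {"hate", "terrible", "stupid", "awful"}  # stand-in for EmoLex negatives

def sentiment_scores(thread_comments):
    tokens = [w.lower() for c in thread_comments for w in c.split()]
    in_lexicon = [w for w in tokens if w in POSITIVE or w in NEGATIVE]
    if not in_lexicon:
        return 0.0, 0.0
    pos = sum(w in POSITIVE for w in in_lexicon) / len(in_lexicon)
    neg = sum(w in NEGATIVE for w in in_lexicon) / len(in_lexicon)
    return pos, neg

thread = ["I love this idea, really helpful", "this is a stupid take"]
print(sentiment_scores(thread))  # (0.666..., 0.333...)
```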

To employ comparative annotation, we used the BWS setup. Annotators were presented with 4 comments and asked which of the 4 is the most offensive and which is the least offensive. Each such 4-tuple was annotated by 6 different annotators. Figure 3.3 shows a sample questionnaire for pilot 3. In a BWS setup, the reliability of annotations is measured by split-half reliability (SHR) scores. We obtained an SHR Spearman correlation of 0.0337 ± 0.0127 and an SHR Pearson correlation of 0.0376 ± 0.0127, which is extremely low.

FIGURE 3.3: Sample questionnaire for pilot 3

Pilot 4 was a simple update over pilot 3. In pilot 4 we simplified our instructions to the annotators and made the task clearer. We tested our setup on 1000 comments. We no longer provided the context to the annotators, for two reasons:

• the context could be distracting the annotators from the target comment;

• in pilot 3 we had implemented a check and observed that the context was clicked for only 1% of the total annotations.

However, we still saw no improvements, with the SHR Spearman correlation being 0.0807 ± 0.0203 and the SHR Pearson correlation being 0.0940 ± 0.0211.

Successive pilots with multiple setups showed us that the problem did not lie with the annotation scheme but with the nature of our data: as the data is highly skewed towards the non-abusive class, we see low inter-annotator agreement and less reliable annotations.

3.2.3 Pilot 5 – Aggressive Pilot

To overcome the issues of class imbalance, we boosted our comments sample aggres-sively to obtain more number of abusive comments. For this pilot we took a total of 1000 comments which were sampled as follows:

• 20% of comments were sampled from the subreddit CMV using positive sentiment keywords from the EmoLex.

• 20% of comments were sampled from the subreddit CMV using negative sentiment keywords from the EmoLex.

• 30% of comments were sampled from the other subreddits using positive sentiment keywords from the EmoLex.

• 20% of comments were sampled from the other subreddits using negative sentiment keywords from the EmoLex.

• 10% of comments were those that were marked as abusive by the majority of annotators in pilot 1.

Pilot   Annotation scheme   No. of comments   Type of sampling                   No. of annotators per instance   Annotation reliability
1       Binary class        3000              Topic based                        8                                Fleiss κ: 0.29
2       Likert scale        3000              Topic based                        8                                Fleiss κ: 0.0014
3       Comparative         3000              Topic based, sentiment keywords    8                                SHR Spearman: 0.0337 ± 0.0127; SHR Pearson: 0.0376 ± 0.0127
4       Comparative         1000              Topic based, sentiment keywords    6                                SHR Spearman: 0.0807 ± 0.0203; SHR Pearson: 0.0940 ± 0.0211
5       Comparative         1000              Aggressive                         6                                SHR Spearman: 0.5452 ± 0.0160; SHR Pearson: 0.5525 ± 0.0161

TABLE 3.1: Pilots

In pilots 3 and 4, while using the top-ranking sentiment threads, we were still getting a large number of non-abusive comments. Therefore, unlike the previous pilots, in pilot 5 we score and rank individual comments and select the top-ranking comments. This pilot gave us promising results, with an SHR Spearman correlation of 0.5452 ± 0.0160 and an SHR Pearson correlation of 0.5525 ± 0.0161. We further improved the quality of our aggressive pilot data by simply removing annotations of tuples by workers with WQS < 0.2. We re-calculated the SHR values and an improvement was observed. The score distribution for this data is from -1 to 1. Table 3.3 shows the values for the full and improved versions of the data.
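The SHR values reported here can be estimated by repeatedly splitting the annotations of each tuple into two random halves, scoring the comments from each half, and correlating the two sets of scores. A minimal sketch, assuming the `bws_scores` helper from the earlier sketch and the same hypothetical response format, follows.

```python
# Sketch of split-half reliability (SHR) for BWS annotations.
# Assumes the bws_scores helper from the earlier sketch is in scope.
import random
from collections import defaultdict

import numpy as np
from scipy.stats import pearsonr, spearmanr

def split_half_reliability(responses, n_trials=100, seed=0):
    """responses: list of (tuple_items, best, worst) records, several per tuple.
    Returns ((mean, std) Pearson, (mean, std) Spearman) over random splits."""
    rng = random.Random(seed)
    by_tuple = defaultdict(list)
    for resp in responses:
        by_tuple[tuple(resp[0])].append(resp)
    pears, spears = [], []
    for _ in range(n_trials):
        half_a, half_b = [], []
        for tuple_resps in by_tuple.values():
            shuffled = tuple_resps[:]
            rng.shuffle(shuffled)
            mid = len(shuffled) // 2
            half_a.extend(shuffled[:mid])
            half_b.extend(shuffled[mid:])
        scores_a, scores_b = bws_scores(half_a), bws_scores(half_b)
        common = sorted(set(scores_a) & set(scores_b))
        a = [scores_a[i] for i in common]
        b = [scores_b[i] for i in common]
        pears.append(pearsonr(a, b)[0])
        spears.append(spearmanr(a, b)[0])
    return (np.mean(pears), np.std(pears)), (np.mean(spears), np.std(spears))
```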

Table 3.1 summarizes the results and values for the 5 pilots. Pilot 5 paved the way for the final round of annotation. Along with our final dataset, in Chapter 4 we report the performance of our computational models on the pilot 5 data as well.

3.3 Final Annotation Round

Over the five incremental pilots described above, we made various changes to the annotation instructions, the comments sampling technique, the annotation scheme, the interface provided to the annotators, and other factors, all of which proved helpful in the success of the Aggressive pilot. This incremental approach not only helped us remove loopholes and drawbacks in the setup, but also gave us a better understanding of offensive language and the task at hand. Based on the success of the Aggressive pilot, we decided to use the comparative annotation scheme and an aggressive mode of sampling potentially offensive comments. In this section, we discuss in depth the comments sampling strategy and the BWS setup used for the final round of annotation.

3.3.1 Comments Sampling

Reddit data can be easily extracted using Google BigQuery from the Pushshift Reddit Dataset (Baumgartner et al., 2020). The Pushshift Reddit Dataset, made available for research, is updated in real time and includes historical data back to Reddit's inception. The dataset has added functionalities that aid in searching, aggregating, and performing exploratory analysis of the entire Reddit data.
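For illustration, a query along the following lines could be used to pull CMV comments. The project and table name (`fh-bigquery.reddit_comments.*`) refer to one public BigQuery mirror of the Reddit data and, together with the column names, are assumptions that may differ from the exact tables queried for this work.

```python
# Hedged sketch of extracting CMV comments via Google BigQuery.
# Table and column names are assumptions about a public Reddit mirror and may
# not match the exact tables used for this dataset.
from google.cloud import bigquery

client = bigquery.Client()  # requires Google Cloud credentials

query = """
SELECT id, parent_id, link_id, author, created_utc, body
FROM `fh-bigquery.reddit_comments.2019_09`
WHERE subreddit = 'changemyview'
"""

comments = client.query(query).to_dataframe()  # pandas DataFrame of comments
print(len(comments), "comments retrieved")
```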


We base our entire dataset on the CMV subreddit owing to the subreddit's diverse nature. As the name of the subreddit suggests, members post and comment on conflicting viewpoints. This can easily lead to heated discussions, often resulting in the use of offensive language. As the subreddit is not based on a particular topic, we can avoid any topic bias in our dataset. From CMV, we choose the posts based on certain criteria:

1. Date: To extract comments on the most relevant current topics, we take comments from the time period of January 2015 to September 2019 (the last available month at the time of extraction).

2. Thread length: We choose posts with more than 150 comments and fewer than 5000 comments.

3. Post length: We choose posts containing more than 5 words and fewer than 60 words in the post body.

4. URL: We choose posts with at most one URL in them.

This returns 1823 posts. We then reconstruct the hierarchical thread for each post using the Anytree Python library. A post can have a large number of comments. To include the maximum number of posts, we limit the number of comments taken from each post to 50: we take the first 25 and the last 25 comments per post. With the aim of creating a dataset of 10,000 comments, we filter comments from these posts based on the following criteria (a sketch of the reconstruction and filtering is given after the list):

1. Comment length: We choose comments containing more than 5 words and fewer than 250 words in the comment body.

2. No. of users: In the first and last 25 comments of the thread, we ensure participation of at least 4 users.

This forms an initial pool of 60,200 comments from which we will now select 10,000 comments for our dataset.
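The thread reconstruction and comment filtering described above could be sketched as follows. The field names follow the Reddit data model, and the choice of pre-order traversal for flattening a thread is an assumption; the exact implementation used here may differ.

```python
# Sketch of thread reconstruction with anytree and the comment-level filtering
# described above (field names follow the Reddit data model; details may differ).
from anytree import Node, PreOrderIter

def build_thread(post_id, comments):
    """comments: list of dicts with 'id', 'parent_id', 'author', 'body'.
    Returns the root Node of the reconstructed comment tree for one post."""
    root = Node(post_id, author=None, body=None)
    nodes = {post_id: root}
    # Reddit parent_id values carry 't1_'/'t3_' prefixes; strip them here.
    for c in sorted(comments, key=lambda c: c.get("created_utc", 0)):
        parent = nodes.get(c["parent_id"].split("_")[-1], root)
        nodes[c["id"]] = Node(c["id"], parent=parent,
                              author=c["author"], body=c["body"])
    return root

def select_comments(root, max_per_post=50):
    """Take the first and last 25 comments of the flattened thread,
    then apply the length and participation criteria."""
    ordered = [n for n in PreOrderIter(root) if n.body is not None]
    half = max_per_post // 2
    if len(ordered) > max_per_post:
        ordered = ordered[:half] + ordered[-half:]
    if len({n.author for n in ordered}) < 4:   # require at least 4 distinct users
        return []
    return [n for n in ordered if 5 < len(n.body.split()) < 250]
```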

Emotions are highly representative of an individual's mental state, associated with thoughts, feelings, and behaviour (Poria et al., 2019). People use offensive language to express emotions, especially anger and frustration. Swear words are apt for communicating feelings, as their primary implications are connotative (Jay and Janschewitz, 2008). Several studies have shown that the primary dimensions of word meaning are valence, arousal, and dominance (VAD) (Osgood, Suci, and Tenenbaum, 1957; Russell, 1980; Russell, 2003).

• The valence dimension indicates positive–negative or pleasure–displeasure emotion.

• The arousal dimension indicates excited–calm or active–passive emotion.

• The dominance dimension indicates powerful–weak or 'have full control'–'have no control' emotion (Mohammad, 2018).

Therefore, to boost the representation of abusive comments in our dataset, we use emotion keywords from the NRC VAD lexicon (Mohammad, 2018). The NRC VAD lexicon is a list of 20,000 English words, each with a real-valued score between 0 and 1 in the V, A, and D dimensions. To find abusive comments that represent strong emotions, we divide the words in the NRC VAD lexicon into 6 emotion dictionaries, 2 for each dimension. The dictionaries are as follows:



• Low valence: words with valence score ≤ 0.25

• High valence: words with valence score ≥ 0.75

• Low arousal: words with arousal score ≤ 0.25

• High arousal: words with arousal score ≥ 0.75

• Low dominance: words with dominance score ≤ 0.25

• High dominance: words with dominance score ≥ 0.75

We then obtain 6 scores for each of the 1000 comments in the aggressive pilot data using the dictionaries described above. The emotion score of a comment $i$ for an emotion dimension $e$ is calculated as:

\[
\text{score}_{e,i} = \frac{1}{n} \sum_{j=1}^{n} \text{score}_e(w_j)
\]

where $n$ is the number of words in comment $i$ that also appear in the emotion $e$ dictionary, and $\text{score}_e(w_j)$ is the score of word $w_j$ from the comment in the emotion $e$ dictionary.
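A minimal sketch of building the six threshold dictionaries and computing these per-comment emotion scores is given below, assuming `vad_lexicon` is a dictionary mapping each word of the NRC VAD lexicon to its (valence, arousal, dominance) scores; this is an illustration rather than the exact code used here.

```python
# Sketch: build the six threshold dictionaries from the NRC VAD lexicon and
# compute the per-comment emotion scores defined above.
# `vad_lexicon` is an assumed dict: word -> (valence, arousal, dominance).

def build_dictionaries(vad_lexicon, low=0.25, high=0.75):
    dims = {"valence": 0, "arousal": 1, "dominance": 2}
    dicts = {}
    for name, idx in dims.items():
        dicts[f"low_{name}"] = {w: s[idx] for w, s in vad_lexicon.items() if s[idx] <= low}
        dicts[f"high_{name}"] = {w: s[idx] for w, s in vad_lexicon.items() if s[idx] >= high}
    return dicts

def emotion_score(comment_tokens, emotion_dict):
    """Average score of the comment's words that appear in the emotion dictionary."""
    matched = [emotion_dict[w] for w in comment_tokens if w in emotion_dict]
    return sum(matched) / len(matched) if matched else 0.0
```

Computing these six scores for every comment and correlating each of them with the BWS offensiveness scores (for example with scipy.stats.pearsonr) yields correlations of the kind reported in Table 3.2.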

We also obtain the degree-of-offensiveness score for each comment from the BWS setup. We calculate the Pearson correlation between these offensiveness scores and each of the 6 emotion scores obtained above. The results are shown in Table 3.2. Figure 3.4 shows scatter plots of the offensiveness scores against the emotion scores.

Pearson correlation of offensiveness with    Pearson's correlation coefficient
High valence words                           0.0291
Low valence words                            0.3311
High arousal words                           0.3051
Low arousal words                            0.1223
High dominance words                         0.1167
Low dominance words                          0.2281

TABLE 3.2: Pearson correlations between offensiveness scores and emotion scores obtained on the aggressive pilot data

From Table 3.2 we can see that low valence and high arousal words give us the strongest signal for the degree of offensiveness of a comment. Low valence represents negative polarity, relating to emotions of fear and anger, while high arousal represents powerful feelings. Therefore, using these low valence and high arousal keywords, we build our final dataset of 10,000 comments from the pool of 60,200 comments obtained previously. We select the 10,000 comments as follows:

• we choose the 33% (3300) top-scoring low valence comments,

• we choose the 33% (3300) top-scoring high arousal comments, and

• the remaining 34% (3400) comments are chosen randomly, to mitigate the effects of keyword sampling bias.
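A sketch of this selection, assuming the low-valence and high-arousal scores have already been computed per comment and that any overlap between the two top-scoring sets is simply filled with extra random comments, is given below.

```python
# Sketch of the final 10,000-comment selection (proportions as described above).
import random

def select_final_pool(candidates, low_valence, high_arousal, seed=0):
    """candidates: list of comment ids; low_valence / high_arousal: dicts
    mapping comment id -> score. Returns a list of 10,000 selected ids."""
    by_low_valence = sorted(candidates, key=lambda c: low_valence.get(c, 0.0), reverse=True)
    by_high_arousal = sorted(candidates, key=lambda c: high_arousal.get(c, 0.0), reverse=True)

    # Deduplicate while preserving order; overlap is topped up with random picks.
    selected = list(dict.fromkeys(by_low_valence[:3300] + by_high_arousal[:3300]))
    chosen = set(selected)
    remaining = [c for c in candidates if c not in chosen]
    rng = random.Random(seed)
    selected += rng.sample(remaining, 10000 - len(selected))
    return selected
```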

FIGURE 3.4: Emotion–offensiveness score scatter plots for all comments from the aggressive pilot data

From the pilot annotation rounds we make a key observation: when the number of non-offensive comments is extremely high, the annotators start marking randomly, since most of the tuples do not exhibit any offensive comment. To address this issue, we add 1000 abusive comments from the Waseem and Hovy (2016) dataset. These comments act only as a moderating mechanism to prevent annotators from slacking off; they are later discarded and not used as part of our dataset.

3.3.2 Annotations

A sample questionnaire for the final annotation task is shown in Figure 3.5.

FIGURE 3.5: Sample questionnaire for the final annotation task

We carry out the annotation of the entire dataset in 5 parts. We keep a close watch on the quality of annotations we receive while the task is live and immediately block low quality annotators. We block workers with WQS < 0.3. We also do quality checks after each part, and annotators whose annotations differ greatly from those of others are blocked from participating in the subsequent parts.

Using the annotations, we get a score range of -0.542 (least offensive) to 0.521 (most offensive). Table 3.3 shows the SHR values for the annotations on our dataset.

3.3.3 Improving The Quality of Our Dataset

To improve the quality of our dataset, we re-compute the WQS values over the entire dataset (5 parts combined) for all workers who participated in our task. We check the distribution of annotations for each tuple after removing annotations by workers with WQS < 0.3. Figure 3.6 shows the distribution of the remaining per-tuple annotations; 34% of the tuples are left with zero annotations. We then carry out re-annotation of these tuples, by 6 annotators each, on AMT. These annotations are added to the main set of annotations and we calculate the WQS values again. We again check the distribution of annotations for each tuple after removing the annotations by workers with WQS < 0.3. Figure 3.7 shows the distribution of the remaining per-tuple annotations. We can see that the percentage of tuples with zero annotations has reduced to 20%.
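This quality check could be implemented roughly as follows, assuming each annotation record carries the worker ID and that the worker quality scores are available in a `wqs` dictionary; both assumptions are for illustration only.

```python
# Sketch: drop annotations from low-quality workers and inspect how many
# annotations each 4-tuple retains (as in Figures 3.6 and 3.7).
from collections import Counter

def per_tuple_counts(annotations, wqs, threshold=0.3):
    """annotations: list of (tuple_id, worker_id, best, worst) records.
    wqs: dict worker_id -> worker quality score."""
    kept = [a for a in annotations if wqs.get(a[1], 0.0) >= threshold]
    counts = Counter(a[0] for a in kept)
    all_tuples = {a[0] for a in annotations}
    zero = len(all_tuples) - len(counts)
    print(f"{100 * zero / len(all_tuples):.1f}% of tuples left with zero annotations")
    return counts
```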

FIGURE 3.6: Distribution of per-tuple annotations on the initial set of annotations, after removing low quality annotations

To further improve the quality, we combine the re-annotations and the initial set of full annotations. We remove items (comments) with high disagreement in their annotations, which could be due to unreliable annotators who marked answers randomly. To do this, we proceed as follows (a sketch of this filtering follows the steps):

1. For each tuple, we calculate the average agreement on the best annotation (AvgBest) and the average agreement on the worst annotation (AvgWorst).

2. We take the average of AvgBest and AvgWorst to compute the average agreement on responses for this tuple. We call this AvgT.

3. For each item, we then compute the average of the AvgT scores over all tuples the item occurs in. We call this AvgItem.
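Steps 1–3 could be implemented roughly as in the sketch below. Interpreting "agreement on best" as the fraction of annotators who chose the modal best answer (and analogously for worst), as well as the final cut-off on AvgItem, are assumptions made for illustration.

```python
# Sketch of the AvgBest / AvgWorst / AvgT / AvgItem computation (steps 1-3).
# Agreement here is taken as the fraction of annotators choosing the modal
# answer; this and the cut-off value are assumptions for illustration.
from collections import Counter, defaultdict

def item_agreement(annotations):
    """annotations: list of (tuple_items, best, worst) records, several per tuple."""
    by_tuple = defaultdict(list)
    for items, best, worst in annotations:
        by_tuple[tuple(items)].append((best, worst))

    avg_item_parts = defaultdict(list)
    for items, resps in by_tuple.items():
        n = len(resps)
        avg_best = Counter(b for b, _ in resps).most_common(1)[0][1] / n
        avg_worst = Counter(w for _, w in resps).most_common(1)[0][1] / n
        avg_t = (avg_best + avg_worst) / 2
        for item in items:
            avg_item_parts[item].append(avg_t)

    return {item: sum(v) / len(v) for item, v in avg_item_parts.items()}

def filter_items(annotations, cutoff=0.5):
    """Keep items whose AvgItem is at least `cutoff` (cut-off value is assumed)."""
    avg_item = item_agreement(annotations)
    return {item for item, score in avg_item.items() if score >= cutoff}
```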



FIGURE 3.7: Distribution of per-tuple annotations on the initial plus secondary set of annotations, after removing low quality annotations

This helps us remove noise in the scores and ranking of comments in our dataset. We are left with 5,135 comments out of the 10,000 comments we started with. We re-calculate the SHR values on the remaining items only, and observe an improvement. Table 3.3 shows the SHR values. The improvement in the SHR values confirms the improvement in the quality of our dataset. Figure 3.8 shows a histogram of the comment degree-of-offensiveness scores over 40 equi-spaced bins of size 0.05. We can observe an approximately normal distribution.

                           Offensive Language Dataset          Aggressive Pilot Dataset
                           Full              Improved          Full              Improved
No. of comments            10,000            5,135             1,000             1,000
Score range                -0.542 to 0.521   -0.542 to 0.521   -0.5 to 0.5       -1 to 1
SHR Pearson correlation    0.4219 ± 0.0057   0.4913 ± 0.0071   0.5525 ± 0.0161   0.8186 ± 0.0071
SHR Spearman correlation   0.4044 ± 0.0056   0.4774 ± 0.0071   0.5452 ± 0.0160   0.8073 ± 0.0077

TABLE 3.3: Results summary on the offensive language dataset and the aggressive pilot dataset

3.3.4 Analysis

Earlier, as a moderation mechanism, we introduced an additional 1000 abusive comments from the Waseem and Hovy (2016) (WH) dataset into our pool of 10,000 comments. After discarding items to improve the quality of our dataset, we were left with only 544 of these comments. Though all 1000 items from the WH dataset are removed from our final dataset, we analyse the distribution of their scores to get a sense of the quality of the dataset.



FIGURE 3.8: Distribution of scores obtained from BWS on our dataset

As the comments taken from the WH dataset were all from the abusive class (500 from racism and 500 from sexism), we expect the distribution of scores over these items to be skewed towards the most offensive score. Figure 3.9 shows the distribution of scores for the WH comments. The figure shows a roughly normal distribution, with only a slight skew of scores towards the positive side. This means that a large number of comments marked as abusive in the WH dataset are not marked as abusive in our annotation setup. To examine this, we check the WH comments which received the lowest scores in our setup. Table 3.4 shows a few examples along with their offensiveness scores. We can see that these comments are not actually abusive in nature, which shows the efficacy of the annotation setup we use. We also check the WH comments which received high scores in our setup. Table 3.5 shows a few examples along with their offensiveness scores. We can see a fine gradation in the degree of offensiveness as we move down the scale. It is also interesting to note that the third comment, which has been categorized as "sexism" in the WH dataset, does not show any traits of sexism. This can be confusing for machine learning or deep learning models, as such comments would appear together with comments which are actually sexist in nature. Learning such noisy features can lead to incorrect classification. However, when presented with varying degrees of offensiveness, such comments will be learned differently by the models. This shows that it is better to represent the offensiveness of a comment with a score in a continuous domain (a degree) rather than with discrete classes.

3.3.5 Conclusion

In this chapter we describe how we create the Offensive Language Dataset. The annotations are crowdsourced, capturing the degree of offensiveness for comments on social media (Reddit). To overcome the limitations of existing datasets, we create a dataset with fine-grained scores using the BWS setup. This setup overcomes the shortcomings of a Likert or rating scale setup for such a task by providing more reliable, fine-grained judgements of offensiveness.
