Contextual Feature Abuse Detection
The Use Of Author And Parent Micropost Features
For Abuse Detection Datasets
L.E.D. (Louis) de Bruijn
Master Thesis Information Science (Informatiekunde)
Louis Emile Deodatus de Bruijn s2726327
1 ABSTRACT
In recent years, the automatic detection of online abusive language on social media has consolidated into a popular research direction. However, many of these systems make use of unbalanced and biased datasets. First, this research provides a critical overview of these biases and subsequently tries to account for them in the creation of a newly annotated abusive language dataset on the Twitter domain. Second, this research investigates the inclusion of contextual features in the collection, annotation, and classification process.
This research presents a dataset with a total of 10,150 annotated tweets for a first binary classification of abuse/no-abuse and a second, more fine-grained binary classification of explicit and implicit abuse. The dataset has two main characteristics: (1) it solely contains tweet @replies, with the inclusion of the previous tweets, and (2) it contains many offensive swear words, representing the informal Internet speech found abundantly on Twitter.
The performance of a simple supervised machine learning model in a cross-dataset experimental design has shown that the dataset generalizes the concept of abuse well, but the subtypes of explicit and implicit abuse less so. A distinction made in this research is that offensiveness does not imply abusiveness. Future research can use this dataset to further investigate the relationship between abuse and offensiveness in an @replies network.
WARNING: This paper contains tweet examples and words that are offensive in nature.
CONTENTS

1 abstract
2 introduction
3 related work
   3.1 Abusive Language
   3.2 Bias
   3.3 Best Practices For Dataset Creation
4 data
   4.1 Datasets
       4.1.1 HatEval
       4.1.2 OffensEval
       4.1.3 AbusEval
   4.2 Collection
       4.2.1 Twitter API
       4.2.2 Class Label Conversion
       4.2.3 Data Distribution
       4.2.4 Processing Constraints
   4.3 Annotation
       4.3.1 Abusiveness: Annotators; Inter-Annotator Agreement; Gold Label Annotation; Pre-annotated & Gold Labels
       4.3.2 Explicitness
       4.3.3 Bias Reduction: Author Bias; Previous Tweet Bias; Creation Date Bias; Tweet Text Bias
5 methods and experiments
   5.1 Methodology
       5.1.1 Support Vector Machine Model
       5.1.2 Vectorizer
       5.1.3 Feature Pipeline
       5.1.4 Train/Test Split
   5.2 Fine-tuning Experiments
       5.2.1 Vectorizer
       5.2.2 Preprocessing
       5.2.3 Features
   5.3 Best-performing Experiments
   5.4 Cross-dataset Experiments
   5.5 Most Informative Tokens
   5.6 Previous Tweet Annotation
6 discussion
   6.1 Fine-tuning Experiments
   6.2 Cross-dataset Experiments
       6.2.1 HatEval
       6.2.2 OffensEval
       6.2.3 AbusEval
   6.3 Most Informative Tokens
   6.4 Annotation Process
   6.5 Boosting Abusive Content
7 conclusion
   7.1 The Most Relevant Biases
   7.2 Controlling For Bias In Data Creation
   7.3 Including Context-related Features
   7.4 Limitations
   7.5 Future Work
8 acknowledgements
Appendices
a database objects
   a.1 Tweet and TwitterUser Object Features
   a.2 Relational Database Scheme
b inter-annotator agreement
c annotator demographic of the annotation
d annotation guidelines
   d.1 Abuse Annotation Guidelines
   d.2 Explicit Annotation Guidelines
e tweet creation distribution
   e.1 Abusive and non-abusive
   e.2 Explicit and implicit
f example tweet excerpts
   f.1 HatEval Example Tweets
   f.2 OffensEval 2020 Training Example Tweets
g hypertune parameters results
   g.1 Vectorizer Parameters
   g.2 Preprocessing Parameters
   g.3 Features Performance
h hypertuned experiments results
i cross-dataset results
   i.1 HatEval
   i.2 OffensEval
2 INTRODUCTION
As the public debate about violence, racism, misogyny, and hate has resurfaced in the last couple of months, it has also extended to the Internet. As a result, the detection of abusive language in user-generated online content has become an issue of increasing importance. For instance, organizers of the Stop Hate For Profit movement1 have recently initiated a campaign to put a halt to the incitement of violence prevalent on Facebook and have gained tremendous support, both in the United States and the Netherlands.2
Because of the vast quantity of user-generated content on these social media platforms, manually checking all possibly abusive content has proven unfeasible (Schmidt and Wiegand, 2017). The Natural Language Processing (NLP) community has responded with the development of several classification systems for the automatic detection of abusive language (Jurgens et al., 2019). These supervised machine learning systems require an annotated dataset with both abusive and benign microposts to train on.
Online abusive language is commonly defined as "an omnibus term that includes hurtful, derogatory or obscene utterances expressed towards a targeted group or specific members of that group" (Davidson et al., 2017; Jurgens et al., 2019; Wiegand et al., 2019).
Because abusive language is an elusive phenomenon that occurs very sparsely, with estimates ranging from 0.1 to 3% of tweets containing abusive content, randomly sampling microposts for these datasets would result in too little abusive content: a random sample of 10,000 tweets would be expected to contain only between 10 and 300 abusive ones (Schmidt and Wiegand, 2017; Founta et al., 2018; Ribeiro et al., 2018). Many of these datasets are therefore collected via some sort of focused sampling method, introducing a wide variety of biases that often go undetected (Wiegand et al., 2019). Topic bias, domain bias, bias towards subtypes of abuse, such as explicitly abusive content, and author bias are all present. The lack of a clear definition, a typology of the different subtasks, and well-documented annotation guidelines further deteriorates the objectivity of these datasets (Ribeiro et al., 2018; Waseem et al., 2017). Waseem et al. (2017) have proposed a two-fold typology to synthesize the different subtasks under the umbrella term of online abuse. The typology considers first whether the abuse is directed at a specific target or generalized, and second the degree to which it is explicit.
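The two axes of this typology can be represented as a pair of binary labels per micropost. A minimal sketch follows; the example tweets and their labels are invented for illustration and are not taken from any of the datasets discussed here.

```python
# The two-fold typology of Waseem et al. (2017) as two binary axes:
# axis 1: directed at a specific target vs. generalized;
# axis 2: explicit (unambiguous hateful terms) vs. implicit (veiled).
from dataclasses import dataclass

@dataclass
class AbuseLabel:
    directed: bool  # True = aimed at a specific individual or entity
    explicit: bool  # True = uses unambiguous hateful terms

# Invented examples, for illustration only:
examples = {
    "@user you are a worthless idiot": AbuseLabel(directed=True, explicit=True),
    "people like @user should stay in the kitchen": AbuseLabel(directed=True, explicit=False),
}

for text, label in examples.items():
    kind = ("directed" if label.directed else "generalized") + ", " + \
           ("explicit" if label.explicit else "implicit")
    print(f"{kind}: {text}")
```

Keeping the two axes independent makes it possible to annotate and evaluate explicitness separately from targeting, which is how the fine-grained subtasks in this thesis are set up.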
Even though several surveys have outlined the importance of these biases, most prominently the paper by Wiegand et al. (2019), the field has yet to benefit from a less-biased benchmark dataset that controls for them (Schmidt and Wiegand, 2017). Previous work suggests taking knowledge-based contextual information into account in the collection, annotation, and classification process (Schmidt and Wiegand, 2017; Wiegand et al., 2019). This research aims to identify the most relevant biases in abusive language datasets in order to create a more balanced dataset, and to investigate the performance of including context-related features in both the annotation process and the experimental settings.
The research questions of this study are:
1. What are the most relevant biases in abusive language datasets?
2. How are the most relevant biases controlled for in data creation?
3. Does the inclusion of context-related features improve annotations and automatic classification scores?

1 https://www.stophateforprofit.org/
2 https://www.nrc.nl/nieuws/2020/07/02/stop-hate-for-profit-heineken-tomtom-boycotten-facebook-a4004704
This paper follows the typology provided by Waseem et al. (2017) and presents a newly annotated dataset, the @replies dataset, of around 10,000 microposts on Twitter for a first binary classification of abusiveness and a second, fine-grained binary classification of the explicitness of the abuse.
The body of the research concerns the identification of several biases in this dataset and evaluates classification scores in a cross-dataset experimental design of previously collected datasets from two shared tasks of the International Workshop on Semantic Evaluation: HatEval and OffensEval. The @replies dataset is focused on contextual features, most importantly the previous or parent tweet that tweets are responding to and the user description. Responding to other microposts is a key characteristic of social media platforms and an integral part of the spread of knowledge through social media networks. This study aims to further develop the novel research direction of including these contextual features in the annotation and classification process.
To answer our research questions, it is useful to discuss the previous literature concerning the different forms of bias and the complexity of annotating abuse in the next chapter. Chapter 4 provides a description of the data and material. Chapter 5 provides a detailed overview of the methodology, including data collection, annotation, processing, and experimental design, and presents the results of the different experiments. Chapter 6 thoroughly discusses the implications and findings. Chapter 7 concludes the findings, answers the research questions, discusses the study's limitations, and proposes future research directions.
3 RELATED WORK
This chapter presents related work in the field of abusive language detection. First, it details the inherent difficulties of online abusive language. Second, it discusses the field's current advances and datasets for the automatic detection of online abusive language. Third, it sharply defines online abusive language, its target, and the degree to which it is explicit. Next, it introduces several biases identified in the field: annotation bias, bias in explicit and implicit markers, topic bias, and author bias, as well as several ways to reduce these biases. Finally, an evaluation of these biases is discussed.
3.1 abusive language
Detection of abusive language in user-generated online content becomes more important as social media platforms have come under increasing public and governmental pressure to tackle the issue of online abuse (Schmidt and Wiegand, 2017; Waseem et al., 2017). The prevalence of online abuse has shown that social media platforms can be used to incite mobs of people to violence, as extreme ideas have a tendency to spread throughout social networks (Brady et al., 2017). Moreover, recent surveys indicate that online abusive behavior happens much more frequently than previously suspected (Jurgens et al., 2019).
Nonetheless, online abusive language remains an elusive phenomenon that occurs very sparsely, with estimates ranging from 0.1 to 3% of tweets containing abusive content (Founta et al., 2018; Ribeiro et al., 2018). Because of this, there is a very skewed distribution between benign and abusive microposts (Schmidt and Wiegand, 2017). Even though abusive content occurs sparsely, manually inspecting all possibly abusive content in online social media networks is unfeasible (Schmidt and Wiegand, 2017).
The NLP community has responded by developing several classification systems for the automatic identification of abusive language (Jurgens et al., 2019). These supervised machine learning systems require an annotated dataset containing both abusive and benign microposts. Random sampling for these datasets would result in too little abusive content, which is why many choose some sort of focused sampling method (Schmidt and Wiegand, 2017). These methods introduce a variety of biases that often go undetected. Addressing these biases results in much lower classification scores for these systems (Wiegand et al., 2019).
Online abusive language is a very broad concept whose categorization has proven difficult: a broad categorization is likely too computationally inefficient, yet a narrow categorization risks further reinforcing harm to affected community members (Jurgens et al., 2019). Detection of hate speech, derogatory language, and cyberbullying are all different subtasks that have been grouped under the umbrella term of online abusive language (Waseem et al., 2017). The lack of a general definition has resulted in contradictory annotation guidelines, making it difficult to correctly compare their results (Waseem et al., 2017; Schmidt and Wiegand, 2017). For instance, some microposts identified as hate speech by Waseem and Hovy (2016) are only considered offensive by Nobata et al. (2016) and Davidson et al. (2017).
Online abusive language is commonly defined as "an omnibus term that includes hurtful, derogatory or obscene utterances expressed towards a targeted group or specific members of that group" (Davidson et al., 2017; Jurgens et al., 2019; Wiegand et al., 2019).
Waseem et al. (2017) have proposed a typology to avoid the contradictions and synthesize the different subtasks under the umbrella term of online abuse. This two-fold typology considers first whether the abuse is directed at a specific target or generalized, and second the degree to which it is explicit. Explicit abusive language is that which is unambiguous in its potential to be abusive, using hateful terms (Barthes, 1957). Implicit abusive language is that which does not immediately imply or denote abuse, often obscuring its true nature by the lack of hateful terms (Waseem et al., 2017).
Schmidt and Wiegand (2017) mention the lack of a benchmark dataset and suggest that the community would benefit from a dataset built on a commonly accepted definition of the task. They conclude that it remains to be seen whether abusive language classification methods can solve the problems posed, or only certain subtypes of abuse.
3.2 bias
The following biases have been identified in the literature:

• Contradictory annotation guidelines as a result of a loosely defined concept of online abuse and vague documentation.

• Topic bias as a result of selective sampling, such as bias towards certain subtypes of abuse such as misogyny, racism or sexism, or towards certain hashtags on Twitter or topics such as American politics or LGBTQ communities.
• Overemphasis on explicit forms of abuse as a result of selective sampling
methods via an abusive language lexicon.
• Underexposure of implicit forms of abuse as a result of previously mentioned
sampling methods and the inherent difficulties of identifying and annotating implicit abuse.
• Author bias as a result of selective sampling, where only a small group of
authors accounts for a majority of the abuse classes, topics or subtypes of abuse.
Several studies consider the importance of bias introduced in the domain of online abusive language detection and the considerations when interpreting the results of these studies (Klubička and Fernández, 2018; Schmidt and Wiegand, 2017; Wiegand et al., 2019). In particular, Wiegand et al. (2019) show that many datasets model the bias instead of the phenomenon of abusive language, and that classification scores on popular datasets are much lower under realistic settings in which the bias is reduced.
Apart from issues with contradictory annotation guidelines due to the lack of a proper definition, these guidelines often remain fairly vague and undocumented (Schmidt and Wiegand, 2017). Several studies reflect on the difficulties of annotating online abuse, especially implicit forms of abuse (Waseem et al., 2017; Schmidt and Wiegand, 2017). As such, there seems to be little consensus on the definitions of the different subtasks among annotators, with Kappa scores ranging from 0.6κ to 0.8κ (Waseem et al., 2017). For instance, both Waseem and Hovy (2016) and Davidson et al. (2017) find that annotators consider racist or homophobic terms to be hateful, but consider words that are sexist and derogatory towards women to be only offensive. As a result of these considerations, several authors favor expert annotators with domain knowledge over the employment of non-expert annotators through services such as Amazon Mechanical Turk (Schmidt and Wiegand, 2017; Waseem et al., 2017; Nobata et al., 2016).
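Agreement figures such as the 0.6κ to 0.8κ reported above are chance-corrected scores like Cohen's kappa. A minimal, standard-library sketch of the computation for two annotators follows; the annotations are invented for illustration.

```python
# Cohen's kappa: observed agreement corrected for the agreement two
# annotators would reach by chance, given their label marginals.
from collections import Counter

def cohen_kappa(a, b):
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each annotator's marginals.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] / n * cb[l] / n for l in set(a) | set(b))
    return (observed - expected) / (1 - expected)

# Invented binary abuse/no-abuse annotations from two annotators:
ann1 = ["abuse", "abuse", "none", "none", "abuse", "none", "none", "abuse"]
ann2 = ["abuse", "none",  "none", "none", "abuse", "none", "abuse", "abuse"]
print(cohen_kappa(ann1, ann2))  # → 0.5
```

Here the annotators agree on 6 of 8 tweets (observed 0.75), but with balanced marginals chance agreement is 0.5, so kappa is only 0.5, which illustrates why raw percentage agreement overstates consensus on skewed tasks.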
Because of the skewed distribution of benign and abusive microposts, some form of selective sampling is necessary in order to model online abusive language. Several studies employ a list of explicit markers or topics (such as hashtags) of abuse to target microposts that are likely to be abusive (Klubička and Fernández, 2018; Waseem and Hovy, 2016). Explicit markers of abuse may be profanities, swear words or hate terms, sourced from dictionaries such as the popular Hatebase lexicon.1
Generally, this form of selective sampling introduces topic and domain bias, bias against certain subtypes of abuse, and bias towards a very direct, textual, and offensive form of explicit abuse, while ignoring implicit abuse (Klubička and Fernández, 2018; Schmidt and Wiegand, 2017; Ribeiro et al., 2018). Wiegand et al. (2019) compute the pointwise mutual information (PMI) for the six most-used datasets in the field to illustrate the coherence of certain words with classes. PMI is a metric to compare the most informative features of a class. What their paper undoubtedly shows is that almost all datasets have some form of topic bias. For instance, the highest-ranking abuse markers in the Waseem-dataset are commentator and football, making it extremely biased towards the topic of sports. Using explicit markers such as the hashtag WomanAgainstFeminism also biases a dataset towards a certain form of abuse, sexism, but leaves other forms of abuse unscathed.
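The PMI diagnostic described above can be sketched in a few lines. The toy corpus below is invented for illustration, and the unsmoothed estimate is a simplification of what Wiegand et al. (2019) actually compute; the point is only to show how a topic word that co-occurs with a class surfaces as a high-PMI marker.

```python
# PMI(token, class) = log2( P(token, class) / (P(token) * P(class)) ).
# A token that appears only in one class gets a high PMI for that class,
# which is how topic bias (e.g. "football" marking abuse) is detected.
import math

# Invented (text, label) pairs:
docs = [
    ("football commentator is an idiot", "abusive"),
    ("idiot driver cut me off", "abusive"),
    ("great football match today", "benign"),
    ("lovely weather today", "benign"),
]

def pmi(token, label, docs):
    n = len(docs)
    p_label = sum(1 for _, l in docs if l == label) / n
    p_token = sum(1 for t, _ in docs if token in t.split()) / n
    p_joint = sum(1 for t, l in docs if l == label and token in t.split()) / n
    return math.log2(p_joint / (p_token * p_label))

print(pmi("idiot", "abusive", docs))     # → 1.0  ("idiot" only in abusive docs)
print(pmi("football", "abusive", docs))  # → 0.0  (evenly spread over classes)
```

A dataset whose highest-PMI abuse markers are topical words rather than abusive ones is modeling its sampling strategy, not the phenomenon.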
On top of that, the underlying reasoning that abusive users frequently use explicit terms lacks evidence. Davidson et al. (2017) show that only 5% of the tweets collected via the Hatebase lexicon were coded as hate speech, demonstrating the imprecision of this lexicon. They conclude that the presence or absence of particular offensive terms can both help and hinder accurate classification. Fast et al. (2016) show that the vocabulary of hateful users differs from that of their normal counterparts: words related to masculinity, love, and curses occur more often. Ribeiro et al. (2018) compare hateful users to normal users, defining hateful users according to the Twitter guidelines: a hateful user is a user that endorses any type of content that "promotes violence against or directly attack or threaten other people on the basis of race, ethnicity, national origin, sexual orientation, gender, gender identity, religious affiliation, age, disability, or disease."2 Their study shows that hateful users use words related to hate, anger, shame, terrorism, violence, and sadness significantly less than non-hateful users (p-value < 0.001). This in turn questions the usage of bad-keywords lexicons in the data collection process.
Examples of implicit markers of abuse are condescension, minimization, benevolent stereotyping, sarcasm, euphemisms, and micro-aggressions, which are typically linguistically subtle (Waseem et al., 2017; Jurgens et al., 2019). As such, basic word filters do not provide a sufficient remedy for implicit abuse detection (Schmidt and Wiegand, 2017). Several studies note that implicit forms of online abuse are much harder to identify and annotate (Davidson et al., 2017; Ribeiro et al., 2018; Waseem et al., 2017). The literature overview provided by Jurgens et al. (2019) also concludes that implicit forms of abuse are not targeted by current NLP advances. Wiegand et al. (2019) have shown that datasets with a higher proportion of implicit abuse are more affected by the bias introduced in the data collection via focused sampling methods. In order to create a more balanced dataset that generalizes better towards the concept of online abusive language, implicit abuse should be targeted more.
If the set of tweets belonging to a certain class comes from the same or a specific group of authors, this introduces author bias. Ribeiro et al. (2018) and Klubička and Fernández (2018) found that the user distribution of abusive tweets in the Waseem-dataset is highly skewed. More than 70% of all sexist tweets originate from two authors and 99% of the racist tweets originate from a single author. Qian et al. (2018) and Mishra et al. (2018) suggest that author information might be relevant for abuse detection, but evaluated these claims solely on the Waseem-dataset, which is highly affected by author bias. Therefore, Ribeiro et al. (2018) conclude that these claims about relevant author information should be considered less predictive.

1 www.hatebase.org
2
3.3 best practices for dataset creation
Bias: Contradictory annotation guidelines
Proposed solution: Sharply define online abusive language and make use of the two-fold typology provided in Waseem et al. (2017).

Bias: Topic bias
Proposed solution: Avoid focused sampling via bad-keywords dictionaries or controversial microposts and verify the distribution of topics via the most informative features.

Bias: Overemphasis of explicit abuse
Proposed solution: Ribeiro et al. (2018) and Founta et al. (2018) propose focused sampling methods other than bad-keywords dictionaries.

Bias: Underexposure of implicit abuse
Proposed solution: Implicit forms of abuse should be targeted more because of the sparse nature of these forms even within abusive microposts.

Bias: Author bias
Proposed solution: The number of microposts per author should be restricted, and individual author characteristics such as location and gender should be analyzed and balanced.

Table 1: Identified biases and the proposed solutions provided in relevant literature.
The current body of literature on online abusive language proposes several methods aimed specifically at reducing the previously mentioned biases. The typology provided by Waseem et al. (2017) can be used in the data collection and annotation process in order to sharply define the subtasks used (Davidson et al., 2017).
Annotation guidelines should be developed that are based on a clear and accepted definition of online abusive language (Schmidt and Wiegand, 2017; Jurgens et al., 2019). A well-accepted definition constitutes online abusive language as "an omnibus term that includes hurtful, derogatory or obscene utterances expressed towards a targeted group or specific members of that group" (Davidson et al., 2017; Jurgens et al., 2019; Wiegand et al., 2019). Ribeiro et al. (2018) provide a very well-documented overview of the data collection and annotation process and guidelines, which helps reduce contradictions. Additionally, a more transparent documentation of the annotation process and of modeling strategies directed at different types of abuse increases the reproducibility of experiments (Waseem et al., 2017). Annotators could be given the entire profile of an author, instead of individual tweets (Ribeiro et al., 2018; Waseem et al., 2017). The improvement of annotation quality by extending such additional context is a promising research direction (Ribeiro et al., 2018).
Data scarcity should be addressed while minimizing the harm caused by data collection (Schmidt and Wiegand, 2017). Focused sampling methods other than bad-keywords dictionaries are proposed by Ribeiro et al. (2018), who sampled tweets via a semi-supervised network analysis of a set of seed users, and Founta et al. (2018), who used a random sample and applied some heuristics in order to boost the proportion of abusive microposts. During the data-collection process, both explicit and implicit forms of online abuse should be targeted equally (Jurgens et al., 2019). As a means to avoid topic bias, sources of training data should be sought that are hateful without necessarily using particular keywords or explicit offensive language (Davidson et al., 2017). Furthermore, the most informative features of a class should be verified, and the distribution of these features (terms) balanced via sampling additional microposts (Ribeiro et al., 2018).
In order to avoid author bias, the number of microposts per author should be restricted (Ribeiro et al., 2018). To create a more balanced user distribution, the focus should be on the analysis of individual characteristics such as location, gender, and age of the authors (Davidson et al., 2017; Waseem and Hovy, 2016).
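The per-author restriction above amounts to a simple cap during collection. A minimal sketch follows; the tweet tuples and the cap of two tweets per author are invented for illustration.

```python
# Cap the number of microposts kept per author to reduce author bias:
# once an author reaches the cap, their further tweets are skipped.
from collections import defaultdict

def cap_per_author(tweets, max_per_author=2):
    counts = defaultdict(int)
    kept = []
    for author, text in tweets:
        if counts[author] < max_per_author:
            counts[author] += 1
            kept.append((author, text))
    return kept

# Invented (author_id, text) tuples:
tweets = [("u1", "t1"), ("u1", "t2"), ("u1", "t3"), ("u2", "t4")]
print(cap_per_author(tweets))  # → [('u1', 't1'), ('u1', 't2'), ('u2', 't4')]
```

Without such a cap, a single prolific author can dominate an abuse class, as happened in the Waseem-dataset discussed above.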
These dimensions of bias have an effect on the performance of an abusive language detection system. Because of this, Wiegand et al. (2019) propose evaluating a classifier on a dataset different from the one it was trained on, as all classifiers mentioned in their paper perform worse compared to in-domain classification.
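The cross-dataset evaluation proposed by Wiegand et al. (2019) can be sketched with a standard scikit-learn pipeline: fit on one dataset, score on another. The two toy datasets and the bag-of-words linear SVM below are invented stand-ins, not the thesis's actual model or data.

```python
# Cross-dataset evaluation: train on dataset A, evaluate on dataset B,
# so that dataset-specific biases (topic, author) cannot inflate the score.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented "dataset A" used for training:
train_texts = ["you are an idiot", "have a nice day", "total moron", "great game"]
train_labels = ["abuse", "none", "abuse", "none"]

# Invented "dataset B", drawn from a different collection procedure:
test_texts = ["what an idiot", "nice weather"]
test_labels = ["abuse", "none"]

clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)
print(clf.score(test_texts, test_labels))  # out-of-domain accuracy
```

A large gap between in-domain and out-of-domain scores is exactly the signal that the model has learned the sampling bias rather than the concept of abuse.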
With all these methods in consideration, a less-biased and well-balanced gold-standard benchmark dataset can be developed.
4 DATA
This chapter provides a description of the datasets, data collection, and data annotation. Section 4.1 presents the three datasets used to collect data: HatEval, OffensEval, and AbusEval. Section 4.2 provides the data distribution and discusses the data collection, class label conversion, and processing constraints. Section 4.3 first presents the methodology for annotating abusiveness, including a description of the annotators, the Inter-Annotator Agreement, and the differences between pre-annotated and gold labels. Second, it presents the annotations for explicitness. Third, it looks at the measures and distributions in an attempt to reduce bias, including author bias, previous tweet bias, creation date bias, and tweet text bias.
4.1 datasets
Data for this research were collected using tweet IDs from two shared tasks of the International Workshop on Semantic Evaluation. HatEval is the shared task concerning hate speech towards women and immigrants, and OffensEval is a shared task concerning offensive microposts. Twitter is the domain for both datasets and thus the domain for this research. For the second, more fine-grained binary classification of the explicitness of abuse, a third dataset, namely AbusEval, is used to evaluate the classification results in a cross-domain experimental design. This section will further discuss these datasets. The tweet IDs extracted from the two shared tasks, which were used to collect data via the Twitter API (section 4.2), will be further addressed as the seed (tweet) IDs.
4.1.1 HatEval
The HatEval Task is the fifth task of the International Workshop on Semantic Evaluation 2019 (SemEval): Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter.1 The task is organized in two related classification subtasks: a main binary subtask A for detecting the presence of hate speech, and a second, more fine-grained subtask B to first identify the target (individual or generalized group), as specified by Waseem et al. (2017), and second to identify the presence of aggression (Basile et al., 2019).
Hate Speech (HS) is commonly defined as "any communication that disparages a person or a group on the basis of some characteristics such as race, color, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics" (Nockleby, 2000).
In total there are 13,000 tweets, distributed over a train-set of 9,000 tweets, a development-set of 1,000 tweets, and a test-set of 3,000 tweets. The class distribution over these sets can be found in Table 2. The distribution of the annotated Hate Speech for the two targets of this task, immigrants and women, can be found in Table 3. The data are made publicly available without any preprocessing steps, such as the removal of user mentions.
Data were annotated by non-trained contributors on the crowdsourcing platform Figure Eight2, and the annotation of HS is defined as a binary value indicating whether HS is occurring against one of the given targets, women or immigrants: 1 if it occurs, 0 if not (Basile et al., 2019). Annotators were given a series of guidelines, including the definition of hate speech and a list of examples. The reliability of the annotators was validated with a restricted set of "correct" answers. Three independent judgments for each tweet were required, with relative majority voting. The average confidence (a measure combining Inter-Annotator Agreement, IAA, and the reliability of the contributor) on the English data is 0.83 (almost perfect agreement) for HS (Basile et al., 2019).

1 https://www.aclweb.org/portal/content/semeval-2019-international-workshop-semantic-evaluation
2 http://www.figure-eight.com/

              Train    Dev      Test     Total
Hateful       3,783    427      1,260    5,470
Non-Hateful   5,217    573      1,740    7,530
Total         9,000    1,000    3,000    13,000

Table 2: HatEval 2019 class distribution over the train-, dev- and test-set.
              Training         Test
Label         Imm.    Women    Imm.    Women
Hateful       39.76   44.44    42.00   42.00
Non-Hateful   60.24   55.56    58.00   58.00

Table 3: Distribution percentages across sets for the HS binary annotation layer, taken from Table 1 of Basile et al. (2019).
The presence of hateful tweets in the training- and test-set accounts for about 40% of the total tweets in the sets. Hateful tweets are therefore over-represented with respect to the distribution observed in the data collected from Twitter: in the whole originally annotated dataset, only about 10% contained hate speech, which is more in line with the estimates mentioned by Founta et al. (2018) and Ribeiro et al. (2018) in chapter 3.
The highest-ranking team scored a macro-averaged F1-score of 0.651 by training a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel only on the provided data, exploiting sentence embeddings from Google's Universal Sentence Encoder (Cer et al., 2018) as features.
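The structure of that winning system can be sketched as follows. The actual system embedded tweets with the Universal Sentence Encoder; here random vectors stand in for those embeddings, and the labels are invented, so this illustrates only the model setup, not the reported performance.

```python
# An SVM with an RBF kernel over fixed-size sentence embeddings:
# each tweet becomes one dense vector, and SVC learns a non-linear
# decision boundary between hateful (1) and non-hateful (0) tweets.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 512))       # stand-in 512-dim embeddings
y_train = np.array([0, 1] * 10)            # invented binary HS labels

clf = SVC(kernel="rbf")                    # Radial Basis Function kernel
clf.fit(X_train, y_train)
print(clf.predict(X_train[:3]).shape)      # one binary label per tweet
```

The appeal of this setup is that all feature engineering is delegated to the pretrained sentence encoder, leaving only the kernel SVM to tune.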
The authors conclude that hate speech detection against women and immigrants on Twitter is a challenging field with large room for improvement, and they hope that the dataset made available as part of this shared task fosters further research on the topic.
4.1.2 OffensEval
The OffensEval shared task concerns a series of shared tasks on offensive language identification organized at the International Workshop on Semantic Evaluation (SemEval).3 OffensEval models offensive content using a hierarchical annotation of three subtasks: (A) a binary classification of offensive and non-offensive microposts, (B) a more fine-grained binary subtask of whether the micropost has a target, and (C) a multi-class subtask for the target identification (Zampieri et al., 2020). OffensEval concerns two consecutive shared tasks:

1. OffensEval 2019, Task 6 of SemEval 2019: Identifying and Categorizing Offensive Language in Social Media.4

2. OffensEval 2020, Task 12 of SemEval 2020, OffensEval 2: Multilingual Offensive Language Identification in Social Media.5

Offensive microposts are defined as "posts containing any form of targeted or untargeted, non-acceptable language or profanity, which can be veiled or direct".

3 https://sites.google.com/site/offensevalsharedtask
4 https://www.aclweb.org/portal/content/semeval-2019-international-workshop-semantic-evaluation
5 https://www.aclweb.org/portal/content/semeval-2020-international-workshop-semantic-evaluation
This includes insults, threats, and posts containing profane language or swear words (Zampieri et al., 2019b).
Data for subtask A were annotated as offensive (OFF) or non-offensive (NOT); the class distribution over the train- and test-sets can be found in Table 4. The data for OffensEval 2019 and 2020 are enumerated below:
1. Data for the OffensEval 2019 shared task come from the Offensive Language Identification Dataset (OLID), which contains 14,100 English tweets in total, 13,240 of which are in the training-set and 860 in the test-set (Zampieri et al., 2019a).

2. Data for the OffensEval 2020 shared task come from the Semi-Supervised Offensive Language Identification Dataset (SOLID), which contains 12 million tweets in total, 9 million of which are in the training-set and 3 million in a test-set (Rosenthal et al., 2020). As this shared task is still ongoing, the original test-set has yet to be released, and the data used consist of the training-set. An analysis-set, named OffensEval 2020 test, was provided to test classifiers on and is also used in the experiments in this research.
              OffensEval 2019               OffensEval 2020
         Train     Test    Total       Training       Test
OFF      4,400      240    4,640              -      1,080
NOT      8,840      620    9,460              -      2,807
Total   13,240      860   14,100      9,089,140      3,887

Table 4: Data distribution for subtask A of English OffensEval 2019 and OffensEval 2020 Training, taken from Table 3 of Zampieri et al. (2019b) and Table 3 of Zampieri et al. (2020).
OLID for SemEval 2019 was annotated using the crowdsourcing platform Figure Eight.6 Experienced annotators were verified using test questions to discard annotators who did not achieve a certain threshold. All tweets were annotated by two people. In case of disagreement, a third annotation was requested and the majority label was used (Zampieri et al., 2019a).
OffensEval 2020 Training was collected from Twitter using the 20 most common English stopwords, such as the, of, and, to, etc., to ensure random tweets were collected. Langdetect7 was used to select English tweets, and tweets with fewer than 18 characters were discarded. OffensEval 2020 Training was then labeled in a semi-supervised manner using democratic co-training with OLID as a seed dataset (Zampieri et al., 2020). Four models were used in the semi-supervised approach: PMI (Turney and Littman, 2003), FastText (Joulin et al., 2016), LSTM (Hochreiter and Schmidhuber, 1997), and BERT (Devlin et al., 2018). The average confidence scores for the ensemble of the four models are added in the publicly available OffensEval 2020 Training, but no labels are added to the training data. Offensive tweets (OFF) for the test-set were selected using this semi-supervised approach and annotated manually; 2,500 non-offensive tweets (NOT) were included using this approach without being annotated. Inter-Annotator Agreement was computed on a small subset of offensive (OFF) instances and is 0.988 (almost perfect agreement) for subtask A (Zampieri et al., 2020).
Both OLID and OffensEval 2020 Training follow the same annotation schema and have been anonymized and normalised using the same methods: no user metadata or Twitter IDs were stored, and URLs and Twitter user mentions were substituted by @URL and @USER placeholders (Zampieri et al., 2019a; Rosenthal et al., 2020). For OffensEval 2020 Training, URLs are present in the training-set; tweets containing URLs are also present in the OffensEval 2020 test-set. Table 5 shows four examples and their annotations taken from OffensEval 2020 Training.
6 http://www.figure-eight.com/
7
ID    annotation   tweet text
A1    OFF          @USER @USER He's an evil law breaker that should be in prison with his criminal heartless family
A47   NOT          i'm not hating on itzy ofc they do their thing but the fact that yall are a bunch of hypocrites
A49   OFF          what a good liar and pretender :joy: :joy: :joy:
A95   OFF          @USER This is disgusting - you ought to be ashamed of yourself.

Table 5: Four examples and their annotations taken from OffensEval 2020 test.
Seven among the ten top-performing teams for OffensEval 2019 used BERT, a pre-trained deep bidirectional Transformer (Devlin et al., 2018). The top-performing team used BERT-base-uncased with default parameters and a maximum sentence length of 64, trained for 2 epochs, reaching a macro F1-score of 0.829, 1.4 points higher than the second team. The top ten teams for OffensEval 2020 used BERT (Devlin et al., 2018), RoBERTa-base and XLM-RoBERTa-large (Liu et al., 2019) trained on subtask A, sometimes as an ensemble that also included CNNs (Kim, 2014) and LSTMs (Hochreiter and Schmidhuber, 1997). The best-performing team achieved a macro F1-score of 0.9204 using an ensemble of ALBERT models of different sizes. Overall, the organizers of OffensEval 2019 and 2020 observe a trend that the best teams in all languages and subtasks used models with pre-trained contextual embeddings, most notably BERT.
4.1.3 AbusEval
AbusEval is a newly created resource that aims to address some of the existing issues in the annotation of offensive and abusive language with a dataset that takes into account the degree of explicitness (Caselli et al., 2020). This dataset is specifically created to evaluate abusive annotations on an explicit/implicit axis, as defined by the typology of Waseem et al. (2017). It is an annotation layer that makes use of the data in OffensEval's 2019 OLID. This more fine-grained annotation layer is added on top of OLID and will serve for the evaluation of the explicitly and implicitly abusive annotated data of this study. This is a sequential classification: an annotation is either non-abusive (NOT) or abusive (ABU), and if abusive, it is either explicitly abusive (EXP) or implicitly abusive (IMP).
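As an illustration, the sequential scheme can be sketched as a small helper function; the function and its names are hypothetical, not part of AbusEval:

```python
def sequential_label(is_abusive, is_explicit=None):
    """Map the two binary decisions of the sequential scheme onto labels.

    First decision: non-abusive (NOT) vs abusive (ABU); only abusive
    instances receive the second, explicit (EXP) vs implicit (IMP) label.
    """
    if not is_abusive:
        return ["NOT"]
    if is_explicit is None:
        raise ValueError("abusive instances need an explicit/implicit judgement")
    return ["ABU", "EXP" if is_explicit else "IMP"]
```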
The data distribution of AbusEval is shown in Table 6.
Label    Train    Test    Total
NOT     10,491     682   11,173
EXP      2,023     106    2,129
IMP        726      72      798
Total   13,240     860   14,100

Table 6: Train and test data distribution for the non-abusive (NOT), explicitly (EXP) and implicitly (IMP) abusive annotations of AbusEval.
4.2 Collection
Data for this research were sourced via all the tweet IDs of the dataset used in the HatEval shared task (section 4.1.1) and via a selection, based on two heuristics detailed in section 4.2.2, of the 9 million tweet IDs used in the OffensEval shared task (section 4.1.2). Data for this study were re-collected for two main reasons. First, the focal point of this research is the inclusion of contextual features in the form of user-related features, such as the user description, and a micropost-related feature in
the form of the previous tweet: the tweet that the current tweet is responding to. These features are missing in the publicly available datasets for both shared tasks. Second, there are discrepancies in the preprocessing and anonymization steps taken in OffensEval 2020 Training and HatEval.
In order to collect un-preprocessed and non-anonymized data with contextual features, data were re-collected via source tweet IDs from the HatEval and OffensEval shared tasks in the first two weeks of March 2020 via the Twitter API. Data were collected via Tweepy8, an easy-to-use Python library for accessing the Twitter API. To illustrate, a status or tweet is retrieved via the following GET method:

api.get_status(tweet_id, tweet_mode='extended')
Since HatEval’s dataset is annotated for a binary classification of hate speech and OffensEval 2020 Training is annotated for a binary classification of offensiveness, label conversion is necessary in order to create a dataset for the binary classification of abuse.
4.2.1 Twitter API
Data were collected during the first two weeks of March 2020 using the Twitter API with a registered app named 'abuse-dataset' with APP ID 17476346.9 The Twitter API limits the use rate to 900 requests per 15 minutes. For time-wise feasibility, data collection was migrated to a server that uses a distributed task queue10 implemented in a web-app developed in the popular Python web framework Django.11 This allowed for a continuous run of the data-collection process. The features extracted from the Twitter API for the tweets and Twitter users can be found in Table 30 and Table 31, presented in Appendix A.1. Data are stored in a relational database scheme of tweets, Twitter users, hashtags, URLs, user mentions, and symbols. An overview of the relational database scheme can be found in Appendix A.2.
4.2.2 Class Label Conversion
For HatEval, data were collected via the seed tweets in the original training-, development- and test-set found in Table 2. Tweets that contain hate speech were annotated with the abusive class (ABU), and consequently, tweets that do not contain hate speech were annotated with the non-abusive class (NOT).
All sets of the HatEval source data (train-, dev- and test-set) have been re-collected and annotated. There are only 13,000 tweets in the HatEval dataset, and after preprocessing only a small subset of tweets is left to be used in the experiments.
For OffensEval 2020 Training there are no pre-annotated labels available; instead, the average confidence score of the four ensemble classifiers is available. Data were collected via two heuristics. First, tweets with an average confidence score over 0.87 were collected and pre-annotated with the abusive class (ABU). Second, tweets with an average confidence score under 0.11 were collected and pre-annotated with the non-abusive class (NOT). Figures 1 and 2 show the distribution of the average confidence scores from the four ensemble classifiers of OffensEval 2020 Training.
These thresholds were chosen to constrain the data collection process in time and quantity. OffensEval 2020 Training is annotated for offensiveness, and an assumption in the pre-annotated labels is that tweets that are highly likely to be non-offensive are also non-abusive (confidence < .11) and that tweets that are highly likely to be offensive are abusive (confidence > .87).
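A minimal sketch of this pre-annotation heuristic, assuming the average ensemble confidence score is available as a float per tweet:

```python
ABU_THRESHOLD = 0.87  # above this, a tweet is pre-annotated as abusive (ABU)
NOT_THRESHOLD = 0.11  # below this, a tweet is pre-annotated as non-abusive (NOT)

def pre_annotate(avg_confidence):
    """Pre-annotate a tweet from the average ensemble confidence score.

    Tweets between the two thresholds are not collected and return None.
    """
    if avg_confidence > ABU_THRESHOLD:
        return "ABU"
    if avg_confidence < NOT_THRESHOLD:
        return "NOT"
    return None
```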
Table 7 shows the class distribution of total collected data per source dataset.
8 https://www.tweepy.org/
9 https://developer.twitter.com/en
10 https://docs.celeryproject.org/en/stable/index.html
11 https://www.djangoproject.com/
Figure 1: Average confidence scores for the non-abusive (NOT) class from OffensEval 2020 Training.
Figure 2: Average confidence scores for the abusive (ABU) class from OffensEval 2020 Training.
               HatEval                     OffensEval
         Train     Dev    Test     c < .11    c > .87     Total
ABU      3,783     427   1,260        n.a.     22,725    28,195
NOT      5,217     573   1,740      11,385       n.a.    18,915
Total    9,000   1,000   3,000      11,385     22,725    47,110

Table 7: Pre-annotated class distribution of data sourced from HatEval and OffensEval via seed tweet IDs.
4.2.3 Data Distribution
The tweet IDs sourced from HatEval's dataset were created within a time frame of August 2010 until September 2019, with the majority of the tweets (4,072) created in the last five months of the data collection period. The collected tweets for OffensEval 2020 Training have a creation date within a time frame of April 2009 until October 2019, with the majority of the tweets (26,625) in the last four months of data collection. October and August account for 24,896 of these 26,625 total collected tweets, whereas September only accounts for 32 tweets.
Since data were re-collected via the Twitter API within the first two weeks of March, some tweets or users might have been deleted between the publishing dates of the HatEval dataset and OffensEval 2020 Training and the re-collection, either by the Twitter users themselves or by the medium Twitter as a result of violating its guidelines.12
Table 8 shows the distribution of the pre-annotated classes over the data that have been retrieved via the Twitter API. These represent the active tweets for which no exception from the Twitter API was returned. Examples of these exceptions are shown in Figure 3. Some information, tweets or users, has been lost because of these Twitter API exceptions. The HatEval dataset was reduced from 13,000 tweets in the original dataset to 7,708 active tweets retrieved via the Twitter API as of March 2020. OffensEval 2020 Training was reduced from 34,110 tweets to 26,980 active tweets retrieved via the Twitter API as of March 2020.
               HatEval                     OffensEval
         Train     Dev    Test     c < .11    c > .87     Total
ABU      2,010     225     726        n.a.     17,215    20,176
NOT      3,442     341     964       9,765       n.a.    14,512
Total    5,452     566   1,690       9,765     17,215    34,688

Table 8: Pre-annotated class distribution of the data after collection via the Twitter API.
An interactive visualization of the demographics of both the HatEval and OffensEval datasets over the abusive class, the non-abusive class or both can be found online.13 The images below show two of the interactive visualizations.
12 https://help.twitter.com/en/rules-and-policies
13
Figure 3: Distribution of the exceptions and tweets that are replying for the HatEval dataset.
Figure 4: Most used terms by count for the non-offensive (NOT) class tweets extracted from the OffensEval dataset.
4.2.4 Processing Constraints
Several processing constraints, outlined in the enumeration below, ensure that the collected tweets contain all the contextual features in the @replies dataset. Moreover, tweets that do not reply to a previous tweet were excluded from the @replies dataset. The features or model fields that support these decisions can be found in Appendix A.1.
First, the tweet text should not be truncated. This is achieved by using the tweet_mode='extended' argument, as shown in the Tweepy illustration in section 4.2. Second, the tweet should be active, so that the @replies dataset does not include exception messages. Third, the tweet cannot be a quoted tweet, a case where the tweet is quoting another tweet. Fourth, the tweet should have a parent tweet that it is replying to. Fifth, this parent tweet should not have the same author, i.e. the author is not replying to themselves.
To summarise:
1. truncated: Tweets cannot be truncated.
2. active: Tweets must be active.
3. is_quote_status: Tweets cannot be quoted.
4. in_reply_to_status_id: Tweets must be a reply to another tweet.
5. in_reply_to_self: Tweets cannot be a reply by the author to themselves.
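The five constraints can be combined into a single filter predicate. The sketch below assumes each tweet is a flat dict whose keys mirror the model fields above; the exact field names are hypothetical stand-ins for the Twitter API attributes listed in Appendix A.1:

```python
def passes_constraints(tweet):
    """Return True only if a tweet satisfies all five @replies constraints."""
    return (
        not tweet.get("truncated", False)                    # 1. full text available
        and tweet.get("active", False)                       # 2. no API exception
        and not tweet.get("is_quote_status", False)          # 3. not a quoted tweet
        and tweet.get("in_reply_to_status_id") is not None   # 4. replies to a parent tweet
        and tweet.get("in_reply_to_user_id") != tweet.get("user_id")  # 5. not a self-reply
    )
```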
Table 9 shows the class distribution after the processing constraints and represents the dataset available for annotations. In total there were 10,121 tweets in the @replies dataset that met the summarised preprocessing criteria and were available for the judgement of the human annotators. The HatEval dataset was reduced from the 7,708 active tweets retrieved via the Twitter API as of March 2020 to 1,447 tweets after the application of the preprocessing steps. OffensEval 2020 Training was reduced from the 26,980 active tweets retrieved via the Twitter API as of March 2020 to 8,674 tweets after the application of the preprocessing steps.
               HatEval                     OffensEval
         Train     Dev    Test     c < .11    c > .87     Total
ABU        411      64     198        n.a.      5,818     6,491
NOT        406      54     314       2,856       n.a.     3,630
Total      817     118     512       2,856      5,818    10,121

Table 9: Pre-annotated class distribution of the @replies dataset after the application of preprocessing steps.
4.3 Annotation
The name of the annotated dataset is the @replies dataset. Tweets have been annotated by 41 student annotators, including the researcher, for the binary classification of abuse, in three annotation rounds:
1. Abusive and non-abusive binary annotation round by 41 student annotators including the author of the research.
2. Gold-label re-annotation of the abusive and non-abusive binary annotation for the tweets that did not reach consensus.
3. Explicit and implicit annotation round by the author of the research for all tweets that were annotated in the previous two rounds as abusive.
The first annotation round resulted in a total of 1,958 annotations, 749 of which had conflicting labels (i.e. two or more annotators gave different labels to the tweet). These conflicting tweets have been re-annotated by the researcher in a second, gold-label annotation round. Inter-Annotator Agreement Fleiss Kappa scores for each group and pair-wise Cohen Kappa scores are shown in Appendix B. Of the abusive tweets, 3,256 were annotated as explicit (EXP) and only 403 as implicit (IMP) by the researcher of this study in a third annotation round.
Figure 5: Graphical interface for the annotations with the user features, previous tweet and two buttons for the binary classification.
Annotation guidelines for the abuse/no-abuse classification can be found in Appendix D.1. Annotation guidelines for the explicit/implicit classification can be found in Appendix D.2.
Annotations were facilitated by the Django web-app previously introduced in section 4.2. A visualization of the annotation app is shown in Figure 5. The user features are shown on the left and include the profile image, screen name, user name, user description, and user location, as well as the number of followers and friends. For both the current tweet and the previous tweet (shown in grey), the text and the number of favorites and retweets are shown.
4.3.1 Abusiveness
There are 1,447 annotated tweets from the HatEval source data, and 8,674 annotated tweets from OffensEval 2020 Training in the @replies dataset.
               HatEval                        OffensEval
         Train   Dev   Test    c < .11   c > .87   Friends   Source missing    Total
ABU        410    79    267          6     2,897         0                0    3,659
NOT        407    39    245      2,850     2,921        21                8    6,491
Total      817   118    512      2,856     5,818        21                8   10,150
                1,447                        8,695                        8

Table 10: Class distribution of abusive (ABU) and non-abusive (NOT) tweets for the @replies dataset.
There are 29 tweets that are sourced from friends: followers of the users in our source data who are also followed by that user (follow + follow back). 21 of these are friends with users from the OffensEval dataset. In a promising research direction, over 35,000 tweets from users that are friends with Twitter users in our source data have also been scraped, of which only a very small portion (29) was annotated. Foremost this is due to the time-wise feasibility of this research. Moreover, no boosting mechanism was used to promote abusive tweets except for scraping tweets from abusive users. Inspection of the tweets and their annotations showed that the vast majority of these tweets were non-abusive. Only a small portion was included, to maintain a distribution of roughly one-third abusive tweets and two-thirds non-abusive tweets.
To conclude, from HatEval and OffensEval source tweets plus the tweets sourced by friends there are 10,150 annotated tweets for abusiveness: 3,659 annotated as the abusive (ABU) class and 6,491 as the non-abusive (NOT) class.
Annotators
A total of 44 student annotators and the researcher, divided over 14 groups, annotated the tweets for the abusive/non-abusive binary classification. Each student was assigned 150 tweets: 50 in common with the other student annotators in their group, and 100 tweets that they annotated individually without overlap with other student annotators. The researcher of this study annotated all tweets in the annotation set. 42 annotators annotated 150 tweets and only two annotators annotated less than that: ann27 annotated only 17 tweets and ann10 annotated 103 tweets. The annotations were part of a Bachelor's Information Science course for which credit was awarded if 100 tweets were annotated, with a bonus credit added if 150 tweets were annotated. It is likely that ann10 annotated just over the threshold of 100 annotations as a result of this. The student annotators provided 6,420 annotations. Demographics of the annotation (Bender and Friedman, 2018) for 41 of the 45 annotators can be found in Appendix C.
Inter-Annotator Agreement
Inter-Annotator Agreement was calculated both as Cohen's Kappa scores for pair-wise annotators in a group and as an overall Fleiss Kappa score per group. Both Cohen's Kappa and Fleiss Kappa metrics for all groups are presented in the tables in Appendix B. The mean Fleiss Kappa score over all groups is 0.5506 (moderate agreement). A threshold of 45 overlapping annotations was required in order to ensure consistency of the IAA evaluation metrics. Ann27, who only annotated 17 tweets, consequently does not meet this threshold and is removed from the Inter-Annotator Agreement evaluation. Fleiss Kappa is especially low for groups 2, 5 and 12. Further inspection of these groups did not yield conclusive results, except that the overall majority of annotations that did not achieve consensus had a class distribution of three-to-one: either three annotations as abusive and one as non-abusive, or vice-versa. In group 2, 78.26%, in group 5, 80.0% and in group 12, 80.85% of these annotations had a three-to-one class distribution.
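For reference, pairwise Cohen's Kappa can be computed directly from two annotators' label sequences; the snippet below is a minimal sketch of the metric, not the exact script used in this research:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators who labelled the same tweets."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected chance agreement, from each annotator's label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)  # undefined if expected == 1
```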
Gold Label Annotation
In total there were 749 annotations for which no consensus was achieved: at least one annotation differed from the other annotations. Because of the difficulty of the task, as shown by the low mean Fleiss Kappa score, another annotation round was conducted by the researcher of this paper to obtain gold annotation labels. The researcher re-annotated the 749 tweets without consensus and provided gold labels for these annotations.
Pre-annotated & Gold Labels
Inspection of the distribution of the pre-annotated labels in comparison to the gold labels after the annotation rounds, for both the HatEval dataset and OffensEval 2020 Training, proved relevant for two main reasons.
First, for HatEval 497 (34.35%) and for OffensEval 2020 Training 2,934 (33.74%) of the pre-annotated labels were changed, either from abusive (ABU) to non-abusive (NOT) or vice-versa. The main difference between the annotation schemes consists of the additional inclusion of context-related features in the annotation process in this research. Roughly a third of the annotations changed, arguably due to the inclusion of these context-related features. The relevant insight here is that context changes annotations.
Second, offensive labelled tweets from OffensEval 2020 Training, with an average ensemble-classifier confidence over .87, have been reduced from 5,818 offensive tweets to only 2,897 abusive tweets. This is a strong indication that offensiveness does not imply abusiveness. Tweets that contain offensive swear words can very well be non-abusive. Following this indication, we can say that offensive swear words are used to enforce both non-abusive and abusive messages.
gold: abuse    pre-annotation: no-abuse    tweet ID: 1029489592905347074
previous: The Westminster terrorist has been named as Salih Khater, a Sudanese Muslim migrant (now UK national) from Birmingham. https://t.co/HOiAtBIpV5
current: @USER Also, water is wet. #SendThemBack

gold: abuse    pre-annotation: no-abuse    tweet ID: 919096901818925056
previous: My barber loyal. Just like me https://t.co/EflMIchbxI
current: @USER You a lie and your barber a h** too n**** your line up looking like a serrated knife

gold: no-abuse    pre-annotation: abuse    tweet ID: 1039390652755267584
previous: These are the heroes who risk their lives every day. #CopsLivesMatter https://t.co/3wC8LSX4rW
current: @USER Get these criminals off the roads and off the streets! #BackTheBlue #BuildThatWall

gold: no-abuse    pre-annotation: abuse    tweet ID: 919097303457132544
previous: i swear, i feel like ppl be taking my tweets just switching some words. LOLOL but then again
current: @USER Girl people really do that, like damn b**** you so selfish you don’t wanna rt me and get me clout

Table 11: Tweets with gold and pre-annotated label differences for tweets sourced from the HatEval 2019 dataset.
Table 11 details the difference between the pre-annotated labels and the gold labels from this research for four tweets from the HatEval source data. The first tweet, with ID 1029489592905347074, is very sarcastic, but the sarcasm is only made relevant by the content of the previous tweet. Implying that the statement that a terrorist is a Sudanese Muslim is as relevant as the statement that water is wet is very implicitly abusive. The second tweet, with ID 919096901818925056, has very explicit content that is pre-annotated as non-abusive, when it is very abusive. This, in combination with the low Fleiss Kappa scores observed in the annotation rounds of this research, reinforces the argument that annotating abusiveness in microposts is a difficult task in and of itself.
The third tweet, with ID 1039390652755267584, shows that the current tweet is reinforcing the statement made in the previous tweet that cops are heroes who get criminals off the streets. This is however only made relevant by the context provided in the previous tweet. The fourth tweet, with ID 919097303457132544, again shows that the current tweet is responding, and the offensive or abusive part "damn b****" is a generalized statement. This generalized statement reinforces what the previous tweet is actually saying, and this abuse does not hold a target, making it non-abusive. Because the previous tweet is non-abusive and the current tweet only reinforces its statement, it should therefore also be non-abusive.
gold: abuse    pre-annotation: no-abuse    tweet ID: 1158331171819966465
previous: Mitch McConnell fractures his shoulder after tripping https://t.co/0IENh94lAh
current: @USER Thoughts and prayers :vomit:

gold: abuse    pre-annotation: no-abuse    tweet ID: 1159528346373349376
previous: Looking forward to joining @USER today on @USER in #Kentucky to discuss the news of the day. Tune in live at 4:05 ET today: https://t.co/TKlXiZ1ism
current: @USER @USER @USER Hopefully the news of the day is that you are retiring.

gold: no-abuse    pre-annotation: abuse    tweet ID: 1187550316981391360
previous: Closed my eyes and just had the weight of my city of Staten Island my family and the wu tang legend @USER on my back https://t.co/wt6cg7nukl
current: @USER @USER You killed that s*** @USER

gold: no-abuse    pre-annotation: abuse    tweet ID: 1186821711511130114
previous: WHY DID I TAKE SO LONG TO TRY APEX I LOVE THIS S***
current: @USER That s*** fun if your whole team kill thirsty

Table 12: Tweets with gold and pre-annotated label differences for tweets sourced from OffensEval 2020 Training.
Table 12 shows that the data are indeed annotated for offensiveness and not abusiveness, and that the assumption that offense equals abuse, as mentioned in section 4.2.2, does not hold. The first two tweets (tweet IDs 1158331171819966465 and 1159528346373349376) do not contain swear words but do have a very sarcastic tone. The first tweet, with ID 1158331171819966465, contains a throw-up emoticon after "Thoughts and prayers" (unfortunately not rendered in the LaTeX PDF14), making the sarcastic tone very explicit. The average confidence score from the four ensemble classifiers is 0.0951. The second tweet is very sarcastic in nature, implying that the author of the previous tweet should retire. This tweet is not in the OffensEval bucket of confidence scores under .11, making the case unique. It is likely that this tweet was annotated as offensive with a high confidence score because of the hashtag #SendThemBack, which was not annotated as abusive in our annotation round. The two following tweets (tweet IDs 1187550316981391360 and 1186821711511130114) contain the swear word s***, but the context of the tweet shows that it is merely used to enforce their statement, which is a positive message.
In Appendix F.1 and F.2 a total of ten tweets per dataset can be found that contain differences between the pre-annotated and gold labels. These analyses also show the difference between data annotated for hate speech or offensiveness and data annotated for abusiveness.
4.3.2 Explicitness
Data that have been annotated as abusive by the student annotators have also been annotated on the explicit (EXP) and implicit (IMP) axis as defined in the typology of Waseem et al. (2017). In total there are 3,659 abusive annotations, 3,256 annotated as explicitly abusive (EXP) and 403 as implicitly abusive (IMP). The distribution of the explicit and implicit class is much more skewed: the implicitly abusive tweets represent only about 11% of the total annotated tweets. Table 13 shows the distribution of the classes over the sets in the HatEval dataset and OffensEval 2020 Training.
The few (6) annotations with low ensemble-classifier confidence scores for offensiveness in OffensEval 2020 Training are all annotated as implicitly abusive. Furthermore, 58 annotations with high confidence scores for offensiveness are annotated as implicitly abusive. These tweets are thus classified as explicit by the ensemble classifiers of OffensEval 2020 Training, but annotated as implicit. Inspection of these 58 tweets could further indicate why. An example is tweet ID 1161509981348401152 with an average confidence score of 0.8946 for offensiveness: "@USER Except Trump is NOT
               HatEval                  OffensEval
         Train   Dev   Test    c < .11   c > .87    Total
EXP        215    38    164          0     2,839    3,256
IMP        195    41    103          6        58      403
Total      410    79    267          6     2,897    3,659
                 756                         2,903

Table 13: Class distribution of explicitly abusive (EXP) and implicitly abusive (IMP) tweets for the @replies dataset.
A RACIST and that makes you a liar...". This tweet is likely tagged as offensive because of the capitalized RACIST, but it is annotated in our dataset as implicitly abusive, since it implies that the @USER has a deceiving nature. Another example is tweet ID 1185903137573560322 with an average confidence score of 0.8970 for offensiveness: "@USER Shut ur mouth - you couldn’t even wipe ur own ass for 95% of thos". This tweet is likely classified as offensive because of ass, and it shows the difficulty of annotating abuse, as it is annotated as implicitly abusive in our annotations.
4.3.3 Bias Reduction
In order to balance the dataset as much as possible, as discussed in detail in chapter 3, two dimensions of bias are accounted for: author bias and bias in the contextual feature of the previous tweet. More potential dimensions of bias have been introduced in section 3.2, but only these two have been accounted for. In order to account for author bias, tweets written by the same author were removed from the @replies dataset used in the experiments. Some of the previous tweets did not contain text; empty previous tweets are also removed from the @replies dataset used in the experiments.
Because of these two steps, all annotated data comes from unique single authors and contains a textual previous tweet that the current tweet is responding to. This resulted in a more evenly distributed annotation set of 8,129 tweets.
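The two bias-reduction steps can be sketched as follows, assuming each tweet is a dict with hypothetical 'author_id' and 'previous_text' keys; this reading keeps the first tweet per author, which is one way of removing same-author duplicates:

```python
def reduce_bias(tweets):
    """Drop tweets with an empty previous tweet, then keep one tweet per author."""
    seen_authors = set()
    kept = []
    for tweet in tweets:
        if not tweet.get("previous_text"):
            continue  # previous-tweet bias: no textual context to annotate against
        if tweet["author_id"] in seen_authors:
            continue  # author bias: at most one tweet per unique author
        seen_authors.add(tweet["author_id"])
        kept.append(tweet)
    return kept
```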
               HatEval                        OffensEval
         Train   Dev   Test    c < .11   c > .87   Friends   Source missing    Total
ABU        221    62    170          6     2,401         0                0    2,860
NOT        235    24    154      2,485     2,346        18                7    5,269
Total      456    86    324      2,491     4,747        18                7    8,129
                 866                         7,256                        7

Table 14: Class distribution of abusive (ABU) and non-abusive (NOT) tweets for the @replies dataset when duplicate authors and empty previous tweets are removed.
               HatEval                        OffensEval
         Train   Dev   Test    c < .11   c > .87   Friends   Source missing    Total
EXP        136    34     99          0     2,354         0                0    2,623
IMP         85    28     71          6        47         0                0      237
Total      221    62    170          6     2,401         0                0    2,860
                 453                         2,407                        0

Table 15: Class distribution of explicit (EXP) and implicit (IMP) abusive tweets for the @replies dataset when duplicate authors and empty previous tweets are removed.
Table 14 shows the abusive/non-abusive class distribution of the @replies dataset used for the experiments. There are no duplicate authors and no tweets with empty previous tweet texts; however, there are still 362 tweets that share their parent tweet with at least one other tweet. Table 15 shows the class distribution of the explicit and implicit abusive classes, where there are still 214 tweets that share their parent tweet.
Author Bias
Of the 10,150 annotated tweets, there are 148 tweets that share an author with another tweet. 144 of these duplicate authors are from the HatEval 2019 dataset and only four are from OffensEval 2020 Training.
For instance, one of the users has twelve tweets to his name and is a bot that responds with the time when you ask it the time: "@USER the time is indeed 08/12/2019 09:09:36 thanks for contacting us.". Many of the duplicate authors only have two tweets to their name; however, there are some outliers. There are 15 authors that have authored more than five tweets in the @replies dataset. One author, originally from the HatEval 2019 dataset, has 46 tweets to his name and does not appear to be a bot or spam account.
Previous Tweet Bias
There are two types of bias related to the previous tweets: empty previous tweets and duplicate previous tweets. There are 1603 tweets in the @replies dataset that have an empty previous tweet: indicating that either the Twitter author has removed the tweet’s text or there was an exception returned from the Twitter API as shown in
4.2. There are 405 tweets that have the same previous tweet that they are responding
to. It is likely that these duplicate parent tweets are present in the HatEval and OffensEval datasets due to their data collection methodology.
Inspection of the tweets that share a parent tweet showed that, for the abusive/not abusive annotations, two tweets with the same parent tweet do not necessarily share the same class: in some cases one is abusive and the other is not, or vice versa. For the overwhelming majority of tweets with the same parent tweet, however, the gold label annotations were equal. For the explicit/implicit axis, all of the tweets sharing the same parent tweet had the same class label. Examples of these phenomena are shown in Table 16 below.
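This label-agreement check can be expressed as grouping replies by parent tweet and comparing their gold labels. A sketch, where the grouping structure is illustrative and only the tweet IDs are taken from Table 16:

```python
from collections import defaultdict

# (tweet_id, parent_id, gold_label) triples; parent ids are invented.
rows = [
    (1161145025620672512, 100, "abuse"),
    (1160662948701884417, 100, "no-abuse"),
    (1158331171819966465, 200, "implicit"),
    (1038010523852398592, 200, "implicit"),
]

labels_by_parent = defaultdict(set)
for _, parent, label in rows:
    labels_by_parent[parent].add(label)

# Parents whose replies were annotated with more than one distinct gold label.
disagreeing_parents = {p for p, labels in labels_by_parent.items() if len(labels) > 1}
```

Here the first pair of replies disagrees on the abuse/no-abuse label while the second pair agrees on implicit, mirroring the pattern reported above.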
gold: abuse      tweet ID: 1161145025620672512
  previous: F*** THIS SOB! Twitter you know what to do. FIND THIS SOB! https://t.co/YkKYxXb5Wz
  current: @USER What a piece of s***

gold: no-abuse   tweet ID: 1160662948701884417
  previous: F*** THIS SOB! Twitter you know what to do. FIND THIS SOB! https://t.co/YkKYxXb5Wz
  current: @USER What the f*** Just, what the f***.

gold: implicit   tweet ID: 1158331171819966465
  previous: Cory Booker & Kamala Harris competing for Most Hysterical Woman at the Kavanaugh hearings.
  current: @USER Ann, ever think you might be a bit hysterical yourself. You lost your ability to reason long ago. Not sure when you had the heartectomy. One thing for sure. You don’t belong in the human race. Damn.

gold: implicit   tweet ID: 1038010523852398592
  previous: Cory Booker & Kamala Harris competing for Most Hysterical Woman at the Kavanaugh hearings.
  current: @USER I thought you already had the title. Oh wait, you said most hysterical WOMAN.

Table 16: Four example tweets that share the same parent tweet for both the abuse/no-abuse and explicit/implicit annotations.
Creation Date Bias
Figures 6 and 7 show the abusive and non-abusive class distribution over the creation dates of tweets sourced from HatEval’s dataset and OffensEval 2020 Training, for months with more than five tweets. Appendix E shows the full creation date distribution per class per dataset.
Figures 8 and 9 show the class distribution of explicit and implicit tweets per dataset. As argued in Chapter 3, it is relevant to look at these distributions in order to accurately identify possible bias in the @replies dataset. Figure 6 shows an even distribution of the abusive and non-abusive classes for the tweets sourced from HatEval, and Figure 7 shows the expected distribution of fewer abusive than non-abusive tweets in all months of OffensEval 2020 Training.
Figure 6: Abusive and non-abusive class distribution of the tweet creation dates for annotated tweets without bias sourced from HatEval’s dataset.
Figure 7: Abusive and non-abusive class distribution of the tweet creation dates for annotated tweets without bias sourced from OffensEval 2020 Training.
Figure 8: Explicit and implicit class distribution of the tweet creation dates for annotated tweets without bias sourced from HatEval’s dataset.
Figure 9: Explicit and implicit class distribution of the tweet creation dates for annotated tweets without bias sourced from OffensEval 2020 Training.
When looking at the distribution of the explicit and implicit classes, however, there is much more bias in both datasets. Figure 8 shows that explicitly (EXP) annotated tweets mainly come from June 2017, October 2017, and September 2019 for HatEval, and Figure 9 shows that all implicit tweets from OffensEval 2020 Training come from August 2019. For OffensEval this is to be expected, as there are only six implicitly annotated tweets.
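The per-month class counts underlying these figures can be computed by bucketing creation dates by year and month. A sketch with made-up timestamps (the real values come from the tweet metadata):

```python
from collections import Counter
from datetime import datetime

# Invented (created_at, class) pairs standing in for tweet metadata.
records = [
    ("2019-08-01", "IMP"),
    ("2019-08-15", "IMP"),
    ("2017-06-03", "EXP"),
    ("2017-10-20", "EXP"),
    ("2019-09-02", "EXP"),
]

# Count tweets per (year-month, class) bucket.
per_month = Counter(
    (datetime.strptime(d, "%Y-%m-%d").strftime("%Y-%m"), cls) for d, cls in records
)
```

Plotting these (month, class) counts as grouped bars yields distributions of the kind shown in Figures 6 through 9.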
Tweet Text Bias
There are three pairs of tweets that contain exactly the same tweet text. These duplicate-text pairs are included in the @replies dataset, which constitutes another possible bias. Inspection showed that they are typically short, highly generic tweets such as "@USER Shut the f*** up" (tweet IDs 1157827990597898242 and 1162113791817068566). Only one of the pairs of duplicate-text tweets shares the same previous tweet: in the case of the tweet with ID