Contextual Feature Abuse Detection
The Use Of Author And Parent Micropost Features
For Abuse Detection Datasets
L.E.D. (Louis) de Bruijn
Master Thesis Information Science (Informatiekunde)
Louis Emile Deodatus de Bruijn s2726327
1 ABSTRACT
In recent years, the automatic detection of online abusive language on social media has consolidated into a popular research direction. However, many of these systems make use of unbalanced and biased datasets. First, this research provides a critical overview of these biases and subsequently tries to account for them in the creation of a newly annotated abusive language dataset on the Twitter domain. Second, this research investigates the inclusion of contextual features in the collection, annotation, and classification process.
This research presents a dataset with a total of 10,150 annotated tweets for a first binary classification of abuse/no-abuse and a second, more fine-grained binary classification of explicit and implicit abuse. The dataset has two main characteristics: (1) it solely contains tweet @replies, with the inclusion of the previous tweets, and (2) it contains many offensive swear words, representing the informal Internet speech found abundantly on Twitter.
The performance of a simple supervised machine learning model in a cross-dataset experimental design has shown that the dataset generalizes the concept of abuse well, but the subtypes of explicit and implicit abuse less so. A distinction made in this research is that offensiveness does not imply abusiveness. Future research can use this dataset to further investigate the relationship between abuse and offensiveness in an @replies network.
WARNING: This paper contains tweet examples and words that are offensive in nature.
CONTENTS

1 abstract
2 introduction
3 related work
   3.1 Abusive Language
   3.2 Bias
   3.3 Best Practices For Dataset Creation
4 data
   4.1 Datasets
       4.1.1 HatEval
       4.1.2 OffensEval
       4.1.3 AbusEval
   4.2 Collection
       4.2.1 Twitter API
       4.2.2 Class Label Conversion
       4.2.3 Data Distribution
       4.2.4 Processing Constraints
   4.3 Annotation
       4.3.1 Abusiveness: Annotators; Inter-Annotator Agreement; Gold Label Annotation; Pre-annotated & Gold Labels
       4.3.2 Explicitness
       4.3.3 Bias Reduction: Author Bias; Previous Tweet Bias; Creation Date Bias; Tweet Text Bias
5 methods and experiments
   5.1 Methodology
       5.1.1 Support Vector Machine Model
       5.1.2 Vectorizer
       5.1.3 Feature Pipeline
       5.1.4 Train/Test Split
   5.2 Fine-tuning Experiments
       5.2.1 Vectorizer
       5.2.2 Preprocessing
       5.2.3 Features
   5.3 Best-performing Experiments
   5.4 Cross-dataset Experiments
   5.5 Most Informative Tokens
   5.6 Previous Tweet Annotation
6 discussion
   6.1 Fine-tuning Experiments
   6.2 Cross-dataset Experiments
       6.2.1 HatEval
       6.2.2 OffensEval
       6.2.3 AbusEval
   6.3 Most Informative Tokens
   6.4 Annotation Process
   6.5 Boosting Abusive Content
7 conclusion
   7.1 The Most Relevant Biases
   7.2 Controlling For Bias In Data Creation
   7.3 Including Context-related Features
   7.4 Limitations
   7.5 Future Work
8 acknowledgements
Appendices
a database objects
   a.1 Tweet and TwitterUser Object Features
   a.2 Relational Database Scheme
b inter-annotator agreement
c annotator demographic of the annotation
d annotation guidelines
   d.1 Abuse Annotation Guidelines
   d.2 Explicit Annotation Guidelines
e tweet creation distribution
   e.1 Abusive and non-abusive
   e.2 Explicit and implicit
f example tweet excerpts
   f.1 HatEval Example Tweets
   f.2 OffensEval 2020 Training Example Tweets
g hypertune parameters results
   g.1 Vectorizer Parameters
   g.2 Preprocessing Parameters
   g.3 Features Performance
h hypertuned experiments results
i cross-dataset results
   i.1 HatEval
   i.2 OffensEval
2 INTRODUCTION
As the public debate about violence, racism, misogyny, and hate has resurfaced in the last couple of months, it has also extended to the Internet. As a result, the detection of abusive language in user-generated online content has become an issue of increasing importance. For instance, organizers of the Stop Hate For Profit movement1 have recently initiated a campaign to put a halt to the incitement of violence prevalent on Facebook and have gained tremendous support, both in the United States and the Netherlands.2
Because of the vast quantity of user-generated content on these social media platforms, manually checking all possibly abusive content has proven unfeasible (Schmidt and Wiegand, 2017). The Natural Language Processing (NLP) community has responded with the development of several classification systems for the automatic detection of abusive language (Jurgens et al., 2019). These supervised machine learning systems require an annotated dataset with both abusive and benign microposts to train on.
Online abusive language is commonly defined as "an omnibus term that includes hurtful, derogatory or obscene utterances expressed towards a targeted group or specific members of that group" (Davidson et al., 2017; Jurgens et al., 2019; Wiegand et al., 2019).
Because abusive language is an elusive phenomenon that occurs very sparsely, with estimates ranging from 0.1 to 3% of tweets containing abusive content, randomly sampling microposts for these datasets would result in too little abusive content: a random sample of 10,000 tweets would be expected to contain only between 10 and 300 abusive ones (Schmidt and Wiegand, 2017; Founta et al., 2018; Ribeiro et al., 2018). Many of these datasets are therefore collected via some sort of focused sampling method, introducing a wide variety of biases that often go undetected (Wiegand et al., 2019). Topic bias, domain bias, bias towards subtypes of abuse, such as explicitly abusive content, and author bias are all present. The lack of a clear definition, a typology of the different subtasks, and well-documented annotation guidelines further deteriorates the objectivity of these datasets (Ribeiro et al., 2018; Waseem et al., 2017). Waseem et al. (2017) have proposed a two-fold typology to synthesize the different subtasks under the umbrella term of online abuse. The typology considers first whether the abuse is directed at a specific target or generalized, and second the degree to which it is explicit.
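The two axes of this typology can be represented as a pair of binary labels per micropost. A minimal sketch follows; the example tweets and their labels are invented for illustration and are not taken from any of the datasets discussed here.

```python
# The two-fold typology of Waseem et al. (2017) as two binary axes:
# axis 1: directed at a specific target vs. generalized;
# axis 2: explicit (unambiguous hateful terms) vs. implicit (veiled).
from dataclasses import dataclass

@dataclass
class AbuseLabel:
    directed: bool  # True = aimed at a specific individual or entity
    explicit: bool  # True = uses unambiguous hateful terms

# Invented examples, for illustration only:
examples = {
    "@user you are a worthless idiot": AbuseLabel(directed=True, explicit=True),
    "people like @user should stay in the kitchen": AbuseLabel(directed=True, explicit=False),
}

for text, label in examples.items():
    kind = ("directed" if label.directed else "generalized") + ", " + \
           ("explicit" if label.explicit else "implicit")
    print(f"{kind}: {text}")
```

Keeping the two axes independent makes it possible to annotate and evaluate explicitness separately from targeting, which is how the fine-grained subtasks in this thesis are set up.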
Even though several surveys have outlined the importance of these biases, most prominently the paper by Wiegand et al. (2019), the field has yet to benefit from a less-biased benchmark dataset that controls for them (Schmidt and Wiegand, 2017). Previous work suggests taking knowledge-based contextual information into account in the collection, annotation, and classification process (Schmidt and Wiegand, 2017; Wiegand et al., 2019). This research aims to identify the most relevant biases in abusive language datasets in order to create a more balanced dataset, and to investigate the performance of including context-related features in both the annotation process and the experimental settings.
The research questions of this study are:
1. What are the most relevant biases in abusive language datasets?
2. How are the most relevant biases controlled for in data creation?
3. Does the inclusion of context-related features improve annotations and automatic classification scores?

1 https://www.stophateforprofit.org/
2 https://www.nrc.nl/nieuws/2020/07/02/stop-hate-for-profit-heineken-tomtom-boycotten-facebook-a4004704
This paper follows the typology provided by Waseem et al. (2017) and presents a newly annotated dataset, the @replies dataset, of around 10,000 microposts on Twitter for a first binary classification of abusiveness and a second, fine-grained binary classification of the explicitness of the abuse.
The body of the research concerns the identification of several biases in this dataset and evaluates classification scores in a cross-dataset experimental design of previously collected datasets from two shared tasks of the International Workshop on Semantic Evaluation: HatEval and OffensEval. The @replies dataset is focused on contextual features, most importantly the previous or parent tweet that tweets are responding to and the user description. Responding to other microposts is a key characteristic of social media platforms and an integral part of the spread of knowledge through social media networks. This study aims to further develop the novel research direction of including these contextual features in the annotation and classification process.
To answer our research questions, it is useful to discuss the previous literature concerning the different forms of bias and the complexity of annotating abuse in the next chapter. Chapter 4 provides a description of the data and material. Chapter 5 provides a detailed overview of the methodology, including data collection, annotation, processing, and experimental design, and presents the results of the different experiments. Chapter 6 thoroughly discusses the implications and findings. Chapter 7 concludes the findings, answers the research questions, discusses the study's limitations, and proposes future research directions.
3 RELATED WORK
This chapter presents related work in the field of abusive language detection. First, it details the inherent difficulties of online abusive language. Second, it discusses the field's current advances and datasets for the automatic detection of online abusive language. Third, it sharply defines online abusive language, its target, and the degree to which it is explicit. Next, it introduces several biases identified in the field: annotation bias, bias in explicit and implicit markers, topic bias, and author bias, as well as several ways to reduce these biases. Finally, an evaluation of these biases is discussed.
3.1 abusive language
Detection of abusive language in user-generated online content becomes more important as social media platforms have come under increasing public and governmental pressure to tackle the issue of online abuse (Schmidt and Wiegand, 2017; Waseem et al., 2017). The prevalence of online abuse has shown that social media platforms can be used to incite mobs of people to violence, as extreme ideas have a tendency to spread throughout social networks (Brady et al., 2017). Moreover, recent surveys indicate that online abusive behavior happens much more frequently than previously suspected (Jurgens et al., 2019).
Nonetheless, online abusive language remains an elusive phenomenon that occurs very sparsely, with estimates ranging from 0.1 to 3% of tweets containing abusive content (Founta et al., 2018; Ribeiro et al., 2018). Because of this, there is a very skewed distribution between benign and abusive microposts (Schmidt and Wiegand, 2017). Even though abusive content occurs sparsely, manually inspecting all possibly abusive content in online social media networks is unfeasible (Schmidt and Wiegand, 2017).
The NLP community has responded by developing several classification systems for the automatic identification of abusive language (Jurgens et al., 2019). These supervised machine learning systems require an annotated dataset containing both abusive and benign microposts. Random sampling for these datasets would result in too little abusive content, which is why many choose some sort of focused sampling method (Schmidt and Wiegand, 2017). These methods introduce a variety of biases that often go undetected. Addressing these biases results in much lower classification scores for these systems (Wiegand et al., 2019).
Online abusive language is a very broad concept whose categorization has proven difficult: a broad categorization is likely too computationally inefficient, yet a narrow categorization risks further reinforcing harm to affected community members (Jurgens et al., 2019). Detection of hate speech, derogatory language, and cyberbullying are all different subtasks that have been grouped under the umbrella term of online abusive language (Waseem et al., 2017). The lack of a general definition has resulted in contradictory annotation guidelines, making it difficult to correctly compare their results (Waseem et al., 2017; Schmidt and Wiegand, 2017). For instance, some microposts identified as hate speech by Waseem and Hovy (2016) are only considered offensive by Nobata et al. (2016) and Davidson et al. (2017).
Online abusive language is commonly defined as "an omnibus term that includes hurtful, derogatory or obscene utterances expressed towards a targeted group or specific members of that group" (Davidson et al., 2017; Jurgens et al., 2019; Wiegand et al., 2019).
Waseem et al. (2017) have proposed a typology to avoid the contradictions and synthesize the different subtasks under the umbrella term of online abuse. This two-fold typology considers first whether the abuse is directed at a specific target or generalized, and second the degree to which it is explicit. Explicit abusive language is that which is unambiguous in its potential to be abusive, using hateful terms (Barthes, 1957). Implicit abusive language is that which does not immediately imply or denote abuse, often obscuring its true nature by the lack of hateful terms (Waseem et al., 2017).
Schmidt and Wiegand (2017) mention the lack of a benchmark dataset and suggest that the community would benefit from a dataset built on a commonly accepted definition of the task. They conclude that it remains to be seen whether abusive language classification methods can solve the problems posed, or only certain subtypes of abuse.
3.2 bias
The following biases have been identified in the literature:

• Contradictory annotation guidelines as a result of a loosely defined concept of online abuse and vague documentation.

• Topic bias as a result of selective sampling, such as bias towards certain subtypes of abuse such as misogyny, racism or sexism, or towards certain hashtags on Twitter or topics such as American politics or LGBTQ communities.
• Overemphasis on explicit forms of abuse as a result of selective sampling
methods via an abusive language lexicon.
• Underexposure of implicit forms of abuse as a result of previously mentioned
sampling methods and the inherent difficulties of identifying and annotating implicit abuse.
• Author bias as a result of selective sampling, where only a small group of
authors accounts for a majority of the abuse classes, topics or subtypes of abuse.
Several studies consider the importance of bias introduced in the domain of online abusive language detection and the considerations when interpreting the results of these studies (Klubička and Fernández, 2018; Schmidt and Wiegand, 2017; Wiegand et al., 2019). In particular, Wiegand et al. (2019) show that many datasets model the bias instead of the phenomenon of abusive language, and that classification scores on popular datasets are much lower under realistic settings in which the bias is reduced.
Apart from issues with contradictory annotation guidelines due to the lack of a proper definition, these guidelines often remain fairly vague and undocumented (Schmidt and Wiegand, 2017). Several studies reflect on the difficulties of annotating online abuse, especially implicit forms of abuse (Waseem et al., 2017; Schmidt and Wiegand, 2017). As such, there seems to be little consensus on the definitions of the different subtasks among annotators, with Kappa scores ranging from 0.6κ to 0.8κ (Waseem et al., 2017). For instance, both Waseem and Hovy (2016) and Davidson et al. (2017) find that annotators consider racist or homophobic terms to be hateful, but consider words that are sexist and derogatory towards women to be only offensive. As a result of these considerations, several authors favor expert annotators with domain knowledge over the employment of non-expert annotators through services such as Amazon Mechanical Turk (Schmidt and Wiegand, 2017; Waseem et al., 2017; Nobata et al., 2016).
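Agreement figures such as the 0.6κ to 0.8κ reported above are chance-corrected scores like Cohen's kappa. A minimal, standard-library sketch of the computation for two annotators follows; the annotations are invented for illustration.

```python
# Cohen's kappa: observed agreement corrected for the agreement two
# annotators would reach by chance, given their label marginals.
from collections import Counter

def cohen_kappa(a, b):
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence, from each annotator's marginals.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[l] / n * cb[l] / n for l in set(a) | set(b))
    return (observed - expected) / (1 - expected)

# Invented binary abuse/no-abuse annotations from two annotators:
ann1 = ["abuse", "abuse", "none", "none", "abuse", "none", "none", "abuse"]
ann2 = ["abuse", "none",  "none", "none", "abuse", "none", "abuse", "abuse"]
print(cohen_kappa(ann1, ann2))  # → 0.5
```

Here the annotators agree on 6 of 8 tweets (observed 0.75), but with balanced marginals chance agreement is 0.5, so kappa is only 0.5, which illustrates why raw percentage agreement overstates consensus on skewed tasks.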
Because of the skewed distribution of benign and abusive microposts, some form of selective sampling is necessary in order to model online abusive language. Several studies employ a list of explicit markers or topics (such as hashtags) of abuse to target microposts that are likely to be abusive (Klubička and Fernández, 2018; Waseem and Hovy, 2016). Explicit markers of abuse may be profanities, swear words or hate terms, sourced from dictionaries such as the popular Hatebase lexicon.1
Generally, this form of selective sampling introduces topic and domain bias, bias against certain subtypes of abuse, and bias towards a very direct, textual, and offensive form of explicit abuse, while ignoring implicit abuse (Klubička and Fernández, 2018; Schmidt and Wiegand, 2017; Ribeiro et al., 2018). Wiegand et al. (2019) compute the pointwise mutual information (PMI) for the six most-used datasets in the field to illustrate the coherence of certain words with classes. PMI is a metric to compare the most informative features of a class. What their paper undoubtedly shows is that almost all datasets have some form of topic bias. For instance, the highest-ranking abuse markers in the Waseem-dataset are commentator and football, making it extremely biased towards the topic of sports. Using explicit markers such as the hashtag WomanAgainstFeminism also biases a dataset towards a certain form of abuse, sexism, but leaves other forms of abuse unscathed.
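The PMI diagnostic described above can be sketched in a few lines. The toy corpus below is invented for illustration, and the unsmoothed estimate is a simplification of what Wiegand et al. (2019) actually compute; the point is only to show how a topic word that co-occurs with a class surfaces as a high-PMI marker.

```python
# PMI(token, class) = log2( P(token, class) / (P(token) * P(class)) ).
# A token that appears only in one class gets a high PMI for that class,
# which is how topic bias (e.g. "football" marking abuse) is detected.
import math

# Invented (text, label) pairs:
docs = [
    ("football commentator is an idiot", "abusive"),
    ("idiot driver cut me off", "abusive"),
    ("great football match today", "benign"),
    ("lovely weather today", "benign"),
]

def pmi(token, label, docs):
    n = len(docs)
    p_label = sum(1 for _, l in docs if l == label) / n
    p_token = sum(1 for t, _ in docs if token in t.split()) / n
    p_joint = sum(1 for t, l in docs if l == label and token in t.split()) / n
    return math.log2(p_joint / (p_token * p_label))

print(pmi("idiot", "abusive", docs))     # → 1.0  ("idiot" only in abusive docs)
print(pmi("football", "abusive", docs))  # → 0.0  (evenly spread over classes)
```

A dataset whose highest-PMI abuse markers are topical words rather than abusive ones is modeling its sampling strategy, not the phenomenon.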
On top of that, the underlying reasoning that abusive users frequently use explicit terms lacks evidence. Davidson et al. (2017) show that only 5% of the tweets collected via the Hatebase lexicon were coded as hate speech, demonstrating the imprecision of this lexicon. They conclude that the presence or absence of particular offensive terms can both help and hinder accurate classification. Fast et al. (2016) show that the vocabulary of hateful users differs from that of their normal counterparts: words related to masculinity, love, and curses occur more often. Ribeiro et al. (2018) compare hateful users to normal users, defining hateful users according to the Twitter guidelines: a hateful user is a user that endorses any type of content that "promotes violence against or directly attack or threaten other people on the basis of race, ethnicity, national origin, sexual orientation, gender, gender identity, religious affiliation, age, disability, or disease."2 Their study shows that hateful users use words related to hate, anger, shame, terrorism, violence, and sadness significantly less than non-hateful users (p-value < 0.001). This in turn questions the usage of bad-keywords lexicons in the data collection process.
Examples of implicit markers of abuse are condescension, minimization, benevolent stereotyping, sarcasm, euphemisms, and micro-aggressions, which are typically linguistically subtle (Waseem et al., 2017; Jurgens et al., 2019). As such, basic word filters do not provide a sufficient remedy for implicit abuse detection (Schmidt and Wiegand, 2017). Several studies note that implicit forms of online abuse are much harder to identify and annotate (Davidson et al., 2017; Ribeiro et al., 2018; Waseem et al., 2017). The literature overview provided by Jurgens et al. (2019) also concludes that implicit forms of abuse are not targeted by current NLP advances. Wiegand et al. (2019) have shown that datasets with a higher proportion of implicit abuse are more affected by the bias introduced in the data collection via focused sampling methods. In order to create a more balanced dataset that generalizes better towards the concept of online abusive language, implicit abuse should be targeted more.
If the set of tweets belonging to a certain class comes from the same or a specific group of authors, this introduces author bias. Ribeiro et al. (2018) and Klubička and Fernández (2018) found that the user distribution of abusive tweets in the Waseem-dataset is highly skewed. More than 70% of all sexist tweets originate from two authors and 99% of the racist tweets originate from a single author. Qian et al. (2018) and Mishra et al. (2018) suggest that author information might be relevant for abuse detection, but evaluated these claims solely on the Waseem-dataset, which is highly affected by author bias. Therefore, Ribeiro et al. (2018) conclude that these claims about relevant author information should be considered less predictive.

1 www.hatebase.org
2
3.3 best practices for dataset creation
Bias: Contradictory annotation guidelines
Proposed solution: Sharply define online abusive language and make use of the two-fold typology provided in Waseem et al. (2017).

Bias: Topic bias
Proposed solution: Avoid focused sampling via bad-keywords dictionaries or controversial microposts and verify the distribution of topics via the most informative features.

Bias: Overemphasis of explicit abuse
Proposed solution: Ribeiro et al. (2018) and Founta et al. (2018) propose focused sampling methods other than bad-keywords dictionaries.

Bias: Underexposure of implicit abuse
Proposed solution: Implicit forms of abuse should be targeted more because of the sparse nature of these forms even within abusive microposts.

Bias: Author bias
Proposed solution: The number of microposts per author should be restricted, and individual author characteristics such as location and gender should be analyzed and balanced.

Table 1: Identified biases and the proposed solutions provided in relevant literature.
The current body of literature on online abusive language proposes several methods aimed specifically at reducing the previously mentioned biases. The typology provided by Waseem et al. (2017) can be used in the data collection and annotation process in order to sharply define the subtasks used (Davidson et al., 2017).
Annotation guidelines should be developed that are based on a clear and accepted definition of online abusive language (Schmidt and Wiegand, 2017; Jurgens et al., 2019). A well-accepted definition constitutes online abusive language as "an omnibus term that includes hurtful, derogatory or obscene utterances expressed towards a targeted group or specific members of that group" (Davidson et al., 2017; Jurgens et al., 2019; Wiegand et al., 2019). Ribeiro et al. (2018) provide a very well-documented overview of the data collection and annotation process and guidelines, which helps reduce contradictions. Additionally, a more transparent documentation of the annotation process and of modeling strategies directed at different types of abuse increases the reproducibility of experiments (Waseem et al., 2017). Annotators could be given the entire profile of an author, instead of individual tweets (Ribeiro et al., 2018; Waseem et al., 2017). The improvement of annotation quality by extending such additional context is a promising research direction (Ribeiro et al., 2018).
Data scarcity should be addressed while minimizing the harm caused by data collection (Schmidt and Wiegand, 2017). Focused sampling methods other than bad-keywords dictionaries are proposed by Ribeiro et al. (2018), who sampled tweets via a semi-supervised network analysis of a set of seed users, and Founta et al. (2018), who used a random sample and applied some heuristics in order to boost the proportion of abusive microposts. During the data-collection process, both explicit and implicit forms of online abuse should be targeted equally (Jurgens et al., 2019). As a means to avoid topic bias, sources of training data should be sought that are hateful without necessarily using particular keywords or explicit offensive language (Davidson et al., 2017). Furthermore, the most informative features of a class should be verified, and the distribution of these features (terms) balanced via sampling additional microposts (Ribeiro et al., 2018).
In order to avoid author bias, the number of microposts per author should be restricted (Ribeiro et al., 2018). To create a more balanced user distribution, the focus should be on the analysis of individual characteristics such as location, gender, and age of the authors (Davidson et al., 2017; Waseem and Hovy, 2016).
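The per-author restriction above amounts to a simple cap during collection. A minimal sketch follows; the tweet tuples and the cap of two tweets per author are invented for illustration.

```python
# Cap the number of microposts kept per author to reduce author bias:
# once an author reaches the cap, their further tweets are skipped.
from collections import defaultdict

def cap_per_author(tweets, max_per_author=2):
    counts = defaultdict(int)
    kept = []
    for author, text in tweets:
        if counts[author] < max_per_author:
            counts[author] += 1
            kept.append((author, text))
    return kept

# Invented (author_id, text) tuples:
tweets = [("u1", "t1"), ("u1", "t2"), ("u1", "t3"), ("u2", "t4")]
print(cap_per_author(tweets))  # → [('u1', 't1'), ('u1', 't2'), ('u2', 't4')]
```

Without such a cap, a single prolific author can dominate an abuse class, as happened in the Waseem-dataset discussed above.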
These dimensions of bias have an effect on the performance of an abusive language detection system. Because of this, Wiegand et al. (2019) propose evaluating a classifier on a dataset different from the one it was trained on, as all classifiers mentioned in their paper perform worse compared to in-domain classification.
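The cross-dataset evaluation proposed by Wiegand et al. (2019) can be sketched with a standard scikit-learn pipeline: fit on one dataset, score on another. The two toy datasets and the bag-of-words linear SVM below are invented stand-ins, not the thesis's actual model or data.

```python
# Cross-dataset evaluation: train on dataset A, evaluate on dataset B,
# so that dataset-specific biases (topic, author) cannot inflate the score.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented "dataset A" used for training:
train_texts = ["you are an idiot", "have a nice day", "total moron", "great game"]
train_labels = ["abuse", "none", "abuse", "none"]

# Invented "dataset B", drawn from a different collection procedure:
test_texts = ["what an idiot", "nice weather"]
test_labels = ["abuse", "none"]

clf = make_pipeline(CountVectorizer(), LinearSVC())
clf.fit(train_texts, train_labels)
print(clf.score(test_texts, test_labels))  # out-of-domain accuracy
```

A large gap between in-domain and out-of-domain scores is exactly the signal that the model has learned the sampling bias rather than the concept of abuse.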
With all these methods in consideration, a less-biased and well-balanced gold-standard benchmark dataset can be developed.
4 DATA
This chapter provides a description of the datasets, data collection, and data annotation. Section 4.1 presents the three datasets used to collect data: HatEval, OffensEval, and AbusEval. Section 4.2 provides the data distribution and discusses the data collection, class label conversion, and processing constraints. Section 4.3 first presents the methodology for annotating abusiveness, including a description of the annotators, the Inter-Annotator Agreement, and the differences between pre-annotated and gold labels. Second, it presents the annotations for explicitness. Third, it looks at the measures and distributions in an attempt to reduce bias, including author bias, previous tweet bias, creation date bias, and tweet text bias.
4.1 datasets
Data for this research were collected using tweet IDs from two shared tasks of the International Workshop on Semantic Evaluation. HatEval is the shared task concerning hate speech towards women and immigrants, and OffensEval is a shared task concerning offensive microposts. Twitter is the domain for both datasets and thus the domain for this research. For the second, more fine-grained binary classification of the explicitness of abuse, a third dataset, namely AbusEval, is used to evaluate the classification results in a cross-domain experimental design. This section will further discuss these datasets. The tweet IDs extracted from the two shared tasks, which were used to collect data via the Twitter API (section 4.2), will be further addressed as the seed (tweet) IDs.
4.1.1 HatEval
The HatEval Task is the fifth task of the International Workshop on Semantic Evaluation 2019 (SemEval): Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter.1 The task is organized in two related classification subtasks: a main binary subtask A for detecting the presence of hate speech, and a second, more fine-grained subtask B to first identify the target (individual or generalized group), as specified by Waseem et al. (2017), and second to identify the presence of aggression (Basile et al., 2019).
Hate Speech (HS) is commonly defined as "any communication that disparages a person or a group on the basis of some characteristics such as race, color, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics" (Nockleby, 2000).
In total there are 13,000 tweets, distributed over a train-set of 9,000 tweets, a development-set of 1,000 tweets, and a test-set of 3,000 tweets. The class distribution over these sets can be found in Table 2. The distribution of the annotated Hate Speech for the two targets of this task, immigrants and women, can be found in Table 3. The data are made publicly available without any preprocessing steps, such as the removal of user mentions.
Data were annotated by non-trained contributors on the crowdsourcing platform Figure Eight2, and the annotation of HS is defined as a binary value indicating whether HS is occurring against one of the given targets, women or immigrants: 1 if it occurs, 0 if not (Basile et al., 2019). Annotators were given a series of guidelines, including the definition of hate speech and a list of examples. The reliability of the annotators was validated with a restricted set of "correct" answers. Three independent judgments for each tweet were required, with relative majority voting. The average confidence (a measure combining Inter-Annotator Agreement, IAA, and the reliability of the contributor) on the English data is 0.83 (almost perfect agreement) for HS (Basile et al., 2019).

1 https://www.aclweb.org/portal/content/semeval-2019-international-workshop-semantic-evaluation
2 http://www.figure-eight.com/

              Train    Dev      Test     Total
Hateful       3,783    427      1,260    5,470
Non-Hateful   5,217    573      1,740    7,530
Total         9,000    1,000    3,000    13,000

Table 2: HatEval 2019 class distribution over the train-, dev- and test-set.
              Training         Test
Label         Imm.    Women    Imm.    Women
Hateful       39.76   44.44    42.00   42.00
Non-Hateful   60.24   55.56    58.00   58.00

Table 3: Distribution percentages across sets for the HS binary annotation layer, taken from Table 1 of Basile et al. (2019).
The presence of hateful tweets in the training- and test-set accounts for about 40% of the total tweets in the sets. Hateful tweets are therefore over-represented with respect to the distribution observed in the data collected from Twitter: in the whole originally annotated dataset, only about 10% contained hate speech, which is more in line with the estimates mentioned by Founta et al. (2018) and Ribeiro et al. (2018) in chapter 3.
The highest-ranking team scored a macro-averaged F1-score of 0.651 by training a Support Vector Machine (SVM) with a Radial Basis Function (RBF) kernel only on the provided data, exploiting sentence embeddings from Google's Universal Sentence Encoder (Cer et al., 2018) as features.
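The structure of that winning system can be sketched as follows. The actual system embedded tweets with the Universal Sentence Encoder; here random vectors stand in for those embeddings, and the labels are invented, so this illustrates only the model setup, not the reported performance.

```python
# An SVM with an RBF kernel over fixed-size sentence embeddings:
# each tweet becomes one dense vector, and SVC learns a non-linear
# decision boundary between hateful (1) and non-hateful (0) tweets.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 512))       # stand-in 512-dim embeddings
y_train = np.array([0, 1] * 10)            # invented binary HS labels

clf = SVC(kernel="rbf")                    # Radial Basis Function kernel
clf.fit(X_train, y_train)
print(clf.predict(X_train[:3]).shape)      # one binary label per tweet
```

The appeal of this setup is that all feature engineering is delegated to the pretrained sentence encoder, leaving only the kernel SVM to tune.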
The authors conclude that hate speech detection against women and immigrants on Twitter is a challenging field with large room for improvement, and they hope that the dataset made available as part of this shared task fosters further research on the topic.
4.1.2 OffensEval
The OffensEval shared task concerns a series of shared tasks on offensive language identification organized at the International Workshop on Semantic Evaluation (SemEval).3 OffensEval models offensive content using a hierarchical annotation of three subtasks: (A) a binary classification of offensive and non-offensive microposts, (B) a more fine-grained binary subtask of whether the micropost has a target, and (C) a multi-class subtask for the target identification (Zampieri et al., 2020). OffensEval concerns two consecutive shared tasks:

1. OffensEval 2019, Task 6 of SemEval 2019: Identifying and Categorizing Offensive Language in Social Media.4

2. OffensEval 2020, Task 12 of SemEval 2020, OffensEval 2: Multilingual Offensive Language Identification in Social Media.5

Offensive microposts are defined as "posts containing any form of targeted or untargeted, non-acceptable language or profanity, which can be veiled or direct".

3 https://sites.google.com/site/offensevalsharedtask
4 https://www.aclweb.org/portal/content/semeval-2019-international-workshop-semantic-evaluation
5 https://www.aclweb.org/portal/content/semeval-2020-international-workshop-semantic-evaluation
This includes insults, threats, and posts containing profane language or swear words (Zampieri et al., 2019b).
Data for subtask A were annotated as offensive (OFF) or non-offensive (NOT); the class distribution over the train- and test-sets can be found in Table 4. The data for OffensEval 2019 and 2020 are enumerated below:
1. Data for the OffensEval 2019 shared task come from the Offensive Language Identification Dataset (OLID), which contains 14,100 English tweets in total, 13,240 of which are in the training-set and 860 in the test-set (Zampieri et al., 2019a).

2. Data for the OffensEval 2020 shared task come from the Semi-Supervised Offensive Language Identification Dataset (SOLID), which contains 12 million tweets in total, 9 million of which are in the training-set and 3 million in a test-set (Rosenthal et al., 2020). As this shared task is still ongoing, the original test-set has yet to be released, and the data used consist of the training-set. An analysis-set, named OffensEval 2020 test, was provided to test classifiers on and is also used in the experiments in this research.
              OffensEval 2019               OffensEval 2020
         Train     Test    Total       Training       Test
OFF      4,400      240    4,640              -      1,080
NOT      8,840      620    9,460              -      2,807
Total   13,240      860   14,100      9,089,140      3,887

Table 4: Data distribution for subtask A of English OffensEval 2019 and OffensEval 2020 Training, taken from Table 3 of Zampieri et al. (2019b) and Table 3 of Zampieri et al. (2020).
OLID for SemEval 2019 was annotated using the crowdsourcing platform Figure Eight.6 Experienced annotators were verified using test questions to discard annotators who did not achieve a certain threshold. All tweets were annotated by two people. In case of disagreement, a third annotation was requested and the majority label was used (Zampieri et al., 2019a).
OffensEval 2020 Training was collected from Twitter using the 20 most common English stopwords, such as the, of, and, to, etc., to ensure random tweets were collected. Langdetect7 was used to select English tweets, and tweets with fewer than 18 characters were discarded. OffensEval 2020 Training was then labeled in a semi-supervised manner using democratic co-training with OLID as a seed dataset (Zampieri et al., 2020). Four models were used in the semi-supervised approach: PMI (Turney and Littman, 2003), FastText (Joulin et al., 2016), LSTM (Hochreiter and Schmidhuber, 1997), and BERT (Devlin et al., 2018). The average confidence scores for the ensemble of the four models are added in the publicly available OffensEval 2020 Training, but no labels are added to the training data. Offensive tweets (OFF) for the test-set were selected using this semi-supervised approach and annotated manually; 2,500 non-offensive tweets (NOT) were included using this approach without being annotated. Inter-Annotator Agreement was computed on a small subset of offensive (OFF) instances and is 0.988 (almost perfect agreement) for subtask A (Zampieri et al., 2020).
Both OLID and OffensEval 2020 Training follow the same annotation schema and have been anonymized and normalised using the same methods: no user metadata or Twitter IDs were stored, and URLs and Twitter user mentions were substituted by @URL and @USER placeholders (Zampieri et al., 2019a; Rosenthal et al., 2020). For OffensEval 2020 Training, URLs are present in the training-set; tweets containing URLs are also present in the OffensEval 2020 test-set. Table 5 shows four examples and their annotations taken from OffensEval 2020 Training.
6 http://www.figure-eight.com/
7
ID    annotation   tweet text
A1    OFF          @USER @USER He's an evil law breaker that should be in prison with his criminal heartless family
A47   NOT          i'm not hating on itzy ofc they do their thing but the fact that yall are a bunch of hypocrites
A49   OFF          what a good liar and pretender :joy: :joy: :joy:
A95   OFF          @USER This is disgusting - you ought to be ashamed of yourself.

Table 5: Four examples and their annotations taken from OffensEval 2020 test.
Seven among the ten top-performing teams for OffensEval 2019 used BERT, a pre-trained deep bidirectional Transformer (Devlin et al., 2018). The top-performing team used BERT-base-uncased with default parameters and a maximum sentence length of 64, trained for 2 epochs, reaching a macro F1-score of 0.829, 1.4 points higher than the second team. The top ten teams for OffensEval 2020 used BERT (Devlin et al., 2018), RoBERTa-base and XLM-RoBERTa-large (Liu et al., 2019) trained on subtask A, sometimes as an ensemble that also included CNNs (Kim, 2014) and LSTMs (Hochreiter and Schmidhuber, 1997). The best-performing team achieved a macro F1-score of 0.9204 using an ensemble of ALBERT models of different sizes. Overall, the organizers of OffensEval 2019 and 2020 observe a trend that the best teams in all languages and subtasks used models with pre-trained contextual embeddings, most notably BERT.
4.1.3 AbusEval
AbusEval is a newly created resource that aims to address some of the existing issues in the annotation of offensive and abusive language with a dataset that takes into account the degree of explicitness (Caselli et al., 2020). This dataset is specifically created to evaluate abusive annotations on an explicit/implicit axis, as defined by the typology of Waseem et al. (2017). It is an annotation layer that makes use of the data in OffensEval's 2019 OLID. This more fine-grained annotation layer is added on top of OLID and will serve for the evaluation of the explicitly and implicitly abusive annotated data of this study. This is a sequential classification: an annotation is either non-abusive (NOT) or abusive (ABU), and if abusive, it is either explicitly abusive (EXP) or implicitly abusive (IMP).
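As an illustration, the sequential scheme can be sketched as a small helper function; the function and its names are hypothetical, not part of AbusEval:

```python
def sequential_label(is_abusive, is_explicit=None):
    """Map the two binary decisions of the sequential scheme onto labels.

    First decision: non-abusive (NOT) vs abusive (ABU); only abusive
    instances receive the second, explicit (EXP) vs implicit (IMP) label.
    """
    if not is_abusive:
        return ["NOT"]
    if is_explicit is None:
        raise ValueError("abusive instances need an explicit/implicit judgement")
    return ["ABU", "EXP" if is_explicit else "IMP"]
```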
The data distribution of AbusEval is shown in Table 6.
Label    Train    Test    Total
NOT     10,491     682   11,173
EXP      2,023     106    2,129
IMP        726      72      798
Total   13,240     860   14,100

Table 6: Train and test data distribution for the non-abusive (NOT), explicitly (EXP) and implicitly (IMP) abusive annotations of AbusEval.
4.2 Collection
Data for this research were sourced via all the tweet IDs of the dataset used in the HatEval shared task (section 4.1.1) and via a selection, based on two heuristics detailed in section 4.2.2, of the 9 million tweet IDs used in the OffensEval shared task (section 4.1.2). Data for this study were re-collected for two main reasons. First, the focal point of this research is the inclusion of contextual features in the form of user-related features, such as the user description, and a micropost-related feature in
the form of the previous tweet: the tweet that the current tweet is responding to. These features are missing in the publicly available datasets for both shared tasks. Second, there are discrepancies in the preprocessing and anonymization steps taken in OffensEval 2020 Training and HatEval.
In order to collect un-preprocessed and non-anonymized data with contextual features, data were re-collected via source tweet IDs from the HatEval and OffensEval shared tasks in the first two weeks of March 2020 via the Twitter API. Data were collected via Tweepy8, an easy-to-use Python library for accessing the Twitter API. To illustrate, a status or tweet is retrieved via the following GET method:

api.get_status(tweet_id, tweet_mode='extended')
Since HatEval’s dataset is annotated for a binary classification of hate speech and OffensEval 2020 Training is annotated for a binary classification of offensiveness, label conversion is necessary in order to create a dataset for the binary classification of abuse.
4.2.1 Twitter API
Data were collected during the first two weeks of March 2020 using the Twitter API with a registered app named 'abuse-dataset' with APP ID 17476346.9 The Twitter API limits the use rate to 900 requests per 15 minutes. For time-wise feasibility, data collection was migrated to a server that uses a distributed task queue10 implemented in a web-app developed in the popular Python web framework Django.11 This allowed for a continuous run of the data-collection process. The features extracted from the Twitter API for the tweets and Twitter users can be found in Table 30 and Table 31, presented in Appendix A.1. Data are stored in a relational database scheme of tweets, Twitter users, hashtags, URLs, user mentions, and symbols. An overview of the relational database scheme can be found in Appendix A.2.
4.2.2 Class Label Conversion
For HatEval, data were collected via the seed tweets in the original training-, development- and test-set found in Table 2. Tweets that contain hate speech were annotated with the abusive class (ABU), and consequently, tweets that do not contain hate speech were annotated with the non-abusive class (NOT).
All sets of the HatEval source data (train-, dev- and test-set) have been re-collected and annotated. There are only 13,000 tweets in the HatEval dataset, and after preprocessing only a small subset of tweets is left to be used in the experiments.
For OffensEval 2020 Training there are no pre-annotated labels available; instead, the average confidence score of the four ensemble classifiers is available. Data were collected via two heuristics. First, tweets with an average confidence score over 0.87 were collected and pre-annotated with the abusive class (ABU). Second, tweets with an average confidence score under 0.11 were collected and pre-annotated with the non-abusive class (NOT). Figures 1 and 2 show the distribution of the average confidence scores from the four ensemble classifiers of OffensEval 2020 Training.
These thresholds were chosen to constrain the data collection process in time and quantity. OffensEval 2020 Training is annotated for offensiveness, and an assumption in the pre-annotated labels is that tweets that are highly likely to be non-offensive are also non-abusive (confidence < .11) and that tweets that are highly likely to be offensive are abusive (confidence > .87).
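A minimal sketch of this pre-annotation heuristic, assuming the average ensemble confidence score is available as a float per tweet:

```python
ABU_THRESHOLD = 0.87  # above this, a tweet is pre-annotated as abusive (ABU)
NOT_THRESHOLD = 0.11  # below this, a tweet is pre-annotated as non-abusive (NOT)

def pre_annotate(avg_confidence):
    """Pre-annotate a tweet from the average ensemble confidence score.

    Tweets between the two thresholds are not collected and return None.
    """
    if avg_confidence > ABU_THRESHOLD:
        return "ABU"
    if avg_confidence < NOT_THRESHOLD:
        return "NOT"
    return None
```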
Table 7 shows the class distribution of total collected data per source dataset.
8 https://www.tweepy.org/
9 https://developer.twitter.com/en
10 https://docs.celeryproject.org/en/stable/index.html
11 https://www.djangoproject.com/
Figure 1: Average confidence scores for the non-abusive (NOT) class from OffensEval 2020 Training.
Figure 2: Average confidence scores for the abusive (ABU) class from OffensEval 2020 Training.
               HatEval                     OffensEval
         Train     Dev    Test     c < .11    c > .87     Total
ABU      3,783     427   1,260        n.a.     22,725    28,195
NOT      5,217     573   1,740      11,385       n.a.    18,915
Total    9,000   1,000   3,000      11,385     22,725    47,110

Table 7: Pre-annotated class distribution of data sourced from HatEval and OffensEval via seed tweet IDs.
4.2.3 Data Distribution
The tweet IDs sourced from HatEval's dataset were created within a time frame of August 2010 until September 2019, with the majority of the tweets (4,072) created in the last five months of the data collection period. The collected tweets for OffensEval 2020 Training have a creation date within a time frame of April 2009 until October 2019, with the majority of the tweets (26,625) in the last four months of data collection. October and August account for 24,896 of these 26,625 total collected tweets, whereas September only accounts for 32 tweets.
Since data were re-collected via the Twitter API within the first two weeks of March, some tweets or users might have been deleted between the publishing dates of the HatEval dataset and OffensEval 2020 Training and the re-collection, either by the Twitter users themselves or by the medium Twitter as a result of violating its guidelines.12
Table 8 shows the distribution of the pre-annotated classes over the data that have been retrieved via the Twitter API. These represent the active tweets for which no exception from the Twitter API was returned. Examples of these exceptions are shown in Figure 3. Some information, tweets or users, has been lost because of these Twitter API exceptions. The HatEval dataset was reduced from 13,000 tweets in the original dataset to 7,708 active tweets retrieved via the Twitter API as of March 2020. OffensEval 2020 Training was reduced from 34,110 tweets to 26,980 active tweets retrieved via the Twitter API as of March 2020.
               HatEval                     OffensEval
         Train     Dev    Test     c < .11    c > .87     Total
ABU      2,010     225     726        n.a.     17,215    20,176
NOT      3,442     341     964       9,765       n.a.    14,512
Total    5,452     566   1,690       9,765     17,215    34,688

Table 8: Pre-annotated class distribution of the data after collection via the Twitter API.
An interactive visualization of the demographics of both the HatEval and OffensEval datasets over the abusive class, the non-abusive class or both can be found online.13 The images below show two of the interactive visualizations.
12 https://help.twitter.com/en/rules-and-policies
13
Figure 3: Distribution of the exceptions and tweets that are replying for the HatEval dataset.
Figure 4: Most used terms by count for the non-offensive (NOT) class tweets extracted from the OffensEval dataset.
4.2.4 Processing Constraints
Several processing constraints, outlined in the enumeration below, ensure that the collected tweets contain all the contextual features in the @replies dataset. Moreover, tweets that do not reply to a previous tweet were excluded from the @replies dataset. The features or model fields that support these decisions can be found in Appendix A.1.
First, the tweet text should not be truncated. This is achieved by using the tweet_mode='extended' argument, as shown in the Tweepy illustration in section 4.2. Second, the tweet should be active, so that the @replies dataset does not include exception messages. Third, the tweet cannot be a quoted tweet, a case where the tweet is quoting another tweet. Fourth, the tweet should have a parent tweet that it is replying to. Fifth, this parent tweet should not have the same author, i.e. the author is not replying to themselves.
To summarise:
1. truncated: Tweets cannot be truncated.
2. active: Tweets must be active.
3. is_quote_status: Tweets cannot be quoted.
4. in_reply_to_status_id: Tweets must be a reply to another tweet.
5. in_reply_to_self: Tweets cannot be a reply by the author to themselves.
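The five constraints can be combined into a single filter predicate. The sketch below assumes each tweet is a flat dict whose keys mirror the model fields above; the exact field names are hypothetical stand-ins for the Twitter API attributes listed in Appendix A.1:

```python
def passes_constraints(tweet):
    """Return True only if a tweet satisfies all five @replies constraints."""
    return (
        not tweet.get("truncated", False)                    # 1. full text available
        and tweet.get("active", False)                       # 2. no API exception
        and not tweet.get("is_quote_status", False)          # 3. not a quoted tweet
        and tweet.get("in_reply_to_status_id") is not None   # 4. replies to a parent tweet
        and tweet.get("in_reply_to_user_id") != tweet.get("user_id")  # 5. not a self-reply
    )
```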
Table 9 shows the class distribution after the processing constraints and represents the dataset available for annotations. In total there were 10,121 tweets in the @replies dataset that met the summarised preprocessing criteria and were available for the judgement of the human annotators. The HatEval dataset was reduced from the 7,708 active tweets retrieved via the Twitter API as of March 2020 to 1,447 tweets after the application of the preprocessing steps. OffensEval 2020 Training was reduced from the 26,980 active tweets retrieved via the Twitter API as of March 2020 to 8,674 tweets after the application of the preprocessing steps.
               HatEval                     OffensEval
         Train     Dev    Test     c < .11    c > .87     Total
ABU        411      64     198        n.a.      5,818     6,491
NOT        406      54     314       2,856       n.a.     3,630
Total      817     118     512       2,856      5,818    10,121

Table 9: Pre-annotated class distribution of the @replies dataset after the application of preprocessing steps.
4.3 Annotation
The name of the annotated dataset is the @replies dataset. Tweets have been annotated by 41 student annotators, including the researcher, for the binary classification of abuse, in three annotation rounds:
1. Abusive and non-abusive binary annotation round by 41 student annotators including the author of the research.
2. Gold-label re-annotation of the abusive and non-abusive binary annotation for the tweets that did not reach consensus.
3. Explicit and implicit annotation round by the author of the research for all tweets that were annotated in the previous two rounds as abusive.
The first annotation round resulted in a total of 1,958 annotations, 749 of which had conflicting labels (i.e. two or more annotators gave different labels to the tweet). These conflicting tweets have been re-annotated by the researcher in a second, gold-label annotation round. Inter-Annotator Agreement Fleiss Kappa scores for each group and pair-wise Cohen Kappa scores are shown in Appendix B. Of the abusive tweets, 3,256 were annotated as explicit (EXP) and only 403 as implicit (IMP) by the researcher of this study in a third annotation round.
Figure 5: Graphical interface for the annotations with the user features, previous tweet and two buttons for the binary classification.
Annotation guidelines for the abuse/no-abuse classification can be found in Appendix D.1. Annotation guidelines for the explicit/implicit classification can be found in Appendix D.2.
Annotations were facilitated by the Django web-app previously introduced in section 4.2. A visualization of the annotation app is shown in Figure 5. The user features are shown on the left and include the profile image, screen name, user name, user description, and user location, as well as the number of followers and friends. For both the current tweet and the previous tweet (shown in grey), the text and the number of favorites and retweets are shown.
4.3.1 Abusiveness
There are 1,447 annotated tweets from the HatEval source data, and 8,674 annotated tweets from OffensEval 2020 Training in the @replies dataset.
               HatEval                        OffensEval
         Train   Dev   Test    c < .11   c > .87   Friends   Source missing    Total
ABU        410    79    267          6     2,897         0                0    3,659
NOT        407    39    245      2,850     2,921        21                8    6,491
Total      817   118    512      2,856     5,818        21                8   10,150
                1,447                        8,695                        8

Table 10: Class distribution of abusive (ABU) and non-abusive (NOT) tweets for the @replies dataset.
There are 29 tweets that are sourced from friends: followers of the users in our source data who are also followed by that user (follow + follow back). 21 of these are friends with users from the OffensEval dataset. In a promising research direction, over 35,000 tweets from users that are friends with Twitter users in our source data have also been scraped, of which only a very small portion (29) was annotated. Foremost this is due to the time-wise feasibility of this research. Moreover, no boosting mechanism was used to promote abusive tweets except for scraping tweets from abusive users. Inspection of the tweets and their annotations showed that the vast majority of these tweets were non-abusive. Only a small portion was included, to maintain a distribution of roughly one-third abusive tweets and two-thirds non-abusive tweets.
To conclude, from HatEval and OffensEval source tweets plus the tweets sourced by friends there are 10,150 annotated tweets for abusiveness: 3,659 annotated as the abusive (ABU) class and 6,491 as the non-abusive (NOT) class.
Annotators
A total of 44 student annotators and the researcher, divided over 14 groups, annotated the tweets for the abusive/non-abusive binary classification. Each student was assigned 150 tweets: 50 in common with the other student annotators in their group, and 100 tweets that they annotated individually without overlap with other student annotators. The researcher of this study annotated all tweets in the annotation set. 42 annotators annotated 150 tweets and only two annotators annotated less than that: ann27 annotated only 17 tweets and ann10 annotated 103 tweets. The annotations were part of a Bachelor's Information Science course for which credit was awarded if 100 tweets were annotated, with a bonus credit added if 150 tweets were annotated. It is likely that ann10 annotated just over the threshold of 100 annotations as a result of this. The student annotators provided 6,420 annotations. Demographics of the annotation (Bender and Friedman, 2018) for 41 of the 45 annotators can be found in Appendix C.
Inter-Annotator Agreement
Inter-Annotator Agreement was calculated both as Cohen's Kappa scores for pair-wise annotators in a group and as an overall Fleiss Kappa score per group. Both Cohen's Kappa and Fleiss Kappa metrics for all groups are presented in the tables in Appendix B. The mean Fleiss Kappa score over all groups is 0.5506 (moderate agreement). A threshold of 45 overlapping annotations was required in order to ensure consistency of the IAA evaluation metrics. Ann27, who only annotated 17 tweets, consequently does not meet this threshold and is removed from the Inter-Annotator Agreement evaluation. Fleiss Kappa is especially low for groups 2, 5 and 12. Further inspection of these groups did not yield conclusive results, except that the overall majority of annotations that did not achieve consensus had a class distribution of three-to-one: either three annotations as abusive and one as non-abusive, or vice-versa. In group 2, 78.26%, in group 5, 80.0% and in group 12, 80.85% of these annotations had a three-to-one class distribution.
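For reference, pairwise Cohen's Kappa can be computed directly from two annotators' label sequences; the snippet below is a minimal sketch of the metric, not the exact script used in this research:

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators who labelled the same tweets."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected chance agreement, from each annotator's label frequencies.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)  # undefined if expected == 1
```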
Gold Label Annotation
In total there were 749 annotations for which no consensus was achieved: at least one annotation differed from the other annotations. Because of the difficulty of the task, as shown by the low mean Fleiss Kappa score, another annotation round was conducted by the researcher of this paper to obtain gold annotation labels. The researcher re-annotated the 749 tweets without consensus and provided gold labels for these annotations.
Pre-annotated & Gold Labels
Inspection of the distribution of the pre-annotated labels in comparison to the gold labels after the annotation rounds, for both the HatEval dataset and OffensEval 2020 Training, proved relevant for two main reasons.
First, for HatEval 497 (34.35%) and for OffensEval 2020 Training 2,934 (33.74%) of the pre-annotated labels were changed, either from abusive (ABU) to non-abusive (NOT) or vice-versa. The main difference between the annotation schemes consists of the additional inclusion of context-related features in the annotation process in this research. Roughly a third of the annotations changed, arguably due to the inclusion of these context-related features. The relevant insight here is that context changes annotations.
Second, offensive labelled tweets from OffensEval 2020 Training, with an average ensemble-classifier confidence over .87, have been reduced from 5,818 offensive tweets to only 2,897 abusive tweets. This is a strong indication that offensiveness does not imply abusiveness. Tweets that contain offensive swear words can very well be non-abusive. Following this indication, we can say that offensive swear words are used to enforce both non-abusive and abusive messages.
gold: abuse    pre-annotation: no-abuse    tweet ID: 1029489592905347074
previous: The Westminster terrorist has been named as Salih Khater, a Sudanese Muslim migrant (now UK national) from Birmingham. https://t.co/HOiAtBIpV5
current: @USER Also, water is wet. #SendThemBack

gold: abuse    pre-annotation: no-abuse    tweet ID: 919096901818925056
previous: My barber loyal. Just like me https://t.co/EflMIchbxI
current: @USER You a lie and your barber a h** too n**** your line up looking like a serrated knife

gold: no-abuse    pre-annotation: abuse    tweet ID: 1039390652755267584
previous: These are the heroes who risk their lives every day. #CopsLivesMatter https://t.co/3wC8LSX4rW
current: @USER Get these criminals off the roads and off the streets! #BackTheBlue #BuildThatWall

gold: no-abuse    pre-annotation: abuse    tweet ID: 919097303457132544
previous: i swear, i feel like ppl be taking my tweets just switching some words. LOLOL but then again
current: @USER Girl people really do that, like damn b**** you so selfish you don’t wanna rt me and get me clout

Table 11: Tweets with gold and pre-annotated label differences for tweets sourced from the HatEval 2019 dataset.
Table 11 details the difference between the pre-annotated labels and the gold labels from this research for four tweets from the HatEval source data. The first tweet, with ID 1029489592905347074, is very sarcastic, but the sarcasm is only made relevant by the content of the previous tweet. Implying that the statement that a terrorist is a Sudanese Muslim is as relevant as the statement that water is wet is very implicitly abusive. The second tweet, with ID 919096901818925056, has very explicit content that is pre-annotated as non-abusive, when it is very abusive. This, in combination with the low Fleiss Kappa scores observed in the annotation rounds of this research, reinforces the argument that annotating abusiveness in microposts is a difficult task in and of itself.
The third tweet, with ID 1039390652755267584, shows that the current tweet is reinforcing the statement made in the previous tweet that cops are heroes who get criminals off the streets. This is however only made relevant by the context provided in the previous tweet. The fourth tweet, with ID 919097303457132544, again shows that the current tweet is responding, and the offensive or abusive part "damn b****" is a generalized statement. This generalized statement reinforces what the previous tweet is actually saying, and this abuse does not hold a target, making it non-abusive. Because the previous tweet is non-abusive and the current tweet only reinforces its statement, it should therefore also be non-abusive.
gold: abuse    pre-annotation: no-abuse    tweet ID: 1158331171819966465
previous: Mitch McConnell fractures his shoulder after tripping https://t.co/0IENh94lAh
current: @USER Thoughts and prayers :vomit:

gold: abuse    pre-annotation: no-abuse    tweet ID: 1159528346373349376
previous: Looking forward to joining @USER today on @USER in #Kentucky to discuss the news of the day. Tune in live at 4:05 ET today: https://t.co/TKlXiZ1ism
current: @USER @USER @USER Hopefully the news of the day is that you are retiring.

gold: no-abuse    pre-annotation: abuse    tweet ID: 1187550316981391360
previous: Closed my eyes and just had the weight of my city of Staten Island my family and the wu tang legend @USER on my back https://t.co/wt6cg7nukl
current: @USER @USER You killed that s*** @USER

gold: no-abuse    pre-annotation: abuse    tweet ID: 1186821711511130114
previous: WHY DID I TAKE SO LONG TO TRY APEX I LOVE THIS S***
current: @USER That s*** fun if your whole team kill thirsty

Table 12: Tweets with gold and pre-annotated label differences for tweets sourced from OffensEval 2020 Training.
Table 12 shows that the data are indeed annotated for offensiveness and not abusiveness, and that the assumption that offense equals abuse, as mentioned in section 4.2.2, does not hold. The first two tweets (tweet IDs 1158331171819966465 and 1159528346373349376) do not contain swear words but do have a very sarcastic tone. The first tweet, with ID 1158331171819966465, contains a throw-up emoticon after "Thoughts and prayers" (unfortunately not rendered in the LaTeX PDF14), making the sarcastic tone very explicit. The average confidence score from the four ensemble classifiers is 0.0951. The second tweet is very sarcastic in nature, implying that the author of the previous tweet should retire. This tweet is not in the OffensEval bucket of confidence scores under .11, making the case unique. It is likely that this tweet was annotated as offensive with a high confidence score because of the hashtag #SendThemBack, which was not annotated as abusive in our annotation round. The two following tweets (tweet IDs 1187550316981391360 and 1186821711511130114) contain the swear word s***, but the context of the tweet shows that it is merely used to enforce their statement, which is a positive message.
In Appendix F.1 and F.2 a total of ten tweets per dataset can be found that contain differences between the pre-annotated and gold labels. These analyses also show the difference between data annotated for hate speech or offensiveness and data annotated for abusiveness.
4.3.2 Explicitness
Data that have been annotated as abusive by the student annotators have also been annotated on the explicit (EXP) and implicit (IMP) axis as defined in the typology of Waseem et al. (2017). In total there are 3,659 abusive annotations, 3,256 annotated as explicitly abusive (EXP) and 403 as implicitly abusive (IMP). The distribution of the explicit and implicit class is much more skewed: the implicitly abusive tweets represent only about 11% of the total annotated tweets. Table 13 shows the distribution of the classes over the sets in the HatEval dataset and OffensEval 2020 Training.
The few (6) annotations with low ensemble-classifier confidence scores for offensiveness in OffensEval 2020 Training are all annotated as implicitly abusive. Furthermore, 58 annotations with high confidence scores for offensiveness are annotated as implicitly abusive. These tweets are thus classified as explicit by the ensemble classifiers of OffensEval 2020 Training, but annotated as implicit. Inspection of these 58 tweets could further indicate why. An example is tweet ID 1161509981348401152 with an average confidence score of 0.8946 for offensiveness: "@USER Except Trump is NOT
               HatEval                  OffensEval
         Train   Dev   Test    c < .11   c > .87    Total
EXP        215    38    164          0     2,839    3,256
IMP        195    41    103          6        58      403
Total      410    79    267          6     2,897    3,659
                 756                         2,903

Table 13: Class distribution of explicitly abusive (EXP) and implicitly abusive (IMP) tweets for the @replies dataset.
A RACIST and that makes you a liar...". This tweet is likely tagged as offensive because of the capitalized RACIST, but it is annotated in our dataset as implicitly abusive, since it implies that the @USER has a deceiving nature. Another example is tweet ID 1185903137573560322 with an average confidence score of 0.8970 for offensiveness: "@USER Shut ur mouth - you couldn’t even wipe ur own ass for 95% of thos". This tweet is likely classified as offensive because of ass, and it shows the difficulty of annotating abuse, as it is annotated as implicitly abusive in our annotations.
4.3.3 Bias Reduction
In order to balance the dataset as much as possible, as discussed in detail in chapter 3, two dimensions of bias are accounted for: author bias and bias in the contextual feature of the previous tweet. More potential dimensions of bias have been introduced in section 3.2, but only these two have been accounted for. In order to account for author bias, tweets written by the same author were removed from the @replies dataset used in the experiments. Some of the previous tweets did not contain text; empty previous tweets are also removed from the @replies dataset used in the experiments.
Because of these two steps, all annotated data comes from unique single authors and contains a textual previous tweet that the current tweet is responding to. This resulted in a more evenly distributed annotation set of 8,129 tweets.
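The two bias-reduction steps can be sketched as follows, assuming each tweet is a dict with hypothetical 'author_id' and 'previous_text' keys; this reading keeps the first tweet per author, which is one way of removing same-author duplicates:

```python
def reduce_bias(tweets):
    """Drop tweets with an empty previous tweet, then keep one tweet per author."""
    seen_authors = set()
    kept = []
    for tweet in tweets:
        if not tweet.get("previous_text"):
            continue  # previous-tweet bias: no textual context to annotate against
        if tweet["author_id"] in seen_authors:
            continue  # author bias: at most one tweet per unique author
        seen_authors.add(tweet["author_id"])
        kept.append(tweet)
    return kept
```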
               HatEval                        OffensEval
         Train   Dev   Test    c < .11   c > .87   Friends   Source missing    Total
ABU        221    62    170          6     2,401         0                0    2,860
NOT        235    24    154      2,485     2,346        18                7    5,269
Total      456    86    324      2,491     4,747        18                7    8,129
                 866                         7,256                        7

Table 14: Class distribution of abusive (ABU) and non-abusive (NOT) tweets for the @replies dataset when duplicate authors and empty previous tweets are removed.
               HatEval                        OffensEval
         Train   Dev   Test    c < .11   c > .87   Friends   Source missing    Total
EXP        136    34     99          0     2,354         0                0    2,623
IMP         85    28     71          6        47         0                0      237
Total      221    62    170          6     2,401         0                0    2,860
                 453                         2,407                        0

Table 15: Class distribution of explicit (EXP) and implicit (IMP) abusive tweets for the @replies dataset when duplicate authors and empty previous tweets are removed.
Table 14 shows the abusive/non-abusive class distribution of the @replies dataset used for the experiments. There are no duplicate authors and no tweets with empty previous tweet texts; however, there are still 362 tweets that share their parent tweet with at least one other tweet. Table 15 shows the class distribution of the explicit and implicit abusive classes, where there are still 214 tweets that share their parent tweet.
Author Bias
Of the 10,150 annotated tweets, there are 148 tweets that share an author with another tweet. 144 of these duplicate authors are from the HatEval 2019 dataset and only four are from OffensEval 2020 Training.
For instance, one of the users has twelve tweets to his name and is a bot that responds with the time when you ask it the time: "@USER the time is indeed 08/12/2019 09:09:36 thanks for contacting us.". Many of the duplicate authors only have two tweets to their name; however, there are some outliers. There are 15 authors that have authored more than five tweets in the @replies dataset. One author, originally from the HatEval 2019 dataset, has 46 tweets to his name and does not appear to be a bot or spam account.
Previous Tweet Bias
There are two types of bias related to the previous tweets: empty previous tweets and duplicate previous tweets. There are 1603 tweets in the @replies dataset that have an empty previous tweet: indicating that either the Twitter author has removed the tweet’s text or there was an exception returned from the Twitter API as shown in
4.2. There are 405 tweets that have the same previous tweet that they are responding
to. It is likely that these duplicate parent tweets are present in the HatEval and OffensEval datasets due to their data collection methodology.
Inspection of the tweets that share a parent tweet showed that, for the abusive/not abusive annotations, two tweets with the same parent tweet do not necessarily share the same class: in some cases one is abusive and the other is not, or vice versa. For the overwhelming majority of tweets with the same parent tweet, however, the gold label annotations were equal. For the explicit/implicit axis, all of the tweets sharing the same parent tweet had the same class label. Examples of these phenomena are shown in Table 16 below.
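This label-agreement check can be expressed as grouping replies by parent tweet and comparing their gold labels. A sketch, where the grouping structure is illustrative and only the tweet IDs are taken from Table 16:

```python
from collections import defaultdict

# (tweet_id, parent_id, gold_label) triples; parent ids are invented.
rows = [
    (1161145025620672512, 100, "abuse"),
    (1160662948701884417, 100, "no-abuse"),
    (1158331171819966465, 200, "implicit"),
    (1038010523852398592, 200, "implicit"),
]

labels_by_parent = defaultdict(set)
for _, parent, label in rows:
    labels_by_parent[parent].add(label)

# Parents whose replies were annotated with more than one distinct gold label.
disagreeing_parents = {p for p, labels in labels_by_parent.items() if len(labels) > 1}
```

Here the first pair of replies disagrees on the abuse/no-abuse label while the second pair agrees on implicit, mirroring the pattern reported above.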
gold: abuse      tweet ID: 1161145025620672512
  previous: F*** THIS SOB! Twitter you know what to do. FIND THIS SOB! https://t.co/YkKYxXb5Wz
  current: @USER What a piece of s***

gold: no-abuse   tweet ID: 1160662948701884417
  previous: F*** THIS SOB! Twitter you know what to do. FIND THIS SOB! https://t.co/YkKYxXb5Wz
  current: @USER What the f*** Just, what the f***.

gold: implicit   tweet ID: 1158331171819966465
  previous: Cory Booker & Kamala Harris competing for Most Hysterical Woman at the Kavanaugh hearings.
  current: @USER Ann, ever think you might be a bit hysterical yourself. You lost your ability to reason long ago. Not sure when you had the heartectomy. One thing for sure. You don’t belong in the human race. Damn.

gold: implicit   tweet ID: 1038010523852398592
  previous: Cory Booker & Kamala Harris competing for Most Hysterical Woman at the Kavanaugh hearings.
  current: @USER I thought you already had the title. Oh wait, you said most hysterical WOMAN.

Table 16: Four example tweets that share the same parent tweet for both the abuse/no-abuse and explicit/implicit annotations.
Creation Date Bias
Figures 6 and 7 show the abusive and non-abusive class distribution over the creation dates of tweets sourced from HatEval’s dataset and OffensEval 2020 Training, for months with more than five tweets. Appendix E shows the full creation date distribution per class per dataset.
Figures 8 and 9 show the class distribution of explicit and implicit tweets per dataset. As argued in Chapter 3, it is relevant to look at these distributions in order to accurately identify possible bias in the @replies dataset. Figure 6 shows an even distribution of the abusive and non-abusive classes for the tweets sourced from HatEval, and Figure 7 shows the expected distribution of fewer abusive than non-abusive tweets in all months of OffensEval 2020 Training.
Figure 6: Abusive and non-abusive class distribution of the tweet creation dates for annotated tweets without bias sourced from HatEval’s dataset.
Figure 7: Abusive and non-abusive class distribution of the tweet creation dates for annotated tweets without bias sourced from OffensEval 2020 Training.
Figure 8: Explicit and implicit class distribution of the tweet creation dates for annotated tweets without bias sourced from HatEval’s dataset.
Figure 9: Explicit and implicit class distribution of the tweet creation dates for annotated tweets without bias sourced from OffensEval 2020 Training.
When looking at the distribution of the explicit and implicit classes, however, there is much more bias in both datasets. Figure 8 shows that explicitly (EXP) annotated tweets mainly come from June 2017, October 2017, and September 2019 for HatEval, and Figure 9 shows that all implicit tweets from OffensEval 2020 Training come from August 2019. For OffensEval this is to be expected, as there are only six implicitly annotated tweets.
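The per-month class counts underlying these figures can be computed by bucketing creation dates by year and month. A sketch with made-up timestamps (the real values come from the tweet metadata):

```python
from collections import Counter
from datetime import datetime

# Invented (created_at, class) pairs standing in for tweet metadata.
records = [
    ("2019-08-01", "IMP"),
    ("2019-08-15", "IMP"),
    ("2017-06-03", "EXP"),
    ("2017-10-20", "EXP"),
    ("2019-09-02", "EXP"),
]

# Count tweets per (year-month, class) bucket.
per_month = Counter(
    (datetime.strptime(d, "%Y-%m-%d").strftime("%Y-%m"), cls) for d, cls in records
)
```

Plotting these (month, class) counts as grouped bars yields distributions of the kind shown in Figures 6 through 9.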
Tweet Text Bias
There are three pairs of tweets that contain exactly the same tweet text. These duplicate-text pairs are included in the @replies dataset, which constitutes another possible bias. Inspection showed that they are typically short, highly generic tweets such as "@USER Shut the f*** up" (tweet IDs 1157827990597898242 and 1162113791817068566). Only one of the pairs of duplicate-text tweets shares the same previous tweet: in the case of the tweet with ID