Hate speech analysis on Twitter
With human annotation platform Swivl and BERT
SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE
Kimberley van Ruiven
10800395
Master Information Studies: Information Systems
Faculty of Science
University of Amsterdam
24 August 2020
1st Examiner: R. Gopalakrishna Pillai
2nd Examiner: Mr. F.M. Nack
“To hate
Is an easy lazy thing
But to love
Takes strength
Everyone has
But not all are
Willing to practice”
Hate speech analysis on Twitter
With human annotation platform Swivl and BERT
Kimberley van Ruiven, 10800395, University of Amsterdam
https://github.com/pastatimes/Master-Thesis-HateSpeech-Twitter

ABSTRACT
This study compared baseline classification algorithms to the pre-trained Google Transformer model BERT for detecting online political hate speech on Twitter over the course of six years, by creating a labelled corpus. The problem with online political hate speech is that it is often hard to define and therefore even harder to detect. Building on existing definitions of hate, the multi-label classification of tweets as HATE, OFFENSIVE and NEUTRAL and the comparison of baseline classification algorithms to Google BERT could offer new insights into the analysis of online hate. By extracting online political hate using keywords, and combining this with the manual analysis of non-expert annotators, a labelled corpus is built that can be given to non-pre-trained and pre-trained models, whose performance can then be measured in terms of f1-score, precision and recall. The impact of this research is that it offers a better understanding of the best performing model for detecting online hate and, hopefully, better insight into ways to create more racially and socially unbiased datasets.
Keywords: Natural Language Processing, hate speech detection, Twitter, human annotation, multi-label text classification, Swivl, machine learning, classification algorithms, BERT
1 INTRODUCTION
’....These THUGS are dishonoring the memory of George
Floyd, and I won’t let that happen. Just spoke to Governor
Tim Walz and told him that the Military is with him all the
way. Any difficulty and we will assume control but, when
the looting starts, the shooting starts. Thank you!’ — Donald
J. Trump (@realDonaldTrump), May 29, 2020¹
This is the first time that a tweet from a US President has been flagged under Twitter’s guidelines on ’the glorification of violence’². Factors that Twitter takes into consideration when labelling a tweet as an expression of toxic behavior are whether a group of people is attacked on
1 https://twitter.com/realDonaldTrump/status/1266231100780744704
2 https://help.twitter.com/en/rules-and-policies/glorification-of-violence
the grounds of characteristics that belong to the nature of their being, which could lead to hate-driven violence or intolerance³. Such socio-demographic factors include race, religious affiliation, ethnicity, national origin, age, disability, sexual orientation and gender. Stating that certain events of hate-driven violence can occur as a consequence of hate speech is a slippery slope, because what exactly does hate entail?
Over the years there have been several attempts to define hate speech. Nockleby [22] defines hate speech as ’any communication that disparages a person or a group on the basis of some characteristics such as race, color, ethnicity, gender, sexual orientation, nationality, religion or other characteristics’. Davidson [7], on the other hand, defines it as ’language that is used to express hatred towards a group or is intended to be derogatory, to humiliate, or to insult members of a group’. Another way to approach it is Salminen’s 2020 definition of hate speech. He describes (online) hate as:
’(...) the use of language that contains either hate speech targeted toward individuals or groups, profanity, offensive language, or toxicity – in other words, comments that are rude, disrespectful, and can result in negative online and offline consequences for the individual, community, and society at large.’ [27]
Expressing hate is nothing new. Heightened levels of hate speech can be observed in the run-up to historic tragedies⁴. The global rise of right-wing populism and its focus on immigration and Islam as a threat to national identity over the past few years have made hate speech a more common trait of ordinary politics⁵ [14][11]. In addition to this, the surge of social media has made it progressively easier to express notions of hate speech online [29]. One of the notable differences between offline and online hate is the scope of their impact. Cyberbullying, for example, a form of online hate directed against an individual, can in the worst cases lead to suicide [10]. The recent death of a
3 https://help.twitter.com/en/rules-and-policies/glorification-of-violence
4 https://www.un.org/en/holocaustremembrance/docs/pdf/Volume%20I/The%20Holocaust%20as%20a%20Guidepost%20for%20Genocide%20Detection.pdf
5 https://www.ohchr.org/EN/NewsEvents/Pages/DisplayNews.aspx?NewsID=25036&LangID=E
Japanese wrestler in response to being cyberbullied underlines this notion⁶. In a similar fashion, Williams et al. [33] used computational criminology to show the correlation between hate speech in tweets targeting race and religion and an increase in physical attacks against these groups in London over a period of eight months.
In the last few years alone there has been a rise in racially fuelled US hate crimes, according to a 2017 FBI report⁷ ⁸. Edwards and Rushin argued that a correlation exists between the outcome of the 2016 US presidential elections and a reported rise in hate crimes committed against ethnically marginalized groups across the United States [25]. Research by Müller and Schwarz showed similar findings: they argued that an increase in anti-Muslim sentiment can be noted since the start of Donald Trump’s campaign in counties with frequent Twitter use [20]. Hate crimes directed at Asian-Americans as a consequence of the 2020 coronavirus are another recent example of the implications of racially motivated violence⁹. Lastly, the recent murder of George Floyd and the ongoing worldwide protests underline both the gravity and the scope of the problems around racially motivated violence¹⁰ ¹¹.
Against a backdrop of rising hate crimes, it can be argued that there is a need for more profound methods for detecting different types of online hate speech. From 2017 to 2020 there have been continuous efforts to detect online hate across multiple fields, both inside and outside the academic world [7][21][17][3][18][16][24][30][12]. In 2018, for example, Kaggle launched a competition for classifying toxic online comments¹². Despite the fact that hate speech is an extensively researched topic, detecting its online use has proven a challenging task.
There are several factors that complicate the detection of online hate speech. In the first place, the notion of hate is often grounded in personal beliefs [28]. In the second place, no unanimous definition of hate speech has been used in the research on online hate speech detection
6 https://www.nytimes.com/2020/06/01/business/hana-kimura-terrace-house.html
7 https://www.npr.org/2017/11/13/563894761/fbi-data-shows-the-number-of-hate-crimes-are-rising
8 https://www.latimes.com/nation/la-na-fbi-hate-crimes-20181113-story.html
9 https://www.newyorker.com/news/letter-from-the-uk/the-rise-of-coronavirus-hate-crimes
10 https://www.justice.gov/hatecrimes/hate-crimes-case-examples
11 https://www.adl.org/education/resources/tools-and-strategies/george-floyd-racism-and-law-enforcement-in-english-and-en
12 https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/
[27], since there exist different types of hate. Hate can be aimed at individuals or groups and can at the same time take the shape of misogyny, xenophobia or racism. These factors make it challenging to build on existing studies that apply methods for detecting it [27]. Another factor that complicates research on hate speech is the lack of labeled data, which is often a requirement for developing hate speech detection methods [6][19]. A further complicating factor worth mentioning is that the legislation around hate speech is often unclear [5]. The US, for example, does not have a law on hate speech¹³. These factors complicate the detection of online expressions of hate. This research aims to provide more insight into ways to detect online hate speech for linguistic research.
Since the language that Donald Trump uses has been defined as right-wing populism, which has been shown to have a close connection to hate speech [4][11][8][15], and since he uses Twitter as his main channel of communication [4][14], his tweets are taken as the basis for the corpus created for this thesis. The connection to hate speech, in combination with his extensive collection of tweets and a following of 84.9M¹⁴ at the time of writing, makes him a suitable and relevant politician to investigate with regards to online hate. The research question of this thesis therefore is:
How can political hate speech in tweets be detected over the course of six years with supervised multi-class algorithms and Google BERT by comparing their metrics (precision, recall and F1-score)?
In order to answer the main question, several sub-questions
are formulated:
How can a human annotated dataset from tweets be generated in which bias is taken into consideration as a factor of influence on the labels?
Which scope for text-classification is most suitable with respect to hate speech detection in tweets?
Which features of a tweet (including linguistic structure) are helpful for identifying it as hate speech?
Which baseline classification algorithms can be used to generate a prediction result of a labeled dataset for hate speech detection?
13 http://www.ala.org/advocacy/intfreedom/hate
14 https://www.socialbakers.com/statistics/twitter/profiles/detail/25073877-realdonaldtrump
How can the baseline classification algorithms be optimized by adjusting the (hyper) parameters for obtaining the highest performance metrics (precision, recall and F1-score) for multi-label text-classification?
How can the Transformer model BERT Base be optimized by adjusting the (hyper) parameters for obtaining the highest performance metrics (precision, recall and F1-score) for multi-label text-classification?
The contributions of this research are the creation of a human-labeled corpus and the comparison of baseline classification algorithms to the Google Transformer model BERT. The approach adopted to answer the research question yields the detection and analysis of hate in a corpus of US political tweets and an overview of the highest performing model for this classification task.
This thesis is organized as follows: Section 2 describes related
work on hate speech legislation, hate speech definitions, bias
in datasets and existing methods for detecting online hate
speech. Section 3 describes the methodology with regards to
the dataset, the pre-processing steps, the extracted features
and classification models and the BERT Base Transformer
model. Section 4 describes the results of the classification
models and the BERT model in terms of their performance in
comparison to each other. Section 5 describes the discussion
and conclusion with respect to gaps in the thesis project and
future work.
2 RELATED WORK
2.1 Legislation on hate speech
First and foremost, an important factor to take into consideration with regards to analyzing US hate speech is the legislation that surrounds it. One of the reasons hateful tweets are challenging to restrict is that restrictions often conflict with the freedom of expression, which varies per country [5]. The US First Amendment, for example, protects the right of free speech. This means that expressions of hate cannot be punished under US law¹⁵ [13]:
’Speech that demeans on the basis of race, ethnicity, gender, religion, age, disability, or any other similar ground is hateful; but the proudest boast of our free speech jurisprudence is that we protect the freedom to express “the thought that we hate.” ’¹⁶
15 https://www.thefire.org/issues/hate-speech/
16 http://www.ala.org/advocacy/intfreedom/hate
This regulation contradicts hate speech laws in other liberal democracies, which often have an anti-hate clause as part of their constitution [13]. Within an international context, these differences have led to a distinction within the international human rights convention, the International Bill of Human Rights. In particular, an opposition can be observed between two bodies of law that are part of this convention: international human rights law and international criminal law, as the latter is more in line with the US First Amendment in how it addresses hate. This difference in international legislation results from the fact that both the US and other liberal democracies have tried to shape the concept of freedom of expression within this law, but the influence of the US weighs more heavily [13]. The consequence is that, under international criminal law, hate speech cannot be punished. Therefore, US hate speech cannot be punished according to either national or international standards.
2.2 Definitions of hate speech
With regards to the notion of online hate speech, several definitions have been used in previous research. The following figure and table constitute an overview of the definitions used in research from 2017 to 2020, their target (group, individual or both), followed by the author and year [7][18][22][2][27], and their possible consequences in terms of harm [27]. The definition used in this research is the one from Salminen [27]. In this thesis, the emphasis lies on hate targeted at a group, in which a distinction is made between hate speech and profanity.
Figure 1: Degrees of harm [27]
2.3 Hate speech or profanity?
Figure 2: Overview of online hate definitions
Hate speech and offensive language are closely related and hard to distinguish, not only because words are often ambiguous but also because the meaning of both hate speech and offensive language rests on small linguistic characteristics [7].
In previous research on hate speech detection, Basile et al. [1] created conditions for hate, offensive and neutral language that were used by human annotators to build a labeled corpus of tweets. These descriptions provide a better understanding of the differences between hate and profanity. Summarized, they state that hate speech is present in a tweet when the following requirements are met:
A. the tweet content MUST have X as main TARGET, or even a single individual, but considered for his/her membership in that category (and NOT for the individual characteristics);
B. we must deal with a message that spreads, incites, promotes or justifies HATRED OR VIOLENCE TOWARDS THE TARGET, or a message that aims at dehumanizing, hurting or intimidating the target.
They state that offensive language is present in a tweet when the requirements below are met:
A. the tweet content MUST have X as main TARGET, or even a single individual, but considered for his/her membership in that category (and NOT for the individual characteristics);
B. we must deal with a message that focuses on the potentially HURTFUL EFFECT of the tweet content on a given target.
Another way to approach the concept of hate is by grouping hate speech and offensive language under the same definition. Waseem & Hovy [32] do not make a distinction between hate speech and offensive language. They state that a tweet is offensive, and therefore can be considered hate speech, when it:
1. uses a sexist or racial slur;
2. attacks a minority;
3. seeks to silence a minority;
4. criticizes a minority (without a well founded argument);
5. promotes, but does not directly use, hate speech or violent crime;
6. criticizes a minority and uses a straw man argument;
7. blatantly misrepresents truth or seeks to distort views on a minority with unfounded claims;
8. shows support of problematic hashtags, e.g. “BanIslam”, “whoriental”, “whitegenocide”;
9. negatively stereotypes a minority;
10. defends xenophobia or sexism;
11. contains a screen name that is offensive, as per the previous criteria, the tweet is ambiguous (at best), and the tweet is on a topic that satisfies any of the above criteria.
It can be noted that in previous research on hate speech detection, not only the definitions of hate but also the annotation guidelines for detecting hate or profanity differ. This increasingly complicates research aimed at improving existing methods.
2.4 Racial and social bias in existing hate speech datasets
Before starting the actual research and creating a corpus of US political tweets, a short note on research about racial and social bias in labeled data is in order. Davidson et al. [6] argued that existing hate speech and abusive language detection datasets, built to detect hate aimed at minority groups, persistently show a racial bias towards African-Americans. They explain that using a dataset that is biased in this manner will result in discrimination against this group by the algorithm derived from it. Examples are when a corpus labels misogynistic data as neutral or offensive language instead of hate, or when slang is mistakenly labeled as hate speech [6].
Davidson et al. state that racial or social bias can be a consequence of keyword-based data collection. The choice for certain keywords can, for example, lead to the over-representation of specific minority groups within the corpus under a specific label. This can be observed in the Hatebase lexicon, an online database of language that has been defined as hate based on keywords, in which African-American English is over-represented [6]. They state that bias can be avoided or reduced by re-examining the keywords used to create a corpus of hateful data. For example, specific words such as the "n-word" do not necessarily have a hateful intent towards groups of people. Rather, this depends much more on the context and, above all, on the user of the word, given the historical origin of the term.
Furthermore, Davidson et al. state that, when human-annotated data is chosen, the annotators bring with them their own bias [6]. According to their research this bias can be reduced to some extent by working with several different domain experts, such as activists in a certain field, in contrast to crowd workers or academics.
Lastly, Waseem [31] conducted a study on annotator influence on hate speech detection in tweets, focusing on racism and sexism. He found that algorithms trained on expert annotations outperform algorithms trained on non-expert annotations. He also found that expert annotators, whom he refers to as ’feminist and anti-racist activists’, were less likely to label a tweet as hate speech than non-expert annotators. These findings are in line with the research of Davidson [6].
2.5 Methods for detecting online hate speech
2.5.1 Lexical methods
The use of online hate speech is widespread across multiple platforms, but this study will focus solely on Twitter. One way to detect hate speech in tweets is by using keywords to identify hateful language. For this purpose, the hate speech lexicon Hatebase was developed: a database of hateful words that can be used to filter hate speech from social media platforms, with the goal of predicting regional violence as a consequence of hate speech¹⁷.
Davidson et al. [7] used the Hatebase lexicon as keywords to crawl Twitter data. They obtained 25k tweets containing hateful words from this repository, which they then, for accuracy purposes, had annotated by human workers at Appen [7]. Of all filtered hate tweets, only 5% were labelled by human annotators as hate speech, which shows the imprecision of Hatebase and the problem with lexical methods. Schmidt & Wiegand [29], however, found that lexicon resources can give promising results when used in combination with other methods.
2.5.2 Supervised multi-label classification
In the field of online hate speech detection, several relevant studies have been conducted. In 2017, Malmasi & Zampieri [17] did a multi-label study on methods for detecting hate speech on social media, in which they made a distinction between hate and profanity. The dataset they used consisted of tweets they annotated with the categories ’HATE’, ’OFFENSIVE’ and ’OK’. They applied standard lexical features and a linear SVM classifier as a starting point for their study. Their results show that a character 4-gram model gave the best results, but that discriminating hate speech from profanity remains a difficult task. In 2018 they conducted the same research with a different dataset [16]. They conclude again that discriminating hate speech from profanity is very hard
17 https://hatebase.org/
and that the standard features they applied might not be
enough for telling them apart with high accuracy.
Davidson et al. [7] did a 2017 study on multi-class methods for detecting hate speech. They defined the labels HATE, PROFANITY and NEUTRAL language and found that words have the biggest probability of being defined as hate when containing ’multiple racial and homophobic slurs’ [7]. They also found that hate speech labelling, in particular distinguishing it from offensive language, reflects people’s own racial or social bias.
The 2018 research of Martins et al. [18] builds partly on the findings of Malmasi & Zampieri. They did a study using NLP techniques and emotion analysis for hate speech classification based on the labels HATE, PROFANITY and NEITHER. In their study they applied a combination of lexicon-based and machine learning approaches to forecast hate speech. They conclude that the inclusion of emotions into the model improved the performance on hate speech detection.
Lastly, in 2019 Ibrohim & Budi [12] used multi-label text-classification for hate speech detection with machine learning approaches such as Support Vector Machine (SVM), Naïve Bayes (NB) and Random Forest (RF), combined with feature extraction. They conclude that the Random Forest classifier gives the best results.
2.5.3 BERT Base for classification tasks
In 2018 Google developed BERT Base, a pre-trained Transformer model trained on the Wikipedia and BookCorpus datasets. It can therefore be used relatively easily for classification tasks such as hate speech detection, and it is freely available on GitHub¹⁸ [27]. This means that BERT, in contrast to older classification models, only has to be fine-tuned by initializing it with pre-trained parameters. All parameters, in their turn, are fine-tuned using labeled data for particular goals [9]. This enables BERT to outperform older classification models that have to be built and trained from scratch. The findings of [9] show that on a dataset of tweets for binary classification, BERT outperforms the traditional approaches to classification.
Mozafari et al. came to a similar conclusion: BERT Base outperforms previous models for hate speech detection in tweets in terms of precision, f1-score and recall [19]. For this research, they classified tweets based on five labels. To make optimal use of BERT Base, they fine-tuned the parameters by training the classifier. Lastly, they suggest that
18 https://colab.research.google.com/drive/1Y4o3jh3ZH70tl6mCd76vz_IxX23biCPP#scrollTo=
using the BERT Base model could lead to fruitful insights in future studies into reducing the bias in hate speech datasets.
Finally, Salminen et al. did a study on an online hate speech detection classifier for multiple platforms. They compared BERT to other classification models and came to the conclusion that the features extracted by BERT are the most important for forecasting an outcome [27].
3 METHODOLOGY
3.1 Dataset
For the purpose of this research the tool Trump Twitter Archive¹⁹ was used. This is a large unfiltered database that contains every tweet of the Twitter account @realDonaldTrump from 2009 until today. It enables a user to search for tweets during a specific time period based on keywords. The Trump campaign started on June 16th, 2015; this is the starting point of the dataset. The dataset was searched on May 12th, 2020, which is the end date used to analyze tweets from the @realDonaldTrump account. Each tweet is represented by its tweet ID, text, date, hashtags and possible retweet.
The total number of tweets is 24,137, and the number of tweets per year is:

Year  Tweets
2015  7680
2016  3940
2017  2232
2018  2998
2019  5936
2020  1351

3.2 Keyword-based filtering
Existing hate speech lexicons such as Hatebase.org were not used for this research, due to inconsistencies and contradictions in their labelling process [7]. Instead, the tweets were filtered based on keywords that could potentially be associated with higher levels of hate speech. Given the limited time scope of this research, only tweets that could be hateful towards groups of people were analyzed; hateful language aimed at a single individual was not taken into consideration. The groups of people mentioned in Trump’s tweets that are often associated with higher levels of hate are immigrants, ’radical Islamic terrorism’ and specific countries or nationalities, such as China [27][32][15][23][11]. Therefore, the keywords used to filter this dataset are:
19 http://www.trumptwitterarchive.com/
- Mexico - Mexican - Mexicans - immigrant - immigrants
- Hispanic - Hispanics - Muslim - Muslims - Islam
- rapist - rapists - ISIS - gang member - gang members
- blacks - why don’t they go back - bring them back to where they came from
- these aren’t people - human traffickers - coyotes - invasion
- infestation - abuser - shithole countries - terrorist - terrorists
- rats - migrant - migrants - refugee - refugees
- congresswomen - cartel - cartels - China - Chinese virus
- viciously telling the people of the United States
- you can’t leave fast enough
As can be noted, most keywords are single words; however, several sentences are included in this list as well. These sentences are:
- why don’t they go back
- bring them back to where they came from
- these aren’t people
- viciously telling the people of the United States
- you can’t leave fast enough
These sentences refer to several tweets in which Trump expressed himself about four members of Congress with respect to their ethnic backgrounds²⁰ ²¹:
So interesting to see “Progressive” Democrat Congresswomen, who originally came from countries whose governments are a complete and total catastrophe, the worst, most corrupt and inept anywhere in the world (if they even have a functioning government at all)
.... now loudly and viciously telling the people of the United States, the greatest and most powerful Nation on earth, how our government is to be run. Why don’t they go back and help fix the totally broken and crime infested places from which they came. Then come back
.... and show us how it is done. These places need your help badly, you can’t leave fast enough. I’m sure that Nancy Pelosi would be very happy to quickly work out free travel arrangements!
After filtering the tweets based on these keywords, a dataset of 1031 tweets remained. This number of tweets was chosen deliberately, given the time period for this research and the fact that all data had to be annotated within this period by separate unpaid annotators. A sketch of this filtering step is given below.
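To make the filtering step concrete, the following minimal sketch applies a case-insensitive keyword match to an exported tweet archive. It is an illustration, not the thesis’ actual code: the file name trump_tweets_2015_2020.csv, the column name text and the shortened keyword list are assumptions.

```python
# Sketch of the keyword-based filtering step (assumed file/column names).
import re

import pandas as pd

KEYWORDS = [
    "Mexico", "immigrant", "Hispanic", "Muslim", "Islam", "rapist",
    "ISIS", "gang member", "invasion", "infestation", "terrorist",
    "migrant", "refugee", "cartel", "Chinese virus",
    "why don't they go back", "these aren't people",
]  # shortened; the full keyword list is given above

tweets = pd.read_csv("trump_tweets_2015_2020.csv")  # hypothetical export

# Build one case-insensitive pattern; re.escape guards regex metacharacters.
pattern = "|".join(re.escape(k) for k in KEYWORDS)
mask = tweets["text"].str.contains(pattern, case=False, regex=True, na=False)
filtered = tweets[mask]
print(f"{len(filtered)} potentially hateful tweets kept")
```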
20 https://www.nbcnews.com/politics/donald-trump/trump-says-progressive-congresswomen-should-go-back-where-they-came-n1029676
21 https://www.theatlantic.com/magazine/archive/2020/09/the-end-of-denial/614194/
The number of tweets per year after filtering is:

Year  Tweets
2015  104
2016  122
2017  109
2018  166
2019  548
2020  72
A boxplot of the data after filtering can be noted in Figure 3:
Figure 3: Boxplot filtered tweets 2015 to 2020
The boxplot shows the length of tweets over the years 2015 to 2020. It can be noted that the length of Donald Trump’s tweets changed in this period. In 2017, the maximum number of characters that a tweet can hold changed from 140 to 280²². This is reflected in his tweets, with a clear distinction in 2017, which shows the most outliers; the gradual shift in character length over the course of that year could explain this. It can furthermore be concluded that in all years Donald Trump made use of the maximum character length. This factor could affect the performance of the models.
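Continuing the previous sketch, a boxplot along the lines of Figure 3 could be produced as follows; the date column name is again an assumption.

```python
# Sketch: tweet length (in characters) per year, as in Figure 3.
import matplotlib.pyplot as plt

lengths = filtered.assign(
    year=pd.to_datetime(filtered["date"]).dt.year,  # "date" column assumed
    length=filtered["text"].str.len(),
)
lengths.boxplot(column="length", by="year")
plt.ylabel("Tweet length (characters)")
plt.suptitle("")  # drop pandas' automatic grouping title
plt.title("Filtered tweets 2015 to 2020")
plt.show()
```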
3.2.2 Human annotation based on multiple classes with Swivl
The next step of pre-processing is annotating the data. The data was annotated manually by three people with backgrounds in, respectively, Literature Studies, Communication & Media Studies and Spatial Planning. These people were selected because their studies are relevant to language or to the topics mentioned in the tweets, with the exception of the Spatial Planning student. Other factors that made these annotators relevant are their near-native English proficiency and the diversity of their backgrounds. Three annotators were chosen so that a majority decision can be used to assign a label to each tweet.
22 https://techcrunch.com/2017/11/07/twitter-officially-expands-its-character-count-to-280-starting-today/
In line with the research of Davidson [7], the data was labeled according to three classes: HATE SPEECH, OFFENSIVE LANGUAGE and NEUTRAL. The annotators were given the task to look at each tweet separately and rate it on a scale from 0 to 2, where HATE gets the label 0 and NEUTRAL the label 2. To make the annotation process a slightly less laborious and time-consuming task, the data labeling platform Swivl was used. This platform provides an interface for the annotators to label the data according to existing hate speech guidelines, adapted to fit the purpose of this research project. These guidelines are attached in Appendix C, Annotation Guidelines.
For each annotator an account was set up with which they could log in to annotate tweets in pools divided over the years, referred to as tasks. This way, the annotators could see how many tweets a task consisted of, so they could estimate per task how long it would take them. A period of two and a half weeks was reserved for the data labelling process. The majority vote was calculated by the Swivl platform to assign a label to each tweet. Corrections were done on Swivl’s end to ensure that each tweet was assigned a label; they were applied when a label did not have a majority vote, that is, when the votes were distributed evenly over the three labels. In this way it was ensured that every tweet in the dataset was classified based on a majority vote.
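The majority-vote rule can be summarized in a few lines. The sketch below is an illustration of that rule under the 0–2 label encoding described above, not Swivl’s implementation.

```python
# Sketch of the majority-vote rule used to assign one label per tweet.
from collections import Counter

HATE, OFFENSIVE, NEUTRAL = 0, 1, 2

def majority_label(votes):
    """Return the label chosen by at least two of the three annotators,
    or None for a three-way tie (resolved manually on Swivl's end)."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None

print(majority_label([HATE, HATE, NEUTRAL]))       # 0 -> HATE
print(majority_label([HATE, OFFENSIVE, NEUTRAL]))  # None -> manual correction
```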
3.2.3 Pre-processing
Scikit-learn was used to clean the dataset and build the pipeline. The pre-processing steps that were taken are:
- Removal of hashtags, URLs and retweets, as these are irrelevant for this hate speech task.
- Lowercasing of all tweets, which converts each string of text to lowercase.
- Tokenization, which cuts a string of text into tokens.
- Removal of stop-words, to reduce the number of noisy features.
- Removal of punctuation and excess whitespace, which are irrelevant for this text-classification task.
- Stemming with PorterStemmer, which reduces words to their stem and therefore improves performance.
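A sketch of these steps, using NLTK for tokenization, stop-words and stemming, could look as follows; it mirrors the listed pipeline but is not the thesis’ exact code.

```python
# Sketch of the pre-processing pipeline described above.
import re
import string

from nltk.corpus import stopwords        # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(tweet):
    # Remove retweet markers, URLs and hashtags (irrelevant for this task).
    tweet = re.sub(r"\bRT\b|https?://\S+|#\S+", " ", tweet)
    # Lowercase and strip punctuation; excess whitespace disappears on tokenization.
    tweet = tweet.lower().translate(str.maketrans("", "", string.punctuation))
    # Tokenize, drop stop-words, and stem the remaining tokens.
    return [stemmer.stem(tok) for tok in word_tokenize(tweet)
            if tok not in stop_words]

print(preprocess("RT Totally broken and crime infested! #MAGA https://t.co/x"))
```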
3.2.4 Simple Feature Extraction
For this hate speech detection study, general features for text-classification of tweets were used to help the baseline models identify hate speech. The textual features that were extracted are mostly syntactic and can be divided into three categories:
- N-grams. An n-gram is a sequence of words or characters of a fixed length and can contain a sequence of ’n’ words or ’n’ characters. For this study, word n-grams were extracted, weighted by their TF-IDF.
- TF-IDF. This is a numerical feature that measures the relevance of specific words in a dataset. In this way, it becomes clear which words carry the most, or least, meaning.
- Bag of Words (BoW). This feature can be used with TF-IDF to collect text elements and convert them to tokens.
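These three feature types map directly onto scikit-learn, as the short sketch below shows: CountVectorizer for Bag of Words counts and TfidfVectorizer for TF-IDF-weighted word n-grams. The toy documents and the (1, 3) n-gram range are illustrative.

```python
# Sketch: Bag of Words counts and TF-IDF-weighted word n-grams.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "we need strong immigration laws",
    "mexico will pay for the wall",
]

bow = CountVectorizer().fit_transform(docs)  # Bag of Words: raw token counts
tfidf = TfidfVectorizer(ngram_range=(1, 3))  # unigrams through trigrams
features = tfidf.fit_transform(docs)         # TF-IDF-weighted n-gram features
print(features.shape)                        # (n_documents, n_ngram_features)
```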
For each year, a bar was plotted to display how balanced
the classes are. An overview of class balance for all years is
displayed in Figure 4:
Figure 4: Class distribution 2015-2020
3.3 Baseline classification algorithms
As baseline models for this research, Random Forest, Logistic Regression, Linear SVM and Naïve Bayes were chosen to generate prediction results on the labeled dataset for hate speech detection. These classification algorithms have been used in previous research on hate speech detection and generated positive prediction results [17][7][12]. An ensemble classifier of Logistic Regression and Linear SVM with hard voting was also considered, but was not used in this research because its performance matched the outcome of the individual models, with an accuracy of 85%. A workflow of the selected baseline models is depicted in Figure 5:
Figure 5: Workflow of the classification algorithms
3.3.1 Text classification baseline setup
To ensure a good baseline for hate speech detection with text-classification, the baselines were prepared as follows. The training dataset was split into five randomly shuffled folds and stratified cross-validation was used to measure performance. As can be observed in Figures 8 to 13 in Appendix A.1, Overview Class Balance, the classes for every year are unequally distributed. Therefore, Stratified K-fold was a suitable choice, not only due to the small sample size of the dataset, but also because this variant of cross-validation ensures that the fold samples do not imbalance the training data [26]. The scikit-learn functionality TfidfVectorizer was used to tokenize the features of the corpus; with this functionality, a text document can be converted to a list of tokens that can be given to the model. The parameter sublinear_tf was set to True to apply sublinear term-frequency scaling (replacing tf with 1 + log(tf)), which reduces the bias generated by the length of the document. Use_idf was also set to True so that the algorithm uses Inverse Document Frequency: terms that appear in many documents receive a lower weight than terms that appear less frequently but in specific documents. The vector norm was set to L2 to further reduce document-length bias, and stop-words were set to English. By walking through these steps, each tweet was represented by a set of features. The experiments were run ten times.
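Putting the above together, the sketch below evaluates the four baseline models with the described vectorizer settings under stratified five-fold cross-validation. The toy texts and labels are stand-ins for the annotated corpus; everything else follows the parameters stated above.

```python
# Sketch of the baseline setup: TF-IDF features + stratified 5-fold CV.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Stand-ins for the annotated corpus; 1=OFFENSIVE, 2=NEUTRAL.
texts = ["build the wall now"] * 10 + ["great rally tonight"] * 10
labels = [1] * 10 + [2] * 10

vectorizer = TfidfVectorizer(
    sublinear_tf=True,   # replace tf with 1 + log(tf)
    use_idf=True,        # down-weight terms occurring in many documents
    norm="l2",           # reduce document-length bias
    stop_words="english",
)

models = {
    "RandomForest": RandomForestClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
    "MultinomialNB": MultinomialNB(),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(make_pipeline(vectorizer, model),
                             texts, labels, cv=cv)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```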
3.3.2 Performance measurement
The output of each model was shown in boxplots of model accuracy. The highest performing model was evaluated by displaying the actual versus the predicted labels in a confusion matrix. After that, the misclassifications and their causes were outputted, the terms most correlated with each label were shown and, finally, the precision, recall and f1-score were displayed for each of the labels. As precision and recall were ill-defined, because some labels had no predicted samples, the parameter zero_division of the function metrics.classification_report was set to 1.
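Continuing the previous sketch, this evaluation step could look as follows; zero_division=1 silences the ill-defined-metric warning for labels that received no predictions.

```python
# Sketch: confusion matrix and per-label metrics for the best model.
from sklearn import metrics
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, stratify=labels, random_state=42)

best = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
best.fit(X_train, y_train)
predicted = best.predict(X_test)

print(metrics.confusion_matrix(y_test, predicted))
print(metrics.classification_report(y_test, predicted, zero_division=1))
```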
3.4 BERT Base
For every year, and for all years combined, the performance of the baseline models was compared to the pre-trained Transformer model BERT. As running BERT requires a GPU, Google Colab was used, since it provides free access to different GPUs and TPUs²⁴ ²⁵. The GPU used for this research is a Tesla K80. In contrast to the classification algorithms, BERT extracts its own features.
3.4.1 Text classification setup BERT
To obtain optimal performance, BERT was fine-tuned by adjusting its parameters and hyperparameters. The relevant hyperparameters are the batch size, the learning rate and the number of epochs. In contrast to the baseline models, a 90-10 split was used to create training and validation samples. In this way, the dataset could be split randomly and given to the model so that it could be validated on unseen data. For BERT, every year from 2015 to 2020 was trained and tested on separately, as well as all years combined. In contrast to the classification models, BERT is pre-trained, which means that after importing the data, BERT only has to be fine-tuned. As BERT works differently from these models, the Hugging Face library was installed and the model was trained on a specific column to classify the labels: class_of_interest = 'intent_id'. After this, the data is cleaned and explored based on the intent_id. Training and test data are then extracted from the file, and by comparing the class_of_interest and the value_id before and after extraction, it is verified that the training dataset is not significantly different from the whole dataset. Next, the BertTokenizer is loaded and the maximum sentence length is printed. The sentences are tokenized, the tokens are mapped to their word IDs, and training and validation samples are extracted. The batch size is set to 16, as this gave a better result than a batch size of 64, the number of epochs is set to 4 and the learning rate is left at its default. The seed value is set to 42.
24 https://research.google.com/colaboratory/faq.html
25 https://colab.research.google.com/drive/1Y4o3jh3ZH70tl6mCd76vz_IxX23biCPP#scrollTo=5llwu8GBuqMb
The accuracy, f1-score and validation loss were calculated for every epoch.
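A condensed sketch of this fine-tuning loop is given below. It follows the standard Hugging Face recipe rather than the thesis’ exact notebook; the toy tweets, the 3-label head and the 2e-5 learning rate (the thesis leaves the rate at its default) are assumptions.

```python
# Sketch: fine-tuning BERT Base for 3-class tweet classification.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification, BertTokenizer

torch.manual_seed(42)  # seed value 42, as in the thesis
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-ins for the annotated tweets; 0=HATE, 1=OFFENSIVE, 2=NEUTRAL.
texts = ["these aren't people", "crooked media", "great rally tonight"] * 8
labels = [0, 1, 2] * 8

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3).to(device)

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(labels))
loader = DataLoader(dataset, batch_size=16, shuffle=True)  # batch size 16

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # assumed rate
model.train()
for epoch in range(4):  # 4 epochs, as in the thesis
    total = 0.0
    for input_ids, attention_mask, y in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids.to(device),
                    attention_mask=attention_mask.to(device),
                    labels=y.to(device))
        out.loss.backward()  # cross-entropy loss returned by the model head
        optimizer.step()
        total += out.loss.item()
    print(f"epoch {epoch + 1}: avg training loss {total / len(loader):.2f}")
```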
3.4.2 Performance measurement
The output for every year was shown through the performance metrics accuracy and f1-score. The training loss and validation loss were also outputted to monitor whether the model was over- or underfitting.
4 RESULTS
The creation of a human-labeled corpus with the human annotation platform Swivl showed that annotator agreement is not entirely equally distributed. The use of guidelines for this annotation task, and the choice of annotators from various backgrounds, shows that racial bias can be reduced to some extent, as can be noted in the misclassifications of hate versus offensive language.
With regards to the most suitable scope for text-classification for hate speech detection in tweets, sentence-level is the most appropriate, as the classes are assigned to separate tweets.
4.1 Results baseline classification algorithms

2020 Labeled Trump tweets
For 2020, 72 sentences were given to the models and 600 different features were extracted. All baseline models were trained with stratified five-fold cross-validation. For all models, the second and third parameters were set to shuffle=True and random_state=42. Only for the year 2020 are the tables, boxplot and confusion matrix included in the text; for the other years, the figures are included in Appendix B.1 Unigrams, bigrams & trigrams for hate speech, B.2 Boxplots and B.3 Confusion matrices.
It can be noted that each fold divides the data in equal parts and that the class distribution of the labels is preserved in the splits. The highest performing model is Logistic Regression with an accuracy score of 60%. The boxplot in Figure 6 shows the accuracy of all models for 2020. As Logistic Regression was the highest performing model, the data was trained and tested on this model and the performance is shown in the confusion matrix for 2020 in Figure 7. This figure shows that Neutral Language was predicted correctly in 13 cases, Hate Speech 0 times and Offensive Language 3 times. It is highly likely that this is due to the size of the data, which consisted of only 72 sentences. The word n-grams most often correlated to hate speech are:
Top unigrams: deaths, far
Top trigrams: H1N1 Swine Flu, Failing New York
The table below shows the performance of Logistic Regression in terms of precision, recall and f1-score. It can be noted that for the labels hate speech and offensive language the output is in some instances zero, which means there were no true positives. A recall of 1 combined with low precision would mean that all tweets are assigned the label hate speech; a precision of 1 means no false positives, which here comes with a recall of 0.
precision recall f1-score
Neutral 0.73 0.79 0.76
HateSpeech 0.00 0.00 0.00
OffensiveLanguage 0.38 0.20 0.43
Figure 6: Boxplot accuracy scores 2020
Figure 7: Confusion matrix 2020
2019 Labeled Trump tweets
For 2019, 458 sentences were given to the models and 2968 features were extracted. For the three classes, the most correlated unigram, bigram and trigram features were extracted and weighted by their TF-IDF. Five-fold stratified cross-validation was used to train and test the models. The highest performing model is MultinomialNB with an accuracy of 61%. The boxplot in Figure 14 in Appendix B.2 shows the accuracy of all models for 2019. The data was tested on the highest performing model, MultinomialNB. The word n-grams for hate speech are attached in Appendix B.1. The table below shows the performance of MultinomialNB and Figure 19 displays the corresponding confusion matrix. It can be noted that Neutral was most often correctly predicted, with 80 examples, followed by OffensiveLanguage with 5 examples; Hate Speech was predicted correctly in 1 example. In the table it can be noted that for hate speech a precision of 33% is achieved, which corresponds with a low false positive rate. The recall of 6% for this label is low, as it implies that only a small share of the tweets in the actual class was retrieved. Lastly, the f1-score is the harmonic mean of precision and recall (F1 = 2PR / (P + R)); here it is low at 11%.
precision recall f1-score
Neutral 0.56 0.99 0.71
HateSpeech 0.33 0.06 0.11
OffensiveLanguage 0.83 0.09 0.16
Lastly, the misclassified examples were displayed. This output shows that ’OffensiveLanguage’ was predicted as ’Neutral’ in 49 examples, while hate speech was predicted as Neutral in 14 cases.
2018 Labeled Trump tweets
For 2018, 166 sentences were given to the models and 834 features were extracted. Word n-grams were extracted and stratified five-fold cross-validation was applied. The highest performing model was LinearSVC with an accuracy of 64%. The boxplot in Figure 15 in Appendix B.2 shows the accuracy of all models for 2018. The table below shows the performance of LinearSVC and Figure 20 displays the corresponding confusion matrix. It shows that Neutral Language was predicted correctly in 33 cases, Offensive Language 3 times and Hate Speech 5 times. It can be noted that for hate speech the f1-score is higher than for the previous years, which corresponds with the correctly predicted cases of Hate Speech.
precision recall f1-score
Neutral 0.79 0.97 0.87
HateSpeech 1.00 0.42 0.59
OffensiveLanguage 0.38 0.33 0.35
2017 Labeled Trump tweets
For 2017, 109 sentences were given to the models and 296 features were extracted. The highest performing model is MultinomialNB with an accuracy of 55%. The boxplot in Figure 16 in Appendix B.2 shows the performance of the different models and Figure 21 in Appendix B.3 shows the confusion matrix. Hate Speech was correctly predicted 0 times, Offensive Language in 4 cases and Neutral Language 15 times. In the table below the metrics are displayed, which show no true positives for Hate Speech:
precision recall f1-score
Neutral 0.48 1.00 0.65
HateSpeech 1.00 0.00 0.00
OffensiveLanguage 0.80 0.36 0.50
2016 Labeled Trump tweets
For 2016, 122 sentences were given to the models and 291 features were extracted. The highest performing model was MultinomialNB with an accuracy score of 53%. The boxplot in Figure 17 in Appendix B.2 shows the accuracy scores of all models and Figure 23 in Appendix B.3 shows the confusion matrix. Hate Speech was predicted correctly in 7 cases, Offensive Language in 14 cases and Neutral in 0 cases. The table below shows the metrics:
precision recall f1-score
Neutral 1.00 0.00 0.00
HateSpeech 0.50 0.47 0.48
OffensiveLanguage 0.52 0.78 0.62
2015 Labeled Trump tweets
For 2015, 104 sentences were fed to the models and each of these tweets is represented by 234 features. The highest performing model is LinearSVC with an accuracy of 46%. A boxplot of the scores is shown in Figure 18 in Appendix B.2 and the corresponding confusion matrix, Figure 22, in Appendix B.3. Offensive Language was predicted correctly 6 times, Hate Speech 5 times and Neutral Language 4 times. Offensive Language was wrongly predicted as Neutral in 12 cases. The table below shows the metrics:
precision recall f1-score
Neutral 0.21 0.80 0.33
HateSpeech 0.83 0.45 0.59
OffensiveLanguage 0.60 0.32 0.41
4.2 Results BERT Base

2020 Labeled Trump tweets
For 2020, as the difference between the training and test data was not significant, the data was used to train the model. The maximum sentence length is 83; 74 training samples and 5 validation samples were created. After training, the outcome of the epochs showed that the training loss and validation loss were high in the first epoch and lower in the fourth, but that the validation loss was slightly lower than the training loss. This could be a result of over-regularizing the model. The table below shows the relevant metrics:

Epoch  Accuracy  F1-score  Avg. training loss  Validation loss
1      0.29      0.21      1.14                1.11
2      0.57      0.24      1.00                0.98
3      0.57      0.24      0.99                0.97
4      0.43      0.20      0.94                0.99
The years 2016 to 2019 are included in Appendix B.4 BERT. Only 2020 and 2015 are shown in this section, as these years lie furthest apart and were therefore compared.
2015 Labeled Trump tweets
For 2015, the output after exploring the data is not significantly different from the whole dataset and therefore suitable for training the model. The maximum sentence length is 49. Furthermore, 83 training samples and 10 validation samples were created. After training, the outcome of the epochs showed:

Epoch  Accuracy  F1-score  Avg. training loss  Validation loss
1      0.50      0.22      1.11                0.94
2      0.60      0.44      0.93                0.89
3      0.60      0.44      0.86                0.84
4      0.60      0.44      0.78                0.82
It can be noted that the accuracy scores for 2015 are slightly higher than those for 2020, but that the difference in f1-score is more notable. The difference between the average training loss and the validation loss remains similar.
5 DISCUSSION
As described in Section 3, a human-annotated dataset can be generated from tweets in which bias is taken into consideration. This is possible by means of guidelines in combination with expert annotators. When focusing on the wrongly predicted examples, it is difficult to say whether these are a result of personal bias. Examples of HATE SPEECH predicted as NEUTRAL were:
- We are winning big time against China. Companies jobs
are fleeing. Prices to us have not gone up and in some cases
have come down. China is not our problem though Hong
Kong is not helping. Our problem is with the Fed. Raised too
much too fast.
- Wall Street Journal: More migrant families crossing into the U.S. illegally have been arrested in the first five months of the federal fiscal year than in any prior full year. We are doing a great job at the border but this is a National Emergency!
With regards to the scope for text-classification, it can be said that sentence-level is a suitable scope for hate speech detection, both for the baseline models, which predicted sentences of HATE SPEECH, and for BERT Base Uncased, since it can be used for sequence classification.
Furthermore, regarding the baseline models, it can be said that their performance on the separate years was rather low. The best performing baseline models for generating a prediction result for the years 2015 to 2020 were MultinomialNB and Logistic Regression. The highest results for HATE SPEECH were obtained for the year 2019, with an accuracy of 61% and a precision of 70% with MultinomialNB. Since a high precision corresponds with a low false positive rate, the model returns in this case more relevant than irrelevant results. For 2020, however, where only 72 sentences were fed to the model, a precision and recall of <.0001 were returned for the label HATE SPEECH. This means that there were no true positives and all results were predicted wrongly, which makes sense when taking into account the small size of the dataset. As the data was labelled as HATE SPEECH, OFFENSIVE and NEUTRAL, it can be said that for the baseline models most data was predicted to be NEUTRAL.
Finally, about the performance of BERT it can be said that it is slightly better than that of the baseline models, but worse when taking into account the difference between training loss and validation loss. For example, for 2015 LinearSVC gives a score of 50% and BERT a score of 60% in the fourth epoch, but with an average training loss smaller than the validation loss. This could mean that the model is overfitting, which means it would not generalize well to data it has not seen before.
6 CONCLUSION
It can be concluded that running a small dataset on both the
baseline models and the Google Transformer model BERT
does not lead to outstanding performance of the models. It
does show however, that the presence of political hate speech
can be detected in a small dataset, but that the models cannot
be generalized based on this study alone. Therefore, future
work is needed, especially with regards to bias in existing
datasets.
In terms of generalization of the model, it would have been better to have more data for comparison, for example to compare Trump to another prominent figure or politician with regards to hate speech and see how well the model performs in that case. For instance, corpora of both Trump’s and Obama’s tweets, labeled according to the three classes, could be compared to each other. Another way to measure how well the corpus used for this study performs in comparison to other datasets would be to run the dataset of Davidson et al. [7] through these models and compare the results.
A further limitation concerns the corpus of political tweets. With regards to the results, it can be stated that the sample might have been too small to generalize the results. The reason for this small sample, however, was that in this way the level of HATE SPEECH could be measured for each separate year, and therefore a conclusion could be drawn about the presence and development of hate in Donald Trump’s tweets over the course of six years, before and during his presidency.
Finally, using keywords to crawl a database, in combination with human annotation, to create a corpus of hateful language could be a fruitful method for future research on hate speech detection, provided personal bias is taken into account. These steps might not only help to reduce bias, but can also offer linguistic insights into methods for detecting online hate.
ACKNOWLEDGEMENTS
I would like to thank Rodolfo Ramirez and Mason Levy from Swivl for kindly giving me the opportunity, free of charge, to use part of their platform for human annotation. This made the task of annotation slightly easier for myself and my
coders. Secondly, I would like to thank Lester van der Pluijm
for providing me with helpful and extensive feedback during
the last phase of this thesis. Lastly, I am grateful to Reshmi
Gopalakrishna Pillai for supervising me on this project and
providing me with technical knowledge and suggestions.
REFERENCES
[1] Valerio Basile et al. “SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter”. In: SemEval@NAACL-HLT. 2019.
[2] Nina Bauwelinck et al. “LT3 at SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (hatEval)”. In: 2019, pp. 436–440. doi: 10.18653/v1/s19-2077. url: http://evalita.org.
[3] Shanita Biere. “Hate Speech Detection Using Natural Language Processing Techniques”. Master Business Analytics, Vrije Universiteit Amsterdam (2018), p. 30.
[4] Robert Alan Brookey and Brian L. Ott. Trump and Twitter 2.0. Jan. 2019. doi: 10.1080/15295036.2018.1546052.
[5] Alexander Brown. “What is so special about online (as compared to offline) hate speech?” In: Ethnicities 18.3 (June 2018), pp. 297–326. issn: 17412706. doi: 10.1177/1468796817709846.
[6] Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. “Racial Bias in Hate Speech and Abusive Language Detection Datasets”. In: (May 2019). url: http://arxiv.org/abs/1905.12516.
[7] Thomas Davidson et al. “Automated hate speech detection and the problem of offensive language”. In: Proceedings of the 11th International Conference on Web and Social Media, ICWSM 2017. Mar. 2017, pp. 512–515. isbn: 9781577357889. url: http://arxiv.org/abs/1703.04009.
[8] Gunn Enli. “Twitter as arena for the authentic outsider: exploring the social media campaigns of Trump and Clinton in the 2016 US presidential election”. In: European Journal of Communication 32.1 (Feb. 2017), pp. 50–61. issn: 14603705. doi: 10.1177/0267323116682802.
[9] Santiago González-Carvajal and Eduardo C. Garrido-Merchán. “Comparing BERT against traditional machine learning text classification”. In: (May 2020).
[10] Sameer Hinduja and Justin W. Patchin. “Connecting Adolescent Suicide to the Severity of Bullying and Cyberbullying”. In: Journal of School Violence 18 (2019), pp. 333–346.
[11] Jack Holland and Ben Fermor. “Trump’s rhetoric at 100 days: contradictions within effective emotional narratives”. In: Critical Studies on Security 5.2 (May 2017), pp. 182–186. issn: 2162-4887. doi: 10.1080/21624887.2017.1355157.
[12] Muhammad Okky Ibrohim and Indra Budi. “Multi-label Hate Speech and Abusive Language Detection in Indonesian Twitter”. In: (2019), pp. 46–57. doi: 10.18653/v1/w19-3506.
[13] Jean-Marie Kamatali. “The U.S. First Amendment Tradition and Article 19”. In: (2013).
[14] Ramona Kreis. “The “Tweet Politics” of President Trump”. In: Journal of Language and Politics 16.4 (Oct. 2017), pp. 607–618. issn: 1569-2159. doi: 10.1075/jlp.17032.kre.
[15] Robin T. Lakoff. “The hollow man”. In: Journal of Language and Politics 16.4 (Oct. 2017), pp. 595–606. issn: 1569-2159. doi: 10.1075/jlp.17022.lak.
[16] Shervin Malmasi and Marcos Zampieri. “Challenges in discriminating profanity from hate speech”. In: Journal of Experimental and Theoretical Artificial Intelligence 30.2 (Mar. 2018), pp. 187–202. issn: 13623079. doi: 10.1080/0952813X.2017.1409284. url: http://arxiv.org/abs/1803.05495.
[17] Shervin Malmasi and Marcos Zampieri. “Detecting hate speech in social media”. In: International Conference Recent Advances in Natural Language Processing, RANLP. Dec. 2017, pp. 467–472. isbn: 9789544520489. doi: 10.26615/978-954-452-049-6-062. url: http://arxiv.org/abs/1712.06427.
[18] Ricardo Martins et al. “Hate speech classification in social media using emotional analysis”. In: Proceedings - 2018 Brazilian Conference on Intelligent Systems, BRACIS 2018. IEEE, Dec. 2018, pp. 61–66. isbn: 9781538680230. doi: 10.1109/BRACIS.2018.00019.
[19] Marzieh Mozafari, Reza Farahbakhsh, and Noël Crespi. “A BERT-Based Transfer Learning Approach for Hate Speech Detection in Online Social Media”. In: Studies in Computational Intelligence 881 SCI (2020), pp. 928–940. issn: 18609503. doi: 10.1007/978-3-030-36687-2_77.
[20] Karsten Müller and Carlo Schwarz. From Hashtag to Hate Crime: Twitter and Anti-Minority Sentiment. Tech. rep. 2019. url: www.carloschwarz.eu.
[21] Sashaank Pejathaya Murali. “Detecting Cyber Bullies on Twitter using Machine Learning Techniques”. In: International Journal of Information Security and Cybercrime 6.1 (2017), pp. 63–66. issn: 22859225. doi: 10.19107/ijisc.2017.01.07.
[22] John T. Nockleby. Hate Speech. In: Encyclopedia of the American Constitution, 2000, p. 2.
[23] Brian L. Ott. The age of Twitter: Donald J. Trump and the politics of debasement. Jan. 2017. doi: 10.1080/15295036.2016.1266686.
[24] David Robinson, Ziqi Zhang, and Jonathan Tepper. “Hate speech detection on twitter: Feature engineering v.s. feature selection”. In: Lecture Notes in Computer Science. Vol. 11155 LNCS. Springer Verlag, 2018, pp. 46–49. isbn: 9783319981918. doi: 10.1007/978-3-319-98192-5_9.
[25] Stephen Rushin and Griffin Sims Edwards. “The Effect of President Trump’s Election on Hate Crimes”. In: SSRN Electronic Journal (2018). doi: 10.2139/ssrn.3102652. url: https://ssrn.com/abstract=3102652.
[26] Neil J. Salkind (ed.). Encyclopedia of Research Design. SAGE Publications, 2010.
[27] Joni Salminen et al. “Developing an online hate classifier for multiple social media platforms”. In: Human-centric Computing and Information Sciences 10.1 (2020), pp. 1–34. issn: 21921962. doi: 10.1186/s13673-019-0205-6.
[28] Joni Salminen et al. “Online Hate Interpretation Varies by Country, But More by Individual: A Statistical Analysis Using Crowdsourced Ratings”. In: 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS) (2018), pp. 88–94.
[29] Anna Schmidt and Michael Wiegand. A Survey on Hate Speech Detection using Natural Language Processing. Tech. rep., pp. 1–10.
[30] Yiwen Tang and Nicole Dalzell. “Classifying Hate Speech Using a Two-Layer Model”. In: Statistics and Public Policy 6.1 (2019), pp. 80–86. issn: 2330443X. doi: 10.1080/2330443X.2019.1660285.
[31] Zeerak Waseem. “Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter”. In: 2016, pp. 138–142. doi: 10.18653/v1/w16-5618.
[32] Zeerak Waseem and Dirk Hovy. “Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter”. In: 2016, pp. 88–93. doi: 10.18653/v1/n16-2013. url: http://github.com/zeerakw/hatespeech.
[33] Matthew L. Williams et al. “Hate in the Machine: Anti-Black and Anti-Muslim Social Media Posts as Predictors of Offline Racially and Religiously Aggravated Crime”. In: The British Journal of Criminology 60.1 (2020), pp. 93–117. issn: 0007-0955.
Appendices
A DATA
A.1 Overview class balance
Figure 8: Distribution of the labels 2015
Figure 9: Distribution of the labels 2016
Figure 10: Distribution of the labels 2017
Figure 12: Distribution of the labels 2019
Figure 13: Distribution of the labels 2020
B RESULTS
B.1 Unigrams, bigrams & trigrams for hate speech

2019 Labeled Trump tweets
Top unigrams: Mexico, illegal
Top bigrams: United States, Human Traffickers
Top trigrams: Our Country FULL, strong immigration laws

2018 Labeled Trump tweets
Top unigrams: crime, Border
Top bigrams: terrorist attack, We need
Top trigrams: Mexico stop onslaught, Southern Border They

2017 Labeled Trump tweets
Top unigrams: refugees, countries
Top bigrams: MS 13, THE WALL
Top trigrams: help North Korea, President Xi China

2016 Labeled Trump tweets
Top unigrams: ISIS, https
Top bigrams: Crooked Hillary, pay wall
Top trigrams: Mexico pay wall, elected President going

2015 Labeled Trump tweets
Top unigrams: https, Muslim
Top bigrams: The media, WE NEED
Top trigrams: 11 Muslim Celebrations, Source 11 Muslim
B.2 Boxplots
Figure 14: Boxplot accuracy scores 2019
Figure 15: Boxplot accuracy scores 2018
Figure 16: Boxplot 2017
Figure 17: Boxplot 2016
Figure 18: Boxplot 2015
B.4 BERT

Results 2019
For 2019, the output after exploring the data is not significantly different from the whole dataset and therefore suitable for training the model. The maximum sentence length is
Figure 19: Confusion matrix 2019
Figure 20: Confusion matrix 2018
49. Furthermore, 83 training samples and 10 validation samples were created. After training, the outcome of the epochs showed:
Figure 21: Confusion matrix 2017
Figure 22: Confusion matrix 2015
Figure 23: Confusion matrix 2016
Epoch  Accuracy  F1-score  Avg. training loss  Validation loss
1      0.63      0.30      1.02                0.83
2      0.67      0.35      0.86                0.74
3      0.76      0.58      0.71                0.66
4      0.78      0.60      0.62                0.66

Results 2018
For 2018, the output after exploring the data is not significantly different from the whole dataset and therefore suitable for training the model. The maximum sentence length is 49. Furthermore, 83 training samples and 10 validation samples were created. After training, the outcome of the epochs showed:
Epoch  Accuracy  F1-score  Avg. training loss  Validation loss
1      0.63      0.30      1.02                0.83
2      0.67      0.35      0.86                0.74
3      0.76      0.58      0.71                0.66
4      0.78      0.60      0.62                0.66

Results 2017
For 2017, the output after exploring the data is not significantly different from the whole dataset and therefore suitable for training the model. The maximum sentence length is 83. Furthermore, 88 training samples and 10 validation samples were created. After training, the outcome of the epochs showed:
Epoch  Accuracy  F1-score  Avg. training loss  Validation loss
1      0.60      0.29      1.16                1.00
2      0.70      0.41      1.01                0.93
3      0.60      0.29      0.96                0.94
4      0.60      0.29      0.96                0.92

Results 2016
For 2016, the maximum sentence length is 55 and, after the 90-10 train-validation split, 98 training samples and 11 validation samples were extracted. After training, the outcome of the epochs showed:
Epoch  Accuracy  F1-score  Avg. training loss  Validation loss
1      0.55      0.24      1.06                0.98
2      0.36      0.24      0.97                1.01
3      0.36      0.24      0.91                1.03
4      0.36      0.24      0.89                1.02

C ANNOTATION GUIDELINES