Hate speech analysis on Twitter
With human annotation platform Swivl and BERT
SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE
Kimberley van Ruiven
10800395
Master Information Studies: Information Systems
Faculty of Science
University of Amsterdam
24 August 2020
1st Examiner: R. Gopalakrishna Pillai
2nd Examiner: Mr. F.M. Nack
“To hate
Is an easy lazy thing
But to love
Takes strength
Everyone has
But not all are
Willing to practice”
Hate speech analysis on Twitter
With human annotation platform Swivl and BERT
Kimberley van Ruiven, 10800395, University of Amsterdam
https://github.com/pastatimes/Master-Thesis-HateSpeech-Twitter

ABSTRACT
This study compared baseline classification algorithms to the pre-trained Google Transformer model BERT for detecting online political hate speech on Twitter over the course of six years, by creating a labelled corpus. The problem with online political hate speech is that it is often hard to define and therefore even harder to detect. Building on existing definitions of hate, the multi-label classification of tweets as HATE, OFFENSIVE and NEUTRAL and the comparison of baseline classification algorithms to Google BERT could offer new insights into the analysis of online hate. By extracting online political hate using keywords, and combining this with the manual analysis of non-expert annotators, a labelled corpus is built that can be given to non-pre-trained and pre-trained models, whose performance can then be measured in terms of f1-score, precision and recall. The impact of this research is that it offers a better understanding of the best performing model for detecting online hate and, hopefully, better insight into ways to create more racially and socially unbiased datasets.
Keywords: Natural Language Processing, hate speech detection, Twitter, human annotation, multi-label text classification, Swivl, machine learning, classification algorithms, BERT
1 INTRODUCTION
’....These THUGS are dishonoring the memory of George
Floyd, and I won’t let that happen. Just spoke to Governor
Tim Walz and told him that the Military is with him all the
way. Any difficulty and we will assume control but, when
the looting starts, the shooting starts. Thank you!’ — Donald
J. Trump (@realDonaldTrump), May 29, 2020¹
This is the first time that a tweet from a US President has been flagged under Twitter’s guidelines on ’the glorification of violence’². Factors that Twitter takes into consideration when labelling a tweet as an expression of toxic behavior are whether a group of people is attacked on
1 https://twitter.com/realDonaldTrump/status/1266231100780744704
2 https://help.twitter.com/en/rules-and-policies/glorification-of-violence
the grounds of characteristics that belong to the nature of their being, which could lead to hate-driven violence or intolerance³. Such socio-demographic factors include race, religious affiliation, ethnicity, national origin, age, disability, sexual orientation and gender. Stating that certain events of hate-driven violence can occur as a consequence of hate speech is a slippery slope, because what exactly does hate entail?
Over the years there have been several attempts to define hate speech. Nockleby [22] defines hate speech as ’any communication that disparages a person or a group on the basis of some characteristics such as race, color, ethnicity, gender, sexual orientation, nationality, religion or other characteristics’. Davidson [7], on the other hand, defines it as ’language that is used to express hatred towards a group or is intended to be derogatory, to humiliate, or to insult members of a group’. Another way to approach it is Salminen’s 2020 definition of hate speech. He describes (online) hate as:
’(...) the use of language that contains either hate speech targeted toward individuals or groups, profanity, offensive language, or toxicity – in other words, comments that are rude, disrespectful, and can result in negative online and offline consequences for the individual, community, and society at large.’ [27]
Expressing hate is nothing new. Heightened levels of hate speech can be observed in the run-up to historic tragedies⁴. The global rise of right-wing populism and its focus on immigration and Islam as a threat to national identity over the past few years have made hate speech a more common trait of ordinary politics⁵ [14][11]. In addition to this, the surge of social media has made it progressively easier to express notions of hate speech online [29]. One of the notable differences between offline and online hate is the scope of their impact. Cyberbullying, for example, a form of online hate directed against an individual, can in the worst cases lead to suicide [10]. The recent death of a
3 https://help.twitter.com/en/rules-and-policies/glorification-of-violence
4 https://www.un.org/en/holocaustremembrance/docs/pdf/Volume%20I/The%20Holocaust%20as%20a%20Guidepost%20for%20Genocide%20Detection.pdf
5 https://www.ohchr.org/EN/NewsEvents/Pages/DisplayNews.aspx?NewsID=25036&LangID=E
Japanese wrestler in response to being cyberbullied underlines this notion⁶. In a similar fashion, Williams et al. [33] used computational criminology to show the correlation between hate speech in tweets targeting race and religion and an increase in physical attacks against these groups in London over a period of eight months.
In the last few years alone there has been a rise in racially fuelled US hate crimes, according to a 2017 FBI report⁷ ⁸. Edwards and Rushin argued that a correlation exists between the outcome of the 2016 US presidential elections and a reported rise in hate crimes committed against ethnically marginalized groups across the United States [25]. Research by Müller and Schwarz showed similar findings: they argued that an increase in anti-Muslim sentiment can be noted since the start of Donald Trump’s campaign in counties with frequent Twitter use [20]. Hate crimes directed at Asian-Americans as a consequence of the 2020 coronavirus are another recent example of the implications of racially motivated violence⁹. Lastly, the recent murder of George Floyd and the ongoing worldwide protests underline both the gravity and the scope of the problems around racially motivated violence¹⁰ ¹¹.
Against a backdrop of rising hate crimes, it can be argued that there is a need for more profound methods for detecting different types of online hate speech. From 2017 to 2020 there have been continuous efforts to detect online hate across multiple fields, both inside and outside the academic world [7][21][17][3][18][16][24][30][12]. In 2018, for example, Kaggle launched a competition for classifying toxic online comments¹². Despite the fact that hate speech is an extensively researched topic, detecting its online use has proven a challenging task.
There are several factors that complicate the detection of online hate speech. In the first place, the notion of hate is often grounded in personal beliefs [28]. In the second place, no unanimous definition of hate speech has been used in the research on online hate speech detection
6 https://www.nytimes.com/2020/06/01/business/hana-kimura-terrace-house.html
7 https://www.npr.org/2017/11/13/563894761/fbi-data-shows-the-number-of-hate-crimes-are-rising
8 https://www.latimes.com/nation/la-na-fbi-hate-crimes-20181113-story.html
9 https://www.newyorker.com/news/letter-from-the-uk/the-rise-of-coronavirus-hate-crimes
10 https://www.justice.gov/hatecrimes/hate-crimes-case-examples
11 https://www.adl.org/education/resources/tools-and-strategies/george-floyd-racism-and-law-enforcement-in-english-and-en
12 https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/
[27], since there exist different types of hate. Hate can be aimed at individuals or groups and can at the same time take the shape of misogyny, xenophobia or racism. These factors make it challenging to build on existing studies that apply methods for detecting it [27]. Another factor that complicates research on hate speech is the lack of labeled data, which is often a requirement for developing hate speech detection methods [6][19]. A further complicating factor worth mentioning is that the legislation around hate speech is often unclear [5]. The US, for example, does not have a law on hate speech¹³. These factors complicate the detection of online expressions of hate. This research aims to provide more insight into ways to detect online hate speech for linguistic research.
Since the language that Donald Trump uses has been defined as right-wing populism, which has been shown to have a close connection to hate speech [4][11][8][15], and since he uses Twitter as his main channel of communication [4][14], his tweets are taken as the basis for the corpus created for this thesis. The connection to hate speech, in combination with his extensive collection of tweets and a following of 84.9M¹⁴ at the time of writing, makes him a suitable and relevant politician to investigate with regards to online hate. The research question of this thesis therefore is:
How can political hate speech in tweets be detected over the course of six years with supervised multi-class algorithms and Google BERT by comparing their metrics (precision, recall and F1-score)?
In order to answer the main question, several sub-questions
are formulated:
How can a human annotated dataset from tweets be generated in which bias is taken into consideration as a factor of influence on the labels?
Which scope for text-classification is most suitable with respect to hate speech detection in tweets?
Which features of a tweet (including linguistic structure) are helpful for identifying it as hate speech?
Which baseline classification algorithms can be used to generate a prediction result of a labeled dataset for hate speech detection?
13 http://www.ala.org/advocacy/intfreedom/hate
14 https://www.socialbakers.com/statistics/twitter/profiles/detail/25073877-realdonaldtrump
How can the baseline classification algorithms be optimized by adjusting the (hyper) parameters for obtaining the highest performance metrics (precision, recall and F1-score) for multi-label text-classification?
How can the Transformer model BERT Base be optimized by adjusting the (hyper) parameters for obtaining the highest performance metrics (precision, recall and F1-score) for multi-label text-classification?
The contributions of this research are the creation of a human-labeled corpus and the comparison of baseline classification algorithms to the Google Transformer model BERT. The approach adopted to answer the research question yields the detection and analysis of hate in a corpus of US political tweets and an overview of the highest performing model for this classification task.
This thesis is organized as follows: Section 2 describes related
work on hate speech legislation, hate speech definitions, bias
in datasets and existing methods for detecting online hate
speech. Section 3 describes the methodology with regards to
the dataset, the pre-processing steps, the extracted features
and classification models and the BERT Base Transformer
model. Section 4 describes the results of the classification
models and the BERT model in terms of their performance in
comparison to each other. Section 5 describes the discussion
and conclusion with respect to gaps in the thesis project and
future work.
2 RELATED WORK
2.1 Legislation on hate speech
First and foremost, an important factor to take into consideration with regards to analyzing US hate speech is the legislation that surrounds it. One of the reasons hateful tweets are challenging to restrict is that restrictions often conflict with the freedom of expression, which varies per country [5]. The US First Amendment, for example, protects the right of free speech. This means that expressions of hate cannot be punished under US law¹⁵ [13]:
’Speech that demeans on the basis of race, ethnicity, gender, religion, age, disability, or any other similar ground is hateful; but the proudest boast of our free speech jurisprudence is that we protect the freedom to express “the thought that we hate.” ’¹⁶
15 https://www.thefire.org/issues/hate-speech/
16 http://www.ala.org/advocacy/intfreedom/hate
This regulation contradicts hate speech laws in other liberal democracies, which often have an anti-hate clause as part of their constitution [13]. Within an international context, these differences have led to a distinction within the international human rights convention, the International Bill of Human Rights. In particular, an opposition can be observed between two bodies of law that are part of this convention: international human rights law and international criminal law, as the latter is more in line with the US First Amendment in how it addresses hate. This difference in international legislation results from the fact that both the US and other liberal democracies have tried to shape the concept of freedom of expression within this law, but the influence of the US weighs more heavily [13]. The consequence is that, under international criminal law, hate speech cannot be punished. Therefore, US hate speech cannot be punished according to either national or international standards.
2.2 Definitions of hate speech
With regards to the notion of online hate speech, several definitions have been used in previous research. The following figure and table constitute an overview of the definitions used in research from 2017 to 2020, their target (group, individual or both), followed by the author and year [7][18][22][2][27], and their possible consequences in terms of harm [27]. The definition used in this research is the one from Salminen [27]. In this thesis, the emphasis lies on hate targeted at a group, in which a distinction is made between hate speech and profanity.
Figure 1: Degrees of harm [27]
2.3 Hate speech or profanity?
Figure 2: Overview of online hate definitions
Hate speech and offensive language are closely related and hard to distinguish, not only because words are often ambiguous but also because the meaning of both hate speech and offensive language rests on small linguistic characteristics [7].
In previous research on hate speech detection, Basile et al. [1] created conditions for hate, offensive and neutral language that were used by human annotators to build a labeled corpus of tweets. These descriptions provide a better understanding of the differences between hate and profanity. Summarized, they state that hate speech is present in a tweet when the following requirements are met:
A. the tweet content MUST have X as main TARGET, or even a single individual, but considered for his/her membership in that category (and NOT for the individual characteristics);
B. we must deal with a message that spreads, incites, promotes or justifies HATRED OR VIOLENCE TOWARDS THE TARGET, or a message that aims at dehumanizing, hurting or intimidating the target.
They state that offensive language is present in a tweet when the requirements below are met:
A. the tweet content MUST have X as main TARGET, or even a single individual, but considered for his/her membership in that category (and NOT for the individual characteristics);
B. we must deal with a message that focuses on the potentially HURTFUL EFFECT of the tweet content on a given target.
Another way to approach the concept of hate is by grouping hate speech and offensive language under the same definition. Waseem & Hovy [32] do not make a distinction between hate speech and offensive language. They state that a tweet is offensive, and therefore can be considered hate speech, when it:
1. uses a sexist or racial slur;
2. attacks a minority;
3. seeks to silence a minority;
4. criticizes a minority (without a well founded argument);
5. promotes, but does not directly use, hate speech or violent crime;
6. criticizes a minority and uses a straw man argument;
7. blatantly misrepresents truth or seeks to distort views on a minority with unfounded claims;
8. shows support of problematic hashtags, e.g. “BanIslam”, “whoriental”, “whitegenocide”;
9. negatively stereotypes a minority;
10. defends xenophobia or sexism;
11. contains a screen name that is offensive, as per the previous criteria, the tweet is ambiguous (at best), and the tweet is on a topic that satisfies any of the above criteria.
It can be noted that in previous research on hate speech detection, not only the definitions of hate but also the annotation guidelines for detecting hate or profanity differ. This increasingly complicates research aimed at improving existing methods.
2.4 Racial and social bias in existing hate speech datasets
Before starting the actual research and creating a corpus of US political tweets, a short note on research about racial and social bias in labeled data is in order. Davidson et al. [6] argued that existing hate speech and abusive language detection datasets, built to detect hate aimed at minority groups, persistently show a racial bias towards African-Americans. They explain that using a dataset that is biased in this manner will result in discrimination against this group by the algorithm derived from it. Examples are when a corpus labels misogynistic data as neutral or offensive language instead of hate, or when slang is mistakenly labeled as hate speech [6].
Davidson et al. state that racial or social bias can be a consequence of keyword-based data collection. The choice for certain keywords can, for example, lead to the over-representation of specific minority groups within the corpus under a specific label. This can be observed in the Hatebase lexicon, an online database of language that has been defined as hate based on keywords, in which African-American English is over-represented [6]. They state that bias can be avoided or reduced by re-examining the keywords used to create a corpus of hateful data. For example, specific words such as the "n-word" do not necessarily have a hateful intent towards groups of people. Rather, this depends much more on the context and, above all, on the user of the word, given the historical origin of the term.
Furthermore, Davidson et al. state that, when human-annotated data is chosen, the annotators bring with them their own bias [6]. According to their research this bias can be reduced to some extent by working with several different domain experts, such as activists in a certain field, in contrast to crowd workers or academics.
Lastly, Waseem [31] conducted a study on annotator influence on hate speech detection in tweets, focusing on racism and sexism. He found that algorithms trained on expert annotations outperform algorithms trained on non-expert annotations. He also found that expert annotators, whom he refers to as ’feminist and anti-racist activists’, were less likely to label a tweet as hate speech than non-expert annotators. These findings are in line with the research of Davidson [6].
2.5 Methods for detecting online hate speech
2.5.1 Lexical methods
The use of online hate speech is widespread across multiple platforms, but this study will focus solely on Twitter. One way to detect hate speech in tweets is by using keywords to identify hateful language. For this purpose, the hate speech lexicon Hatebase was developed: a database of hateful words that can be used to filter hate speech from social media platforms, with the goal of predicting regional violence as a consequence of hate speech¹⁷.
Davidson et al. [7] used the Hatebase lexicon as keywords to crawl Twitter data. They obtained 25k tweets containing hateful words from this repository, which they then, for accuracy purposes, had annotated by human workers at Appen [7]. Of all filtered hate tweets, only 5% were labelled by human annotators as hate speech, which shows the imprecision of Hatebase and the problem with lexical methods. Schmidt & Wiegand [29], however, found that lexicon resources can give promising results when used in combination with other methods.
2.5.2 Supervised multi-label classification
In the field of online hate speech detection, several relevant studies have been conducted. In 2017, Malmasi & Zampieri [17] did a multi-label study on methods for detecting hate speech on social media, in which they made a distinction between hate and profanity. The dataset they used consisted of tweets they annotated with the categories ’HATE’, ’OFFENSIVE’ and ’OK’. They applied standard lexical features and a linear SVM classifier as a starting point for their study. Their results show that a character 4-gram model gave the best results, but that discriminating hate speech from profanity remains a difficult task. In 2018 they conducted the same research with a different dataset [16]. They conclude again that discriminating hate speech from profanity is very hard
17 https://hatebase.org/
and that the standard features they applied might not be
enough for telling them apart with high accuracy.
Davidson et al. [7] did a 2017 study on multi-class methods for detecting hate speech. They defined the labels HATE, PROFANITY and NEUTRAL language and found that words have the biggest probability of being defined as hate when containing ’multiple racial and homophobic slurs’ [7]. They also found that hate speech labelling, in particular distinguishing it from offensive language, reflects people’s own racial or social bias.
The 2018 research of Martins et al. [18] builds partly on the findings of Malmasi & Zampieri. They did a study using NLP techniques and emotion analysis for hate speech classification based on the labels HATE, PROFANITY and NEITHER. In their study they applied a combination of lexicon-based and machine learning approaches to forecast hate speech. They conclude that the inclusion of emotions into the model improved the performance on hate speech detection.
Lastly, in 2019 Ibrohim & Budi [12] used multi-label text-classification for hate speech detection with machine learning approaches such as Support Vector Machine (SVM), Naïve Bayes (NB) and Random Forest (RF), combined with feature extraction. They conclude that the Random Forest classifier gives the best results.
2.5.3 BERT Base for classification tasks
In 2018 Google developed BERT Base, a pre-trained Transformer model trained on the Wikipedia and BookCorpus datasets. It can therefore be used relatively easily for classification tasks such as hate speech detection, and it is freely available on GitHub¹⁸ [27]. This means that BERT, in contrast to older classification models, only has to be fine-tuned by initializing it with pre-trained parameters. All parameters, in their turn, are fine-tuned using labeled data for particular goals [9]. This enables BERT to outperform older classification models that have to be built and trained from scratch. The findings of [9] show that on a dataset of tweets for binary classification, BERT outperforms the traditional approaches to classification.
Mozafari et al. came to a similar conclusion: BERT Base outperforms previous models for hate speech detection in tweets in terms of precision, f1-score and recall [19]. For this research, they classified tweets based on five labels. To make optimal use of BERT Base, they fine-tuned the parameters by training the classifier. Lastly, they suggest that
18 https://colab.research.google.com/drive/1Y4o3jh3ZH70tl6mCd76vz_IxX23biCPP#scrollTo=
using the BERT Base model could lead to fruitful insights in future studies into reducing the bias in hate speech datasets.
Finally, Salminen et al. did a study on an online hate speech detection classifier for multiple platforms. They compared BERT to other classification models and came to the conclusion that the features extracted by BERT are the most important for forecasting an outcome [27].
3 METHODOLOGY
3.1 Dataset
For the purpose of this research the tool Trump Twitter Archive¹⁹ was used. This is a large unfiltered database that contains every tweet of the Twitter account @realDonaldTrump from 2009 until today. It enables a user to search for tweets during a specific time period based on keywords. The Trump campaign started on June 16th, 2015; this is the starting point of the dataset. The dataset was searched on May 12th, 2020, which is the end date used to analyze tweets from the @realDonaldTrump account. Each tweet is represented by its tweet ID, text, date, hashtags and possible retweet.
The total number of tweets is 24,137, and the number of tweets per year is:

Year  Tweets
2015  7680
2016  3940
2017  2232
2018  2998
2019  5936
2020  1351

3.2 Keyword-based filtering
Existing hate speech lexicons such as Hatebase.org were not used for this research, due to inconsistencies and contradictions in their labelling process [7]. Instead, the tweets were filtered based on keywords that could potentially be associated with higher levels of hate speech. Given the limited time scope of this research, only tweets that could be hateful towards groups of people were analyzed; hateful language aimed at a single individual was not taken into consideration. The groups of people mentioned in Trump’s tweets that are often associated with higher levels of hate are immigrants, ’radical Islamic terrorism’ and specific countries or nationalities, such as China [27][32][15][23][11]. Therefore, the keywords used to filter this dataset are:
19 http://www.trumptwitterarchive.com/
- Mexico - Mexican - Mexicans - immigrant - immigrants
- Hispanic - Hispanics - Muslim - Muslims - Islam
- rapist - rapists - ISIS - gang member - gang members
- blacks - why don’t they go back - bring them back to where they came from
- these aren’t people - human traffickers - coyotes - invasion
- infestation - abuser - shithole countries - terrorist - terrorists
- rats - migrant - migrants - refugee - refugees
- congresswomen - cartel - cartels - China - Chinese virus
- viciously telling the people of the United States
- you can’t leave fast enough
As can be noted, most keywords are single words; however, several sentences are included in this list as well. These sentences are:
- why don’t they go back
- bring them back to where they came from
- these aren’t people
- viciously telling the people of the United States
- you can’t leave fast enough
These sentences refer to several tweets in which Trump expressed himself about four members of Congress with respect to their ethnic backgrounds²⁰ ²¹:
So interesting to see “Progressive” Democrat Congresswomen, who originally came from countries whose governments are a complete and total catastrophe, the worst, most corrupt and inept anywhere in the world (if they even have a functioning government at all)
.... now loudly and viciously telling the people of the United States, the greatest and most powerful Nation on earth, how our government is to be run. Why don’t they go back and help fix the totally broken and crime infested places from which they came. Then come back
.... and show us how it is done. These places need your help badly, you can’t leave fast enough. I’m sure that Nancy Pelosi would be very happy to quickly work out free travel arrangements!
After filtering the tweets based on these keywords, a dataset of 1031 tweets remained. This number of tweets was chosen deliberately, given the time period for this research and the fact that all data had to be annotated within this period by separate unpaid annotators. A sketch of this filtering step is given below.
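To make the filtering step concrete, the following minimal sketch applies a case-insensitive keyword match to an exported tweet archive. It is an illustration, not the thesis’ actual code: the file name trump_tweets_2015_2020.csv, the column name text and the shortened keyword list are assumptions.

```python
# Sketch of the keyword-based filtering step (assumed file/column names).
import re

import pandas as pd

KEYWORDS = [
    "Mexico", "immigrant", "Hispanic", "Muslim", "Islam", "rapist",
    "ISIS", "gang member", "invasion", "infestation", "terrorist",
    "migrant", "refugee", "cartel", "Chinese virus",
    "why don't they go back", "these aren't people",
]  # shortened; the full keyword list is given above

tweets = pd.read_csv("trump_tweets_2015_2020.csv")  # hypothetical export

# Build one case-insensitive pattern; re.escape guards regex metacharacters.
pattern = "|".join(re.escape(k) for k in KEYWORDS)
mask = tweets["text"].str.contains(pattern, case=False, regex=True, na=False)
filtered = tweets[mask]
print(f"{len(filtered)} potentially hateful tweets kept")
```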
20 https://www.nbcnews.com/politics/donald-trump/trump-says-progressive-congresswomen-should-go-back-where-they-came-n1029676
21 https://www.theatlantic.com/magazine/archive/2020/09/the-end-of-denial/614194/
The number of tweets per year after filtering is:

Year  Tweets
2015  104
2016  122
2017  109
2018  166
2019  548
2020  72
A boxplot of the data after filtering can be noted in Figure 3:
Figure 3: Boxplot filtered tweets 2015 to 2020
The boxplot shows the length of tweets over the years 2015 to 2020. It can be noted that the length of Donald Trump’s tweets changed in this period. In 2017, the maximum number of characters that a tweet can hold changed from 140 to 280²². This is reflected in his tweets, with a clear distinction in 2017, which shows the most outliers; the gradual shift in character length over the course of that year could explain this. It can furthermore be concluded that in all years Donald Trump made use of the maximum character length. This factor could affect the performance of the models.
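Continuing the previous sketch, a boxplot along the lines of Figure 3 could be produced as follows; the date column name is again an assumption.

```python
# Sketch: tweet length (in characters) per year, as in Figure 3.
import matplotlib.pyplot as plt

lengths = filtered.assign(
    year=pd.to_datetime(filtered["date"]).dt.year,  # "date" column assumed
    length=filtered["text"].str.len(),
)
lengths.boxplot(column="length", by="year")
plt.ylabel("Tweet length (characters)")
plt.suptitle("")  # drop pandas' automatic grouping title
plt.title("Filtered tweets 2015 to 2020")
plt.show()
```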
3.2.2 Human annotation based on multiple classes with Swivl
The next step of pre-processing is annotating the data. The data was annotated manually by three people with backgrounds in, respectively, Literature Studies, Communication & Media Studies and Spatial Planning. These people were selected because their studies are relevant to language or to the topics mentioned in the tweets, with the exception of the Spatial Planning student. Other factors that made these annotators relevant are their near-native English proficiency and the diversity of their backgrounds. Three annotators were chosen so that a majority decision can be used to assign a label to each tweet.
22 https://techcrunch.com/2017/11/07/twitter-officially-expands-its-character-count-to-280-starting-today/
In line with the research of Davidson [7], the data was labeled according to three classes: HATE SPEECH, OFFENSIVE LANGUAGE and NEUTRAL. The annotators were given the task to look at each tweet separately and rate it on a scale from 0 to 2, where HATE gets the label 0 and NEUTRAL the label 2. To make the annotation process a slightly less laborious and time-consuming task, the data labeling platform Swivl was used. This platform provides an interface for the annotators to label the data according to existing hate speech guidelines, adapted to fit the purpose of this research project. These guidelines are attached in Appendix C, Annotation Guidelines.
For each annotator an account was set up with which they could log in to annotate tweets in pools divided over the years, referred to as tasks. This way, the annotators could see how many tweets a task consisted of, so they could estimate per task how long it would take them. A period of two and a half weeks was reserved for the data labelling process. The majority vote was calculated by the Swivl platform to assign a label to each tweet. Corrections were done on Swivl’s end to ensure that each tweet was assigned a label; they were applied when a label did not have a majority vote, that is, when the votes were distributed evenly over the three labels. In this way it was ensured that every tweet in the dataset was classified based on a majority vote.
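The majority-vote rule can be summarized in a few lines. The sketch below is an illustration of that rule under the 0–2 label encoding described above, not Swivl’s implementation.

```python
# Sketch of the majority-vote rule used to assign one label per tweet.
from collections import Counter

HATE, OFFENSIVE, NEUTRAL = 0, 1, 2

def majority_label(votes):
    """Return the label chosen by at least two of the three annotators,
    or None for a three-way tie (resolved manually on Swivl's end)."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= 2 else None

print(majority_label([HATE, HATE, NEUTRAL]))       # 0 -> HATE
print(majority_label([HATE, OFFENSIVE, NEUTRAL]))  # None -> manual correction
```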
3.2.3 Pre-processing
Scikit-learn was used to clean the dataset and build the pipeline. The pre-processing steps that were taken are:
- Removal of hashtags, URLs and retweets, as these are irrelevant for this hate speech task.
- Lowercasing of all tweets, which converts each string of text to lowercase.
- Tokenization, which cuts a string of text into tokens.
- Removal of stop-words, to reduce the number of noisy features.
- Removal of punctuation and excess whitespace, which are irrelevant for this text-classification task.
- Stemming with PorterStemmer, which reduces words to their stem and therefore improves performance.
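A sketch of these steps, using NLTK for tokenization, stop-words and stemming, could look as follows; it mirrors the listed pipeline but is not the thesis’ exact code.

```python
# Sketch of the pre-processing pipeline described above.
import re
import string

from nltk.corpus import stopwords        # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize  # requires nltk.download("punkt")

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(tweet):
    # Remove retweet markers, URLs and hashtags (irrelevant for this task).
    tweet = re.sub(r"\bRT\b|https?://\S+|#\S+", " ", tweet)
    # Lowercase and strip punctuation; excess whitespace disappears on tokenization.
    tweet = tweet.lower().translate(str.maketrans("", "", string.punctuation))
    # Tokenize, drop stop-words, and stem the remaining tokens.
    return [stemmer.stem(tok) for tok in word_tokenize(tweet)
            if tok not in stop_words]

print(preprocess("RT Totally broken and crime infested! #MAGA https://t.co/x"))
```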
3.2.4 Simple Feature Extraction
For this hate speech detection study, general features for text-classification of tweets were used to help the baseline models identify hate speech. The textual features that were extracted are mostly syntactic and can be divided into three categories:
- N-grams. An n-gram is a sequence of words or characters of a fixed length and can contain a sequence of ’n’ words or ’n’ characters. For this study, word n-grams were extracted, weighted by their TF-IDF.
- TF-IDF. This is a numerical feature that measures the relevance of specific words in a dataset. In this way, it becomes clear which words carry the most, or least, meaning.
- Bag of Words (BoW). This feature can be used with TF-IDF to collect text elements and convert them to tokens.
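These three feature types map directly onto scikit-learn, as the short sketch below shows: CountVectorizer for Bag of Words counts and TfidfVectorizer for TF-IDF-weighted word n-grams. The toy documents and the (1, 3) n-gram range are illustrative.

```python
# Sketch: Bag of Words counts and TF-IDF-weighted word n-grams.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "we need strong immigration laws",
    "mexico will pay for the wall",
]

bow = CountVectorizer().fit_transform(docs)  # Bag of Words: raw token counts
tfidf = TfidfVectorizer(ngram_range=(1, 3))  # unigrams through trigrams
features = tfidf.fit_transform(docs)         # TF-IDF-weighted n-gram features
print(features.shape)                        # (n_documents, n_ngram_features)
```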
For each year, a bar was plotted to display how balanced
the classes are. An overview of class balance for all years is
displayed in Figure 4:
Figure 4: Class distribution 2015-2020
3.3 Baseline classification algorithms
As baseline models for this research, Random Forest, Logistic Regression, Linear SVM and Naïve Bayes were chosen to generate prediction results on the labeled dataset for hate speech detection. These classification algorithms have been used in previous research on hate speech detection and generated positive prediction results [17][7][12]. An ensemble classifier of Logistic Regression and Linear SVM with hard voting was also considered, but was not used in this research because its performance matched the outcome of the individual models, with an accuracy of 85%. A workflow of the selected baseline models is depicted in Figure 5:
Figure 5: Workflow of the classification algorithms
3.3.1 Text classification baseline setup
To ensure a good baseline for hate speech detection with text-classification, the baselines were prepared as follows. The training dataset was split into five randomly shuffled folds and stratified cross-validation was used to measure performance. As can be observed in Figures 8 to 13 in Appendix A.1, Overview Class Balance, the classes for every year are unequally distributed. Therefore, Stratified K-fold was a suitable choice, not only due to the small sample size of the dataset, but also because this variant of cross-validation ensures that the fold samples do not imbalance the training data [26]. The scikit-learn functionality TfidfVectorizer was used to tokenize the features of the corpus; with this functionality, a text document can be converted to a list of tokens that can be given to the model. The parameter sublinear_tf was set to True to apply sublinear term-frequency scaling (replacing tf with 1 + log(tf)), which reduces the bias generated by the length of the document. Use_idf was also set to True so that the algorithm uses Inverse Document Frequency: terms that appear in many documents receive a lower weight than terms that appear less frequently but in specific documents. The vector norm was set to L2 to further reduce document-length bias, and stop-words were set to English. By walking through these steps, each tweet was represented by a set of features. The experiments were run ten times.
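Putting the above together, the sketch below evaluates the four baseline models with the described vectorizer settings under stratified five-fold cross-validation. The toy texts and labels are stand-ins for the annotated corpus; everything else follows the parameters stated above.

```python
# Sketch of the baseline setup: TF-IDF features + stratified 5-fold CV.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Stand-ins for the annotated corpus; 1=OFFENSIVE, 2=NEUTRAL.
texts = ["build the wall now"] * 10 + ["great rally tonight"] * 10
labels = [1] * 10 + [2] * 10

vectorizer = TfidfVectorizer(
    sublinear_tf=True,   # replace tf with 1 + log(tf)
    use_idf=True,        # down-weight terms occurring in many documents
    norm="l2",           # reduce document-length bias
    stop_words="english",
)

models = {
    "RandomForest": RandomForestClassifier(),
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "LinearSVC": LinearSVC(),
    "MultinomialNB": MultinomialNB(),
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(make_pipeline(vectorizer, model),
                             texts, labels, cv=cv)
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```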
3.3.2 Performance measurement
The output of each model was shown in boxplots of model accuracy. The highest performing model was evaluated by displaying the actual versus the predicted labels in a confusion matrix. After that, the misclassifications and their causes were outputted, the terms most correlated with each label were shown and, finally, the precision, recall and f1-score were displayed for each of the labels. As precision and recall were ill-defined, because some labels had no predicted samples, the parameter zero_division of the function metrics.classification_report was set to 1.
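Continuing the previous sketch, this evaluation step could look as follows; zero_division=1 silences the ill-defined-metric warning for labels that received no predictions.

```python
# Sketch: confusion matrix and per-label metrics for the best model.
from sklearn import metrics
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, stratify=labels, random_state=42)

best = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
best.fit(X_train, y_train)
predicted = best.predict(X_test)

print(metrics.confusion_matrix(y_test, predicted))
print(metrics.classification_report(y_test, predicted, zero_division=1))
```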
3.4 BERT Base
For every year, and for all years combined, the performance of the baseline models was compared to the pre-trained Transformer model BERT. As running BERT requires a GPU, Google Colab was used, since it provides free access to different GPUs and TPUs²⁴ ²⁵. The GPU used for this research is a Tesla K80. In contrast to the classification algorithms, BERT extracts its own features.
3.4.1 Text classification setup BERT
To obtain optimal performance, BERT was fine-tuned by adjusting its parameters and hyperparameters. The relevant hyperparameters are the batch size, the learning rate and the number of epochs. In contrast to the baseline models, a 90-10 split was used to create training and validation samples. In this way, the dataset could be split randomly and given to the model so that it could be validated on unseen data. For BERT, every year from 2015 to 2020 was trained and tested on separately, as well as all years combined. In contrast to the classification models, BERT is pre-trained, which means that after importing the data, BERT only has to be fine-tuned. As BERT works differently from these models, the Hugging Face library was installed and the model was trained on a specific column to classify the labels: class_of_interest = 'intent_id'. After this, the data is cleaned and explored based on the intent_id. Training and test data are then extracted from the file, and by comparing the class_of_interest and the value_id before and after extraction, it is verified that the training dataset is not significantly different from the whole dataset. Next, the BertTokenizer is loaded and the maximum sentence length is printed. The sentences are tokenized, the tokens are mapped to their word IDs, and training and validation samples are extracted. The batch size is set to 16, as this gave a better result than a batch size of 64, the number of epochs is set to 4 and the learning rate is left at its default. The seed value is set to 42.
24 https://research.google.com/colaboratory/faq.html
25 https://colab.research.google.com/drive/1Y4o3jh3ZH70tl6mCd76vz_IxX23biCPP#scrollTo=5llwu8GBuqMb
The accuracy, f1-score and validation loss were calculated for every epoch.
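A condensed sketch of this fine-tuning loop is given below. It follows the standard Hugging Face recipe rather than the thesis’ exact notebook; the toy tweets, the 3-label head and the 2e-5 learning rate (the thesis leaves the rate at its default) are assumptions.

```python
# Sketch: fine-tuning BERT Base for 3-class tweet classification.
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertForSequenceClassification, BertTokenizer

torch.manual_seed(42)  # seed value 42, as in the thesis
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy stand-ins for the annotated tweets; 0=HATE, 1=OFFENSIVE, 2=NEUTRAL.
texts = ["these aren't people", "crooked media", "great rally tonight"] * 8
labels = [0, 1, 2] * 8

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3).to(device)

enc = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"],
                        torch.tensor(labels))
loader = DataLoader(dataset, batch_size=16, shuffle=True)  # batch size 16

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # assumed rate
model.train()
for epoch in range(4):  # 4 epochs, as in the thesis
    total = 0.0
    for input_ids, attention_mask, y in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids.to(device),
                    attention_mask=attention_mask.to(device),
                    labels=y.to(device))
        out.loss.backward()  # cross-entropy loss returned by the model head
        optimizer.step()
        total += out.loss.item()
    print(f"epoch {epoch + 1}: avg training loss {total / len(loader):.2f}")
```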
3.4.2 Performance measurement
The output for every year was shown through the performance metrics accuracy and f1-score. The training loss and validation loss were also outputted to monitor whether the model was over- or underfitting.
4 RESULTS
The creation of a human-labeled corpus with the human annotation platform Swivl showed that annotator agreement is not entirely equally distributed. The use of guidelines for this annotation task, and the choice of annotators from various backgrounds, shows that racial bias can be reduced to some extent, as can be noted in the misclassifications of hate versus offensive language.
With regards to the most suitable scope for text-classification for hate speech detection in tweets, sentence-level is the most appropriate, as the classes are assigned to separate tweets.
4.1 Results baseline classification algorithms

2020 Labeled Trump tweets
For 2020, 72 sentences were given to the models and 600 different features were extracted. All baseline models were trained with stratified five-fold cross-validation. For all models, the second and third parameters were set to shuffle=True and random_state=42. Only for the year 2020 are the tables, boxplot and confusion matrix included in the text; for the other years, the figures are included in Appendix B.1 Unigrams, bigrams & trigrams for hate speech, B.2 Boxplots and B.3 Confusion matrices.
It can be noted that each fold divides the data in equal parts and that the class distribution of the labels is preserved in the splits. The highest performing model is Logistic Regression with an accuracy score of 60%. The boxplot in Figure 6 shows the accuracy of all models for 2020. As Logistic Regression was the highest performing model, the data was trained and tested on this model and the performance is shown in the confusion matrix for 2020 in Figure 7. This figure shows that Neutral Language was predicted correctly in 13 cases, Hate Speech 0 times and Offensive Language 3 times. It is highly likely that this is due to the size of the data, which consisted of only 72 sentences. The word n-grams most often correlated to hate speech are:
Top unigrams: deaths, far
Top trigrams: H1N1 Swine Flu, Failing New York
The table below shows the performance of Logistic Regression in terms of precision, recall and f1-score. It can be noted that for the labels hate speech and offensive language the output is in some instances zero, which means there were no true positives. A recall of 1 combined with low precision would mean that all tweets are assigned the label hate speech; a precision of 1 means no false positives, which here comes with a recall of 0.
precision recall f1-score
Neutral 0.73 0.79 0.76
HateSpeech 0.00 0.00 0.00
OffensiveLanguage 0.38 0.20 0.43
Figure 6: Boxplot accuracy scores 2020
Figure 7: Confusion matrix 2020
2019 Labeled Trump tweets
For 2019, 458 sentences were given to the models and 2968 features were extracted. For the three classes, the most correlated unigram, bigram and trigram features were extracted and weighted by their TF-IDF. Five-fold stratified cross-validation was used to train and test the models. The highest performing model is MultinomialNB with an accuracy of 61%. The boxplot in Figure 14 in Appendix B.2 shows the accuracy of all models for 2019. The data was tested on the highest performing model, MultinomialNB. The word n-grams for hate speech are attached in Appendix B.1. The table below shows the performance of MultinomialNB and Figure 19 displays the corresponding confusion matrix. It can be noted that Neutral was most often correctly predicted, with 80 examples, followed by OffensiveLanguage with 5 examples; Hate Speech was predicted correctly in 1 example. In the table it can be noted that for hate speech a precision of 33% is achieved, which corresponds with a low false positive rate. The recall of 6% for this label is low, as it implies that only a small share of the tweets in the actual class was retrieved. Lastly, the f1-score is the harmonic mean of precision and recall (F1 = 2PR / (P + R)); here it is low at 11%.
precision recall f1-score
Neutral 0.56 0.99 0.71
HateSpeech 0.33 0.06 0.11
OffensiveLanguage 0.83 0.09 0.16
Lastly, the misclassified examples were displayed. This output shows that ’OffensiveLanguage’ was predicted as ’Neutral’ in 49 examples, while hate speech was predicted as Neutral in 14 cases.
2018 Labeled Trump tweets
For 2018, 166 sentences were given to the models and 834 features were extracted. Word n-grams were extracted and stratified five-fold cross-validation was applied. The highest performing model was LinearSVC with an accuracy of 64%. The boxplot in Figure 15 in Appendix B.2 shows the accuracy of all models for 2018. The table below shows the performance of LinearSVC and Figure 20 displays the corresponding confusion matrix. It shows that Neutral Language was predicted correctly in 33 cases, Offensive Language 3 times and Hate Speech 5 times. It can be noted that for hate speech the f1-score is higher than for the previous years, which corresponds with the correctly predicted cases of Hate Speech.
precision recall f1-score
Neutral 0.79 0.97 0.87
HateSpeech 1.00 0.42 0.59
OffensiveLanguage 0.38 0.33 0.35
2017 Labeled Trump tweets
For 2017, 109 sentences were given to the models and 296 features were extracted. The highest performing model is MultinomialNB with an accuracy of 55%. The boxplot in Figure 16 in Appendix B.2 shows the performance of the different models and Figure 21 in Appendix B.3 shows the confusion matrix. Hate Speech was correctly predicted 0 times, Offensive Language in 4 cases and Neutral Language 15 times. In the table below the metrics are displayed, which show no true positives for Hate Speech:
precision recall f1-score
Neutral 0.48 1.00 0.65
HateSpeech 1.00 0.00 0.00
OffensiveLanguage 0.80 0.36 0.50
2016 Labeled Trump tweets
For 2016, 122 sentences were given to the models and 291 features were extracted. The highest performing model was MultinomialNB with an accuracy score of 53%. The boxplot in Figure 17 in Appendix B.2 shows the accuracy scores of all models and Figure 23 in Appendix B.3 shows the confusion matrix. Hate Speech was predicted correctly in 7 cases, Offensive Language in 14 cases and Neutral in 0 cases. The table below shows the metrics:
precision recall f1-score
Neutral 1.00 0.00 0.00
HateSpeech 0.50 0.47 0.48
OffensiveLanguage 0.52 0.78 0.62
2015 Labeled Trump tweets
For 2015, 104 sentences were fed to the models and each of these tweets is represented by 234 features. The highest performing model is LinearSVC with an accuracy of 46%. A boxplot of the scores is shown in Figure 18 in Appendix B.2 and the corresponding confusion matrix, Figure 22, in Appendix B.3. Offensive Language was predicted correctly 6 times, Hate Speech 5 times and Neutral Language 4 times. Offensive Language was wrongly predicted as Neutral in 12 cases. The table below shows the metrics:
precision recall f1-score
Neutral 0.21 0.80 0.33
HateSpeech 0.83 0.45 0.59
OffensiveLanguage 0.60 0.32 0.41
4.2 Results BERT Base

2020 Labeled Trump tweets
For 2020, as the difference between the training and test data was not significant, the data was used to train the model. The maximum sentence length is 83; 74 training samples and 5 validation samples were created. After training, the outcome of the epochs showed that the training loss and validation loss were high in the first epoch and lower in the fourth, but that the validation loss was slightly lower than the training loss. This could be a result of over-regularizing the model. The table below shows the relevant metrics:

Epoch  Accuracy  F1-score  Avg. training loss  Validation loss
1      0.29      0.21      1.14                1.11
2      0.57      0.24      1.00                0.98
3      0.57      0.24      0.99                0.97
4      0.43      0.20      0.94                0.99
The years 2016 to 2019 are included in Appendix B.4 BERT. Only 2020 and 2015 are shown in this section, as these years lie furthest apart and were therefore compared.
2015 Labeled Trump tweets
For 2015, the output after exploring the data is not significantly different from the whole dataset and therefore suitable for training the model. The maximum sentence length is 49. Furthermore, 83 training samples and 10 validation samples were created. After training, the outcome of the epochs showed:

Epoch  Accuracy  F1-score  Avg. training loss  Validation loss
1      0.50      0.22      1.11                0.94
2      0.60      0.44      0.93                0.89
3      0.60      0.44      0.86                0.84
4      0.60      0.44      0.78                0.82
It can be noted that the accuracy scores for 2015 are slightly higher than those for 2020, but that the difference in f1-score is more notable. The difference between the average training loss and the validation loss remains similar.
5 DISCUSSION
As described in Section 3, a human-annotated dataset can be generated from tweets in which bias is taken into consideration. This is possible by means of guidelines in combination with expert annotators. When focusing on the wrongly predicted examples, it is difficult to say whether these are a result of personal bias. Examples of HATE SPEECH predicted as NEUTRAL were:
- We are winning big time against China. Companies jobs
are fleeing. Prices to us have not gone up and in some cases
have come down. China is not our problem though Hong
Kong is not helping. Our problem is with the Fed. Raised too
much too fast.
- Wall Street Journal: More migrant families crossing into the U.S. illegally have been arrested in the first five months of the federal fiscal year than in any prior full year. We are doing a great job at the border but this is a National Emergency!
With regards to the scope for text-classification, it can be said that sentence-level is a suitable scope for hate speech detection, both for the baseline models, which predicted sentences of HATE SPEECH, and for BERT Base Uncased, since it can be used for sequence classification.
Furthermore, regarding the baseline models, it can be said that their performance on the separate years was rather low. The best performing baseline models for generating a prediction result for the years 2015 to 2020 were MultinomialNB and Logistic Regression. The highest results for HATE SPEECH were obtained for the year 2019, with an accuracy of 61% and a precision of 70% with MultinomialNB. Since a high precision corresponds with a low false positive rate, the model returns in this case more relevant than irrelevant results. For 2020, however, where only 72 sentences were fed to the model, a precision and recall of <.0001 were returned for the label HATE SPEECH. This means that there were no true positives and all results were predicted wrongly, which makes sense when taking into account the small size of the dataset. As the data was labelled as HATE SPEECH, OFFENSIVE and NEUTRAL, it can be said that for the baseline models most data was predicted to be NEUTRAL.
Finally, about the performance of BERT it can be said that it is slightly better than that of the baseline models, but worse when taking into account the difference between training loss and validation loss. For example, for 2015 LinearSVC gives a score of 50% and BERT a score of 60% in the fourth epoch, but with an average training loss smaller than the validation loss. This could mean that the model is overfitting, which means it would not generalize well to data it has not seen before.
6 CONCLUSION
It can be concluded that running a small dataset on both the
baseline models and the Google Transformer model BERT
does not lead to outstanding performance of the models. It
does show however, that the presence of political hate speech
can be detected in a small dataset, but that the models cannot
be generalized based on this study alone. Therefore, future
work is needed, especially with regards to bias in existing
datasets.
In terms of generalization of the model, it would have been better to have more data for comparison, for example to compare Trump to another prominent figure or politician with regards to hate speech and see how well the model performs in that case. For instance, corpora of both Trump’s and Obama’s tweets, labeled according to the three classes, could be compared to each other. Another way to measure how well the corpus used for this study performs in comparison to other datasets would be to run the dataset of Davidson et al. [7] through these models and compare the results.
A further limitation concerns the corpus of political tweets. With regards to the results, it can be stated that the sample might have been too small to generalize the results. The reason for this small sample, however, was that in this way the level of HATE SPEECH could be measured for each separate year, and therefore a conclusion could be drawn about the presence and development of hate in Donald Trump’s tweets over the course of six years, before and during his presidency.
Finally, using keywords to crawl a database, in combination with human annotation, to create a corpus of hateful language could be a fruitful method for future research on hate speech detection, provided personal bias is taken into account. These steps might not only help to reduce bias, but can also offer linguistic insights into methods for detecting online hate.
ACKNOWLEDGEMENTS
I would like to thank Rodolfo Ramirez and Mason Levy from Swivl for kindly giving me the opportunity, free of charge, to use part of their platform for human annotation. This made the task of annotation slightly easier for myself and my
coders. Secondly, I would like to thank Lester van der Pluijm
for providing me with helpful and extensive feedback during
the last phase of this thesis. Lastly, I am grateful to Reshmi
Gopalakrishna Pillai for supervising me on this project and
providing me with technical knowledge and suggestions.
REFERENCES
[1] Valerio Basile et al. “SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter”. In: SemEval@NAACL-HLT. 2019.
[2] Nina Bauwelinck et al. “LT3 at SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter (hatEval)”. In: 2019, pp. 436–440. doi: 10.18653/v1/s19-2077. url: http://evalita.org.
[3] Shanita Biere. “Hate Speech Detection Using Natural Language Processing Techniques”. Master Business Analytics, Vrije Universiteit Amsterdam (2018), p. 30.
[4] Robert Alan Brookey and Brian L. Ott. Trump and Twitter 2.0. Jan. 2019. doi: 10.1080/15295036.2018.1546052.
[5] Alexander Brown. “What is so special about online (as compared to offline) hate speech?” In: Ethnicities 18.3 (June 2018), pp. 297–326. issn: 17412706. doi: 10.1177/1468796817709846.
[6] Thomas Davidson, Debasmita Bhattacharya, and Ingmar Weber. “Racial Bias in Hate Speech and Abusive Language Detection Datasets”. In: (May 2019). url: http://arxiv.org/abs/1905.12516.
[7] Thomas Davidson et al. “Automated hate speech detection and the problem of offensive language”. In: Proceedings of the 11th International Conference on Web and Social Media, ICWSM 2017. Mar. 2017, pp. 512–515. isbn: 9781577357889. url: http://arxiv.org/abs/1703.04009.
[8] Gunn Enli. “Twitter as arena for the authentic outsider: exploring the social media campaigns of Trump and Clinton in the 2016 US presidential election”. In: European Journal of Communication 32.1 (Feb. 2017), pp. 50–61. issn: 14603705. doi: 10.1177/0267323116682802.
[9] Santiago González-Carvajal and Eduardo C. Garrido-Merchán. “Comparing BERT against traditional machine learning text classification”. In: (May 2020).
[10] Sameer Hinduja and Justin W. Patchin. “Connecting Adolescent Suicide to the Severity of Bullying and Cyberbullying”. In: Journal of School Violence 18 (2019), pp. 333–346.
[11] Jack Holland and Ben Fermor. “Trump’s rhetoric at 100 days: contradictions within effective emotional narratives”. In: Critical Studies on Security 5.2 (May 2017), pp. 182–186. issn: 2162-4887. doi: 10.1080/21624887.2017.1355157.
[12] Muhammad Okky Ibrohim and Indra Budi. “Multi-label Hate Speech and Abusive Language Detection in Indonesian Twitter”. In: (2019), pp. 46–57. doi: 10.18653/v1/w19-3506.
[13] Jean-Marie Kamatali. “The U.S. First Amendment Tradition and Article 19”. In: (2013).
[14] Ramona Kreis. “The “Tweet Politics” of President Trump”. In: Journal of Language and Politics 16.4 (Oct. 2017), pp. 607–618. issn: 1569-2159. doi: 10.1075/jlp.17032.kre.
[15] Robin T. Lakoff. “The hollow man”. In: Journal of Language and Politics 16.4 (Oct. 2017), pp. 595–606. issn: 1569-2159. doi: 10.1075/jlp.17022.lak.
[16] Shervin Malmasi and Marcos Zampieri. “Challenges in discriminating profanity from hate speech”. In: Journal of Experimental and Theoretical Artificial Intelligence 30.2 (Mar. 2018), pp. 187–202. issn: 13623079. doi: 10.1080/0952813X.2017.1409284. url: http://arxiv.org/abs/1803.05495.
[17] Shervin Malmasi and Marcos Zampieri. “Detecting hate speech in social media”. In: International Conference Recent Advances in Natural Language Processing, RANLP. Dec. 2017, pp. 467–472. isbn: 9789544520489. doi: 10.26615/978-954-452-049-6-062. url: http://arxiv.org/abs/1712.06427.
[18] Ricardo Martins et al. “Hate speech classification in social media using emotional analysis”. In: Proceedings - 2018 Brazilian Conference on Intelligent Systems, BRACIS 2018. IEEE, Dec. 2018, pp. 61–66. isbn: 9781538680230. doi: 10.1109/BRACIS.2018.00019.
[19] Marzieh Mozafari, Reza Farahbakhsh, and Noël Crespi. “A BERT-Based Transfer Learning Approach for Hate Speech Detection in Online Social Media”. In: Studies in Computational Intelligence 881 SCI (2020), pp. 928–940. issn: 18609503. doi: 10.1007/978-3-030-36687-2_77.
[20] Karsten Müller and Carlo Schwarz. From Hashtag to Hate Crime: Twitter and Anti-Minority Sentiment. Tech. rep. 2019. url: www.carloschwarz.eu.
[21] Sashaank Pejathaya Murali. “Detecting Cyber Bullies on Twitter using Machine Learning Techniques”. In: International Journal of Information Security and Cybercrime 6.1 (2017), pp. 63–66. issn: 22859225. doi: 10.19107/ijisc.2017.01.07.
[22] John T. Nockleby. Hate Speech. In: Encyclopedia of the American Constitution, 2000, p. 2.
[23] Brian L. Ott. The age of Twitter: Donald J. Trump and the politics of debasement. Jan. 2017. doi: 10.1080/15295036.2016.1266686.
[24] David Robinson, Ziqi Zhang, and Jonathan Tepper. “Hate speech detection on twitter: Feature engineering v.s. feature selection”. In: Lecture Notes in Computer Science. Vol. 11155 LNCS. Springer Verlag, 2018, pp. 46–49. isbn: 9783319981918. doi: 10.1007/978-3-319-98192-5_9.
[25] Stephen Rushin and Griffin Sims Edwards. “The Effect of President Trump’s Election on Hate Crimes”. In: SSRN Electronic Journal (2018). doi: 10.2139/ssrn.3102652. url: https://ssrn.com/abstract=3102652.
[26] Neil J. Salkind (ed.). Encyclopedia of Research Design. SAGE Publications, 2010.
[27] Joni Salminen et al. “Developing an online hate classifier for multiple social media platforms”. In: Human-centric Computing and Information Sciences 10.1 (2020), pp. 1–34. issn: 21921962. doi: 10.1186/s13673-019-0205-6.
[28] Joni Salminen et al. “Online Hate Interpretation Varies by Country, But More by Individual: A Statistical Analysis Using Crowdsourced Ratings”. In: 2018 Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS) (2018), pp. 88–94.
[29] Anna Schmidt and Michael Wiegand. A Survey on Hate Speech Detection using Natural Language Processing. Tech. rep., pp. 1–10.
[30] Yiwen Tang and Nicole Dalzell. “Classifying Hate Speech Using a Two-Layer Model”. In: Statistics and Public Policy 6.1 (2019), pp. 80–86. issn: 2330443X. doi: 10.1080/2330443X.2019.1660285.
[31] Zeerak Waseem. “Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter”. In: 2016, pp. 138–142. doi: 10.18653/v1/w16-5618.
[32] Zeerak Waseem and Dirk Hovy. “Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter”. In: 2016, pp. 88–93. doi: 10.18653/v1/n16-2013. url: http://github.com/zeerakw/hatespeech.
[33] Matthew L. Williams et al. “Hate in the Machine: Anti-Black and Anti-Muslim Social Media Posts as Predictors of Offline Racially and Religiously Aggravated Crime”. In: The British Journal of Criminology 60.1 (2020), pp. 93–117. issn: 0007-0955.
Appendices
A DATA
A.1 Overview class balance
Figure 8: Distribution of the labels 2015
Figure 9: Distribution of the labels 2016
Figure 10: Distribution of the labels 2017
Figure 12: Distribution of the labels 2019
Figure 13: Distribution of the labels 2020
B RESULTS
B.1 Unigrams, bigrams & trigrams for hate speech

2019 Labeled Trump tweets
Top unigrams: Mexico, illegal
Top bigrams: United States, Human Traffickers
Top trigrams: Our Country FULL, strong immigration laws

2018 Labeled Trump tweets
Top unigrams: crime, Border
Top bigrams: terrorist attack, We need
Top trigrams: Mexico stop onslaught, Southern Border They

2017 Labeled Trump tweets
Top unigrams: refugees, countries
Top bigrams: MS 13, THE WALL
Top trigrams: help North Korea, President Xi China

2016 Labeled Trump tweets
Top unigrams: ISIS, https
Top bigrams: Crooked Hillary, pay wall
Top trigrams: Mexico pay wall, elected President going

2015 Labeled Trump tweets
Top unigrams: https, Muslim
Top bigrams: The media, WE NEED
Top trigrams: 11 Muslim Celebrations, Source 11 Muslim
B.2 Boxplots
Figure 14: Boxplot accuracy scores 2019
Figure 15: Boxplot accuracy scores 2018
Figure 16: Boxplot 2017
Figure 17: Boxplot 2016
Figure 18: Boxplot 2015
B.4 BERT

Results 2019
For 2019, the output after exploring the data is not significantly different from the whole dataset and therefore suitable for training the model. The maximum sentence length is
Figure 19: Confusion matrix 2019
Figure 20: Confusion matrix 2018
49. Furthermore, 83 training samples and 10 validation samples were created. After training, the outcome of the epochs showed:
Figure 21: Confusion matrix 2017
Figure 22: Confusion matrix 2015
Figure 23: Confusion matrix 2016
Epoch  Accuracy  F1-score  Avg. training loss  Validation loss
1      0.63      0.30      1.02                0.83
2      0.67      0.35      0.86                0.74
3      0.76      0.58      0.71                0.66
4      0.78      0.60      0.62                0.66

Results 2018
For 2018, the output after exploring the data is not significantly different from the whole dataset and therefore suitable for training the model. The maximum sentence length is 49. Furthermore, 83 training samples and 10 validation samples were created. After training, the outcome of the epochs showed:
Epoch  Accuracy  F1-score  Avg. training loss  Validation loss
1      0.63      0.30      1.02                0.83
2      0.67      0.35      0.86                0.74
3      0.76      0.58      0.71                0.66
4      0.78      0.60      0.62                0.66

Results 2017
For 2017, the output after exploring the data is not significantly different from the whole dataset and therefore suitable for training the model. The maximum sentence length is 83. Furthermore, 88 training samples and 10 validation samples were created. After training, the outcome of the epochs showed:
Epoch  Accuracy  F1-score  Avg. training loss  Validation loss
1      0.60      0.29      1.16                1.00
2      0.70      0.41      1.01                0.93
3      0.60      0.29      0.96                0.94
4      0.60      0.29      0.96                0.92

Results 2016
For 2016, the maximum sentence length is 55 and, after the 90-10 train-validation split, 98 training samples and 11 validation samples were extracted. After training, the outcome of the epochs showed:
Epoch  Accuracy  F1-score  Avg. training loss  Validation loss
1      0.55      0.24      1.06                0.98
2      0.36      0.24      0.97                1.01
3      0.36      0.24      0.91                1.03
4      0.36      0.24      0.89                1.02

C ANNOTATION GUIDELINES