
Community-based Abuse Detection
Using Distantly Supervised Data and Biased Word Embeddings

G.H.R. (Gerben) Timmerman (s2769670)
Master Thesis Information Science
July 28, 2020

ABSTRACT

With the surge of user-generated content on the internet, we see an increasing number of messages being posted on social media. Alongside positive messages, negative messages that contain harassment, insults or other forms of abusive behavior are becoming more prevalent. NLP researchers are trying to tackle this problem by building models that can automatically detect abusive language. They use datasets such as OLID (Zampieri et al., 2019a), which often consist of messages that are collected using a keyword dictionary approach with words about a specific controversial topic. The main disadvantage of this approach is that it can lead to bias towards those topics.

A study by Ribeiro et al. (2018) shows that abusive users are densely connected and use non-trivial vocabulary. This led to the idea that if we could learn the language patterns of abusive users, we could make better distinctions between abusive and non-abusive language. By collecting data from known abusive communities and other ’normal’ non-abusive communities we can cover a wider variety of topics and could characterize the language of abusive users towards those topics.

Therefore, the goal of this thesis is to find out whether data coming from abusive communities and other non-abusive communities can be used for the detection of abusive language.

The community-based data are collected from hateful communities and ’normal’ subreddits on Reddit. With this data, we create distant datasets and generate task-specific polarized embeddings which are used to train abuse detection models. These models are tested both on an in-domain test set created in this research and on existing cross-domain test sets.

In conclusion, this study confirms that data coming from abusive and non-abusive communities can be used for the detection of abusive language. The results indicate that models learn to classify abuse from silver distant training data (even though they are still outperformed by smaller gold training data). Furthermore, models that use pre-trained biased abusive embeddings generated from this data show competitive results when compared against much larger pre-trained generic embeddings. Future research is encouraged to gather more varied data from additional abusive communities and to explore how models perform on a gold training set with messages gathered from abusive communities.

CONTENTS

Abstract
Preface
1 Introduction
2 Background
  2.1 What does Abusive Language Detection entail?
  2.2 Datasets can be biased towards explicit abuse
  2.3 Hateful Users use Similar Language
  2.4 Traditional models and Neural Network Architectures in Abuse Detection
  2.5 Distant Supervision
  2.6 Approaches with embeddings
3 Data and Material
  3.1 Collection
    3.1.1 Silver Data
      Abusive Data
      Non-abusive Data
    3.1.2 Gold Data
      OffensEval’19
      OffensEval’20
      AbusEval
  3.2 Annotation
    3.2.1 Creating a Gold Test Set from the Reddit Domain
      Annotation Guidelines
      Students
      Self
    3.2.2 Existing Cross-domain Test Sets from the Twitter Domain
      OffensEval’19 & OffensEval’20
      AbusEval
  3.3 Pre-trained Embeddings
    3.3.1 GloVe
    3.3.2 Creating Reddit Embeddings with fastText
4 Method
  4.1 Data Preparation
    4.1.1 Preprocessing
    4.1.2 Distant Training Sets
    4.1.3 Label Distribution
  4.2 Models & Evaluation
    4.2.1 Support Vector Machine model
    4.2.2 Bi-LSTM model with Attention Layer
    4.2.3 Training process of the models
    4.2.4 Evaluation
  4.3 Experiments
    4.3.1 Using distantly supervised data
    4.3.2 Testing performance of biased Reddit embeddings on gold data
    4.3.3 Supplementing gold data with distantly supervised data
5 Results
  5.1 Models Trained with Distantly Supervised Data
    5.1.1 Are we learning to detect abuse?
    5.1.2 Finding the best performing model
    5.1.3 Testing on cross-domain test sets
  5.2 Testing Effectiveness of Biased Abusive Embeddings on Gold Data
    5.2.1 Binary models
    5.2.2 Multi-class models
  5.3 Using Distant Data to Supplement Gold Data
6 Discussion and Conclusion
  6.1 Discussion
    6.1.1 Summarization of Key Findings
    6.1.2 Interpretation and Summarization of the Results
  6.2 Conclusion
    6.2.1 Limitations
    6.2.2 Recommendations for Future Research
Appendix A  List of Top 20 Abusive and Non-abusive Communities
  A.1 List of Top 20 Abusive Communities
  A.2 List of Top 20 Non-abusive Communities
Appendix B  Data Statement
Appendix C  Fleiss-kappa Scores for Groups of Annotators
Appendix D  Inspection of Created Reddit Embeddings
Appendix E  Label Distribution Scores
Appendix F  Specific Results
  F.1 Experiment 1: Training Models with Distantly Supervised Data
  F.2 Experiment 3: Training Models on Gold Data with Supplemental Distant Test Sets
    F.2.1 Binary Models

PREFACE

Before you lies my Master’s Thesis Community-based Abuse Detection: Using Distantly Supervised Data and Biased Word Embeddings. It focuses on detecting abusive language in messages with data coming from abusive and non-abusive communities. The thesis has been written to fulfill the graduation requirements of the Master Communication & Information Studies (Information Science) at the University of Groningen.

The study was undertaken during the last year of my study between the months of January and July 2020 amidst the corona crisis. My research questions were formed together with my supervisor, Tommaso Caselli. It was a very interesting journey and although it was quite difficult, I have learned a lot along the way.

I would like to thank my supervisor Tommaso Caselli who helped me with difficulties along the way in this research. He was always available and gave advice on how to approach some difficult aspects. Additionally, I would like to say thanks to my fellow student Louis de Bruijn who joined Tommaso and me during the weekly meetings. Without their support and advice, this research would not have been possible.

Furthermore, I want to thank Valerio Basile for sharing the abusive data and his code to extract data from the archived Reddit files.

I would like to thank Glenn van der Linde for proof-reading my thesis and for giving valuable feedback on the writing and cohesiveness.

At last, I would like to thank the Center for Information Technology of the University of Groningen for their support and for providing access to the Peregrine high performance computing cluster.

I hope you enjoy your reading.

Gerben Timmerman

Groningen, July 28th, 2020

1 INTRODUCTION

With the surge of user-generated content on the internet, we see an increasing number of messages being posted on social media. People from all over the world can connect with each other, engage in productive discussions and share information. Alongside these positive messages, negative messages that contain harassment, insults or other forms of abusive behavior are becoming much more prevalent. Kennedy et al. (2017) claim that it has become impossible to manually monitor every message that is being posted to a platform effectively, due to the slow reporting process and the reliance on humans. They plead to automate this process by using Natural Language Processing (NLP) methods, specifically machine learning models. These models can either help human moderators focus on the more relevant messages or even prevent abusive messages from being posted in the first place.

Researchers in the NLP field have tried to tackle this problem by organizing several shared tasks such as the FIRE 2019 HASOC track (Mandl et al., 2019), SemEval 2019 task 5 HatEval (Basile et al., 2019) and task 6 OffensEval (Zampieri et al., 2019b) that focus on building models that can improve the detection of offensive language. Datasets such as OLID (Zampieri et al., 2019a), which is used for OffensEval’19, often consist of messages that are collected using a keyword dictionary approach with words about a specific controversial topic. The main disadvantage of using this approach is that this could lead to bias towards those topics.

A study by Ribeiro et al. (2018) showed that abusive users are densely connected and use non-trivial vocabulary. This led to the idea that if we were to learn the language patterns of abusive users, we could make a better distinction between abusive and non-abusive language. By collecting data from abusive communities and other ’normal’ non-abusive communities, we can cover a wider variety of topics and language towards those topics than traditional dictionary-based approaches.

This thesis aims at improving the detection of abusive language through the means of creating models with distantly supervised data coming from hateful/abusive communities and other non-abusive communities. The messages coming from these two types of communities will be used to create distant training sets and for the production of task-specific polarized embeddings.

This leads to the investigation of the following main research question and sub-questions that help to answer the main question:

1. Can data collected from abusive and non-abusive communities be used to detect abusive language in messages?

[1a.] Can this be achieved by training models with a distantly supervised dataset?

[1b.] Do pre-trained biased embeddings for abusive language perform better than generic pre-trained embeddings?

[1c.] Does the combination of distantly supervised data and existing ’golden’ datasets get better results?

[1d.] How do the models perform on cross-domain test sets?

The data source most ideal for this research is Reddit (www.reddit.com). Reddit is a network of communities that are based on people’s interests. This includes abusive and non-abusive communities.

The hypothesis is that data from communities can be used to help in the detection of abusive language, be it in the way of a distantly supervised training set or through creating biased embeddings that provide more task-specific meaning to words than generic global pre-trained embeddings.

To answer the research questions, it is useful to know a bit more about the theoretical background, which is presented in the next chapter. Then a description of the data and material is provided. The method chapter gives a detailed overview of the data preparation, machine learning models, evaluation process and the three experiments conducted. Next, we present the results in the results chapter and thoroughly discuss them in a separate discussion chapter. At last, we provide a conclusion where we answer the research questions, discuss limitations and provide advice and recommendations for future research. All the code used in this research can be found on our GitHub.

2 BACKGROUND

This chapter presents background information and previous work in the area of abusive language detection. First, we go into more detail about what is considered to be abusive language. Secondly, we discuss a paper that goes more in-depth into the bias towards explicit abuse in popular datasets. The third section presents a study that claims abusive users use similar language patterns and explains how this can be a solution to the bias towards explicit abuse. Next, we discuss the performance of traditional baseline machine learning models compared to more recent, more complex deep neural network models. In the fifth section, we provide more information about distant supervision. At last, we elaborate on the impact of generic pre-trained embeddings and task-specific polarized embeddings in document classification and abusive language detection in general.

2.1 What does Abusive Language Detection entail?

Research in the field of abusive language detection is becoming more popular amongst the NLP community. This can be seen from the variety of subtasks related to abusive language detection that have emerged over the years. Tasks such as the detection of hate speech, cyberbullying and offensive language are examples of subtasks that have gained quite a lot of attention. Waseem et al. (2017) found that researchers group these tasks together under different terminologies like ’abusive language’, ’harmful speech’ and ’hate speech’ because they contain many similarities. However, they did not take into account the relationships between them, as each task addresses a different aspect of abusive language. Waseem et al. (2017) point out that these different terminologies led to contradictions between annotation guidelines. Hence, they propose a typology that makes it possible to create an annotation scheme that groups these different subtasks together. This scheme is built by reducing the differences in abusive language to two factors: (1) the direction of the language towards a specific individual or a generalized group; and (2) whether the abusive content is explicit or implicit. With the use of this annotation scheme, they clarified the overlapping and distinguishing aspects between the different subtasks. They say this can lead to a better consensus about the definition of abusive language and its subtasks. This in turn will help researchers select appropriate strategies for data annotation and modeling.

2.2 Datasets can be biased towards explicit abuse

Caselli et al. (2020) investigated a dataset that follows the typology of Waseem et al. (2017): the existing and popular dataset for offensive language detection, OLID/OffensEval (Zampieri et al., 2019a). In their analysis, Caselli et al. (2020) found that the content of the offensive labels was highly skewed towards explicit markers of offensive language. This raises a problem because explicit abuse is easier to find than implicit abuse. By using a dictionary with offensive terms we can easily detect messages that contain explicit abuse. However, implicit abuse can be disguised as ’normal’ language and be woven into the whole meaning of a sentence without the presence of any explicit markers. By building models that mainly focus on explicit abuse, messages with implicit abuse are more often not recognized by the model.

They manually annotated the messages from OLID/OffensEval with explicit and implicit labels (Caselli et al., 2020). This showed that, after annotation, the training and test set contained 65% and 59% explicit messages, respectively. Besides, they looked at two other datasets, from Waseem and Hovy (2016) and HatEval (Basile et al., 2019). For each dataset, they sampled 1000 abusive messages and found that the majority of them are realized by explicit markers (38% in Waseem and Hovy (2016b) and up to 98% in HatEval). Based on this result they call for reflection on and the development of new strategies for how datasets for offensive messages are built. One of the directions they advocate is a user-based approach instead of solely building on keywords (Caselli et al., 2020). The idea is to collect data from users that are more likely to use offensive/abusive language. This may reduce the bias towards explicit expressions and could improve the classification of more complex and implicit messages. This research tries to build on the suggestion of using a user-based approach.

2.3 Hateful Users use Similar Language

Ribeiro et al. (2018) did a study on characterizing and detecting hate speech by focusing on Twitter users. Their analysis presents a user-based approach, as suggested by Caselli et al. (2020), which exploits the retweet network and outperforms content-based approaches for the detection of hateful messages and suspended users (95% vs 88% AUC). Hateful users differ from normal users with respect to their word usage, user activity patterns and network centrality measurements (Ribeiro et al., 2018). They compared this behavior to data from suspended users and found similar results. Based on this result, we think that language used in hateful communities can help in detecting language patterns that are common in abusive messages uttered by abusive users. This way we might detect both more explicit and implicit abuse in messages regarding a certain topic.

2.4 Traditional models and Neural Network Architectures in Abuse Detection

Traditional models that use algorithms such as Support Vector Machines (SVM) and Logistic Regression are known to achieve decent results in abusive language detection; however, they are outperformed by neural network approaches. A study by Badjatiya et al. (2017) investigated the use of deep learning models for hate speech detection in tweets. Their task was to classify a tweet as racist, sexist or neither, using a dataset of 16K annotated tweets by Waseem and Hovy (2016). They experimented with multiple traditional baseline classifiers such as Logistic Regression, Random Forest, SVMs and Gradient Boosted Decision Trees (GBDTs), and with more recent and complex Deep Neural Networks (DNNs). For the baselines, they use feature spaces comprising character n-grams, TF-IDF vectors, and Bag of Words vectors (BoWV) using the average of the GloVe word embeddings (Pennington et al., 2014) to represent a sentence. The feature spaces for the DNNs are defined by either GloVe embeddings or by training randomly initialized task-specific embeddings. These were then used in three deep learning architectures: FastText (Joulin et al., 2016), Convolutional Neural Networks (CNNs) (Kim, 2014) and Long Short-Term Memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997). The classification models with their respective features were then tested in three parts. Part A consisted of the baseline methods. Part B contained only neural network methods. At last, part C used the average of word embeddings learned by DNNs as features for GBDTs. Badjatiya et al. (2017) concluded that the use of DNNs resulted in significantly better scores than the traditional baseline methods.

2.5 Distant Supervision

Supervised classification models tend to rely on training data that is annotated by humans. However, the messages in this research will not be labeled by humans, but by using a technique called distant supervision. In distant supervision, an already existing dataset is used to collect examples for the relation (label) that we want to extract. These examples are used to automatically generate our training data according to some filtering criteria. This is an effective way to create a large amount of training data that is both more scalable (portability to different languages or domains) and more versatile (time and resources needed to train) than humanly-annotated data in supervised learning models (Merenda et al., 2018). The training data will most likely contain a lot of noisy examples that do not represent the relation. This is why distant supervision is considered a weak supervision method that produces (silver) data, as opposed to humanly-annotated (gold) training data. Gold training data contains fewer bad examples that do not represent the relation and is hence of higher quality.

An example of distant supervision is presented in the paper of Go et al. (2009). They used emoticons to annotate positive and negative sentiment in Twitter messages. For example, a tweet that contains the token :) expresses positive sentiment and a tweet that contains :( expresses negative sentiment. They manually annotated a golden test set of 177 tweets with negative sentiment and 182 tweets with positive sentiment, a total of 359 tweets. They collected their test set in two steps. The first step made use of the Twitter API by searching with queries for different categories, the queries being company names, consumer products, locations, events, movies and people. The second step was looking at the results from the queries: if they saw sentiment in a tweet, they would annotate it as either positive or negative. This way the test set does not just contain tweets with emoticons. This distantly supervised approach achieved accuracy results above 80% with various machine learning algorithms like Naïve Bayes, SVM and MaxEnt (Maximum Entropy). In conclusion, Go et al. (2009) showed that it is possible to annotate tweets automatically with some filtering instructions and use the distantly annotated data to train models.
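To make the idea concrete, the following minimal Python sketch illustrates emoticon-based distant labeling in the spirit of Go et al. (2009); the emoticon sets, the example tweets and the noisy_label function are illustrative assumptions, not their actual pipeline.

# A toy distant-labeling rule in the spirit of Go et al. (2009): emoticons act as noisy labels.
POSITIVE_EMOTICONS = {":)", ":-)", ":D"}
NEGATIVE_EMOTICONS = {":(", ":-("}

def noisy_label(tweet):
    """Assign a distant sentiment label based purely on emoticon presence."""
    tokens = tweet.split()
    if any(tok in POSITIVE_EMOTICONS for tok in tokens):
        return "positive"
    if any(tok in NEGATIVE_EMOTICONS for tok in tokens):
        return "negative"
    return None  # no filtering criterion matched, so the tweet is discarded

tweets = ["great game tonight :)", "my flight got cancelled :(", "on my way home"]
silver_data = [(t, noisy_label(t)) for t in tweets if noisy_label(t) is not None]
print(silver_data)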

2.6 Approaches with embeddings

Probably the biggest jump when moving from traditional linear models with sparse inputs to deep neural networks is to stop representing each feature as a unique dimension and instead represent features as dense vectors (Goldberg, 2015). Instead of using discrete representations, words are stored in a feature space that represents each word by a lower-dimensional dense vector (a.k.a. an embedding). These embeddings are a type of word representation that allows words with similar meaning to have a similar representation.

D’Sa et al. (2020) performed toxic speech classification using two relatively recent and powerful pre-trained word representations: fastText (Bojanowski et al., 2016) and BERT (Devlin et al., 2018) embeddings. These embedding representations were used as inputs to DNN classifiers, namely CNN and Bi-LSTM classifiers. The dataset they used comes from Davidson et al. (2017) and contains messages collected from Twitter. They experimented with both binary (Toxic/Not toxic) and multi-class (Hate Speech/Offensive Language/Neither) classification. For the multi-class classification, they wanted to distinguish toxic speech more finely into hate speech and abusive speech. This approach resulted in impressive macro-F1 scores: the scores for the binary classification task range between 90.9% and 91.9% and for the multi-class task between 70.9% and 72.4%. D’Sa et al. (2020) show that implementing embeddings into DNNs can result in powerful models for the task of abuse detection.

A thesis by Graumans (2019) aimed to create an effective way to automatically detect abusive language in online user content, using polarized word embeddings from controversial social media data as feature space to better model phenomena like offensive language, hate speech and other forms of abusive language. The data were collected from Twitter, Reddit, 4chan and 8chan. The collected data were then used to generate generic and polarized word embeddings using the GloVe (Pennington et al., 2014) and fastText (Bojanowski et al., 2016) embedding algorithms. Next, he built a Bi-LSTM network which, in combination with the polarized fastText Reddit embeddings, outperformed other source-driven representations, as well as the generic pre-trained fastText embeddings. Graumans (2019) shows that smaller task-specific pre-trained embeddings from abusive data can outperform larger generic embeddings.

3 DATA AND MATERIAL

In order to build a machine learning model, a large amount of training data is needed to make accurate predictions. First, an elaboration is given on the collection process of distantly supervised (silver) data from abusive and non-abusive communities that are present on Reddit. Subsequently, the collection of humanly supervised (gold) data will be discussed. In the second section, we dive deeper into the annotation process of the aforementioned data. At last, more information about the pre-trained embeddings is given.

3.1 Collection

3.1.1 Silver Data

The silver data is created with distant supervision. We make use of already existing datasets coming from the medium Reddit. Reddit is a network of communities that are based on people’s interests. This includes abusive and non-abusive communities. The Reddit data is published openly and can be found on pushshift.io (https://files.pushshift.io/reddit/comments/). The datasets contain comments that have been posted during a month of a particular year. These messages represent the examples from which we want to extract a relation (label) using filtering criteria. The distantly assigned relations are the Abusive (explicit/implicit) and Non-abusive labels. We use abusive communities to assign the Abusive label and non-abusive communities to assign the Non-abusive label.

Abusive Data

Finding toxic and abusive communities proved to be a difficult task. After doing some investigative work, we found a couple of lists with banned subreddits that proved to be insightful and usable for the extraction of abusive communities (see https://en.wikipedia.org/wiki/Controversial_Reddit_communities#Banned_subreddits, https://www.reddit.com/r/Reddit_Hate_Groups/comments/3e1hmm/the_chimpire/ and https://www.reddit.com/r/reclassified/comments/cglpsq/list_of_all_known_banned_subreddits_sorted/). The abusive subreddits used in this research were then manually narrowed down from these sources to the list shown in Table 1.

Training Communities Test Communities

holocaust pol nazi misogyny milliondollarextreme hitler fatpeoplehate niggas sjwhate beatingfaggots europeannationalism niggerrebooted uncensorednews kike killniggers niglets killthejews polacks far_right niggervideos niggerspics chimpmusic teenapers chicongo apewrangling niggerstories gibsmedat muhdick funnyniggers niggerhistorymonth didntdonuffins blackpeoplehate klukluxklan whitesarecriminals

Table 1: List of abusive communities used in the creation of the distant training data and test data.

The data consists of messages collected between 2012 and 2017. The messages are distantly annotated based on a lexicon containing offensive terms from Wiegand et al. (2018). This lexicon contains words with an associated offensiveness score, and we only take terms into account with a score above the limit of 0.75. When an offensive term is found in a Reddit comment, the message gets labeled as explicit. If not, the message gets labeled as implicit. In total 2.803.506 comments are collected from 34 unique subreddit communities, of which 2.152.541 are labeled as Implicit and 650.965 as Explicit. The subreddits provide an average of 82.456 messages in the training set. The number of messages per abusive subreddit ranges from a minimum of 1 to a maximum of 1.465.531. The 20 subreddits with the most messages can be found in Appendix A.1.
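The labeling rule described above can be summarized in a short sketch. The lexicon file name and its column layout are assumptions for illustration; only the threshold of 0.75 and the explicit/implicit decision follow the procedure described here.

# Distant labeling of comments from abusive subreddits: EXPLICIT if a lexicon term with an
# offensiveness score above 0.75 occurs in the comment, IMPLICIT otherwise.
OFFENSIVENESS_THRESHOLD = 0.75

def load_offensive_terms(path):
    """Read 'term<TAB>score' lines (assumed layout) and keep terms above the threshold."""
    terms = set()
    with open(path, encoding="utf-8") as lexicon:
        for line in lexicon:
            term, score = line.rstrip("\n").split("\t")[:2]
            if float(score) > OFFENSIVENESS_THRESHOLD:
                terms.add(term.lower())
    return terms

def distant_label(comment, offensive_terms):
    tokens = set(comment.lower().split())
    return "EXPLICIT" if tokens & offensive_terms else "IMPLICIT"

offensive_terms = load_offensive_terms("offensive_lexicon.tsv")  # hypothetical file name
print(distant_label("some reddit comment text", offensive_terms))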

Non-abusive Data

In order to get a more varied dataset of non-abusive messages, the data is collected from the same time period, 2012 until 2017. For each year within that time span, 250.000 comments are extracted for each of the following months: January, April, July and October. This amounts to a total of 6.000.000 Reddit comments. The comments are collected from subreddits that are not present in the abusive dataset. The non-abusive data are collected from 28.523 unique subreddits. Every message/comment has been posted in a subreddit; therefore, the total number of subreddit occurrences is equal to the number of documents. The subreddits provide an average of 101.39 messages in the training set. The number of messages per subreddit ranges from a minimum of 1 to a maximum of 276.204. The 20 subreddits with the most messages can be found in Appendix A.2.

3.1.2 Gold Data

Gold data is data that is annotated by humans, providing a dataset of higher-quality instances as opposed to silver data, which is annotated automatically using filtering criteria and is therefore more prone to errors.

        OffensEval’19      OffensEval’20      AbusEval
        NOT      ABU       NOT      ABU       NOT      ABU
train   8.840    4.400     na.      na.       10.491   2.749
dev     na.      na.       na.      na.       na.      na.
test    620      240       2.807    1.080     682      178
total   14.100             3.887              14.100

Table 2: Label distribution for the gold datasets OffensEval’19, OffensEval’20 and AbusEval

Multi-class AbusEval
        Non-abusive   Implicit   Explicit
Train   10.491        726        2.023
Dev     na.           na.        na.
Test    682           106        72

Table 3: Data distribution for AbusEval with multi-class labels (NOT, IMP and EXP)

OffensEval’19

OffensEval 2019 uses the Offensive Language Identification Dataset (OLID) introduced by Zampieri et al. (2019a), which has been created specifically for the detection of offensive language. This data builds on the typology proposed by Waseem et al. (2017), as described in Section 2.1. The dataset consists of 14.100 tweets, of which 13.240 were used to create the training set and 860 the test set. The label distribution of the training set (NOT/OFF) is 8.840 and 4.400, respectively. The test set has 620 non-offensive and 240 offensive messages. This training set can be used for training models for binary abusive language classification.

OffensEval’20

OffensEval’20 is the official test set for the SemEval 2020 Task 12: OffensEval shared task (Zampieri et al., 2020). The test set follows the same annotation scheme as proposed in the OLID dataset from Zampieri et al. (2019a). This scheme will be further discussed in Section 3.2.2. The test set is composed of 3.887 messages, of which 2.807 are categorized as non-offensive and 1.080 as offensive.

AbusEval

The dataset from AbusEval (Caselli et al., 2020) is an enriched version of the OLID dataset (Zampieri et al., 2019a) used in OffensEval’19. The researchers noticed that there is a lack of datasets that take into account the degree of explicitness. They propose new annotation guidelines to make better distinctions between explicit and implicit abuse in English and applied them to OLID/OffensEval. This led to the creation of the AbusEval dataset, in which they manually annotated offensive messages with explicit and implicit labels. The dataset contains 14.100 messages, of which 13.240 were used to create the training set and 860 the test set. The training data consists of 10.491 non-abusive and 2.749 abusive (726 implicit, 2.023 explicit) messages. AbusEval can be used for training and testing binary and multi-class classification models.

3.2 Annotation

In this section, we provide more details about the annotation processes used for the creation of the test sets used in this thesis. Testing is done on a test set coming from the same domain as the training data to show the models’ effectiveness on similar data. Additionally, we want to see how the models perform on data coming from different sources. Therefore, we also chose to include test sets from different domains in the testing process. The distribution of the test sets can be found in Tables 4 and 5.

3.2.1 Creating a Gold Test Set from the Reddit Domain

Since the training data is coming from the Reddit domain, we need to build a test set with gold data from the same domain to test the performance of the models. This test set will be called Gold Reddit and will have a distribution of 33.33% abusive labels and 66.66% non-abusive labels. The Data Statement for the annotators can be found in Appendix B.

Test set         Abusive   Non-abusive   Total
Gold Reddit      854       1.708         2.562
OffensEval’19    240       620           860
OffensEval’20    1.080     2.807         3.887
AbusEval         178       682           860

Table 4: Label distribution for the binary test sets

Test set         Non-abusive   Implicit   Explicit   Total
Gold Reddit      1.708         383        471        2.562
AbusEval         682           106        72         860

Table 5: Label distribution for the multi-class test sets

Annotation Guidelines

The annotations for the Gold Reddit test set are conducted using three possible label values: EXPLICIT, IMPLICIT and NOT. The focus of the annotation is on the content of the message, taking into account how the message is realized. The level of annotation focuses on the assumed intention of the user as well as the effect on the receiver.

• A message is marked as EXPLICIT if it is interpreted as potentially abusive (intention of the speaker to debase/offend; effect of the message on the receiver), it contains profanity or a slur, and it has a target (a person or a member of a group).

• A message is marked as IMPLICIT if it is interpreted as potentially abusive (intention of the speaker to debase/offend; effect of the message on the receiver), it DOES NOT contain profanity or a slur, and it has a target (a person or a member of a group).

• A message is marked as NOT if it is not interpreted as abusive. It can be aggressive but it does not aim at debasing the receiver. It may not have a target.

Table 6 lists some examples of Reddit messages and their corresponding labels.

Message                                                    Label
best thread ever! my belly hurts                           NOT
Future single mother                                       IMP
go jump of a cliff and die painfully you son of a bitch    EXP
I am extremely satisfied. also moar winning plz            NOT
you have serious mental issues LOL                         IMP
this sub’s bigorty and this video are unrelated, faggot    EXP

Table 6: Examples of messages in the test set with their corresponding label

Students

A total of 40 students were divided into 12 groups, where each group consists of 3-5 persons. They annotated randomly selected messages from non-abusive communities coming from the following months:


• February, 2016

• May, 2017

• August, 2017

These months are selected to make sure that none of the data in the test set occurs in the training set. Each annotator had to label 50 common messages that are shared between all members of a group and 100 individual messages. In total 4.746 messages were annotated by the students.

With the collection of common messages, the Fleiss-kappa score (Fleiss et al., 1979) was calculated, which measures the agreement between the members of a group of variable size. This has been done for each group and the results are stored in Appendix C. The Fleiss-kappa score determines how reliable the annotations are within a particular group.
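As an illustration of how such a score can be computed, the following sketch uses the Fleiss kappa implementation from statsmodels on a made-up set of ratings; the actual input consists of the 50 common messages per group.

# Fleiss-kappa agreement for one annotation group (made-up ratings for illustration).
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

LABELS = {"NOT": 0, "IMP": 1, "EXP": 2}
# rows = common messages, columns = annotators in the group
ratings = np.array([
    ["NOT", "NOT", "NOT"],
    ["IMP", "IMP", "EXP"],
    ["EXP", "EXP", "EXP"],
    ["NOT", "IMP", "NOT"],
])
numeric = np.vectorize(LABELS.get)(ratings)

table, _ = aggregate_raters(numeric)  # per-message counts of each category
print("Fleiss kappa:", fleiss_kappa(table))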

The labels for the common messages are chosen based on majority voting and the messages are then combined with the individual messages and labels. This leads to the following distribution:

• 388 abusive messages (244 implicit and 144 explicit)

• 4.358 non-abusive messages

Based on this distribution we make an interesting observation, namely, that there are very few abusive messages being posted in non-abusive communities.

Self

To make Gold Reddit a more substantial test set with a 33/66 distribution, more abusive labels are needed. By using a subset from the list of abusive subreddits for testing and excluding them from training, we collect test data that is more likely to be abusive. The messages coming from these subreddits in the subset were then manually annotated. By excluding this subset from the training data we make sure that the models are not learning these communities. In total, 139 implicit and 327 explicit messages were added to the test set. To get the 33/66 ratio we had to exclude some of the non-abusive messages. We end up with the following distribution for the Gold Reddit test set:

• 854 abusive messages (383 implicit and 471 explicit)

• 1.708 non-abusive messages

3.2.2 Existing Cross-domain Test Sets from the Twitter Domain

OffensEval’19 & OffensEval’20

The OffensEval’19 and OffensEval’20 data consist of tweets that are annotated for offensive content following the same hierarchical annotation scheme. This scheme is made out of the following three layers:

A. Offensive Language Detection

B. Categorization of Offensive Language
C. Offensive Language Target Identification

Layer A determines whether a tweet contains offensive language (OFF) or not (NOT). The second layer, B, categorizes the type of offense, whether it is targeted (at a group/individual) or not. If a tweet is targeted, the third layer, C, determines who the message is targeting (Individual, Group or Other). Since the focus lies on detecting abuse from messages, only the labels from layer A are used.


AbusEval

The annotations from AbusEval (Caselli et al., 2020) are an additional, more fine-grained annotation layer on top of the OLID (Zampieri et al., 2019a) dataset. It takes the first layer, which specifies whether abusive language is present, and evaluates whether the abusive messages contain explicit (EXP) or implicit (IMP) abuse. This opens up a way to build models that can make distinctions between non-abusive, implicit and explicit messages.

3.3 Pre-trained Embeddings

In our models, we use pre-trained word embeddings. These are generated based on a large dataset and capture the semantic and syntactic meaning of a word. They save us a tremendous amount of training time and computation costs when doing experiments. In this thesis, we want to compare the performance of more generic global embeddings with self-generated biased embeddings.

3.3.1 GloVe

For the generic, more global embeddings, we choose the well-known pre-trained Global Vectors for Word Representation (GloVe) embeddings produced by Pennington et al. (2014). GloVe takes advantage of the global context. The model learns word vectors from their co-occurrence information, i.e. how frequently they appear together in the given corpus (Graumans, 2019). The GloVe embeddings used in this research are trained on a common crawl of web data. These embeddings are built from 840 billion tokens and a cased vocabulary of 2.2 million words, processed into 300-dimensional vectors.
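A minimal sketch of loading these vectors into a lookup table is shown below; the file name follows the standard Stanford GloVe release and is an assumption about the exact file used.

# Load the pre-trained GloVe Common Crawl vectors into a word -> vector lookup table.
import numpy as np

def load_glove(path="glove.840B.300d.txt", dim=300):
    vectors = {}
    with open(path, encoding="utf-8") as glove_file:
        for line in glove_file:
            parts = line.rstrip().split(" ")
            word = " ".join(parts[:-dim])  # some tokens in this release contain spaces
            vectors[word] = np.asarray(parts[-dim:], dtype=np.float32)
    return vectors

glove_vectors = load_glove()
print(glove_vectors["language"].shape)  # (300,)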

3.3.2 Creating Reddit Embeddings with fastText

For the generation of biased embeddings that are based on data coming from abusive and non-abusive communities, we make use of the fastText algorithm and library created by Facebook’s research team for efficient learning of word representations and sentence classification (Bojanowski et al., 2016). FastText is based on the skipgram model introduced by Mikolov et al. (2013); it segments words into subword units (character n-grams) and learns embeddings for these substrings. The embedding of a word is then represented by the sum of those substring vectors.

Bojanowski et al. (2016) mention that one advantage of fastText is that it always returns a vector, even if it comes across an unknown word that is not present in the training data. This results in better coverage of words.

We will make two separate fastText embedding models that are trained on the data coming from the Reddit communities. The first model is trained on data coming from abusive communities. The purpose of this model is to be more sensitive to and show bias towards abusive language uttered in messages. The second model is trained on data coming from non-abusive (normal) communities. The idea of using non-abusive embeddings is to capture the different meanings of words when they are uttered by non-abusive users. All default parameters for training the models are kept, with a few exceptions which are shown in Table 7. The models are built following the skipgram model (Mikolov et al., 2013) and words are represented in a 300-dimensional dense vector space. The parameter wordNgrams, which represents the maximum length of word n-grams, is set to 2. Additionally, all words with a minimum frequency of 1 are taken into account during the training process.

Parameters    Values
model         skipgram
minCount      1
wordNgrams    2
dim           300

Table 7: Parameters used in the fastText model for the creation of Reddit embeddings.
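A sketch of the training calls with the parameters from Table 7 is shown below; the input and output file names are placeholders, assuming one preprocessed comment per line.

# Train the abusive and non-abusive Reddit embeddings with the parameters from Table 7.
import fasttext

params = dict(model="skipgram", minCount=1, wordNgrams=2, dim=300)

abusive_model = fasttext.train_unsupervised("abusive_comments.txt", **params)
non_abusive_model = fasttext.train_unsupervised("non_abusive_comments.txt", **params)

abusive_model.save_model("reddit_abusive_300d.bin")
non_abusive_model.save_model("reddit_non_abusive_300d.bin")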

Reddit fastText embeddings    Words   Vocabulary
Abusive communities           174M    511.409
Non-abusive communities       200M    1.300.342

Table 8: Total amount of processed words and word vocabulary size for the creation of fastText embeddings from Reddit communities

Table 8 shows that we end up with 174 and 200 million words scanned for the abusive and non-abusive communities, respectively. The abusive embeddings end up with a vocabulary of 511.409 and the non-abusive embeddings with a much larger vocabulary of 1.300.342. This can be explained by the fact that we collected data from a larger number of unique communities. Therefore, the topics vary more and this leads to a larger vocabulary.

We did a small intrinsic evaluation in which we investigate the quality of our Reddit embeddings when asked to retrieve the nearest neighbors of relevant words that are often associated with abuse. By looking at the nearest neighbors, we can tell whether they are similar to the target word or token, or what relation they have to it. Three examples of such target words are presented in Tables 9 and 10.

Words Nearest neighbours from abusive fastText embeddings

black (0.7780448198318481, ’white’), (0.7417296171188354, ’black-black’), (0.7213969230651855, ’Black’), (0.7143608927726746, ’blackwhite’), (0.7065100073814392, ’blackss’), (0.7049232125282288, ’365black’), (0.6940803527832031, ’us-black’), (0.6887308359146118, ’blacks’), (0.6869428157806396, ’black-African’), (0.685642659664154, ’ablack’) democrats (0.9266827702522278, ’democrats.and’), (0.9262681603431702, ’emocrats’), (0.8747307062149048, ’Democrats’), (0.8668286204338074, ’democratscum’), (0.8546227216720581, ’democratcy’), (0.8490886688232422, ’democrats-party’), (0.835618793964386, ’democrats–the’), (0.8285972476005554, ’re-publicans’), (0.8265460729598999, ’democrat’), (0.8237749338150024, ’dimocrats’) republicans (0.8874382376670837, ’Republicans’), (0.8776662945747375, ’todaysRepublicans’), (0.8592983484268188, ’republican’), (0.8497137427330017, "republican’s"), (0.839995801448822, ’re-publicannns’), (0.8374894261360168, ’Ebola-Republicans’), (0.8350982069969177, ’reublicans’), (0.8348020911216736, ’republicanazis’), (0.8308549523353577, ’Trumpublicans’), (0.8285977244377136, ’democrats’)

Table 9: Inspection of the abusive Reddit embeddings for the target words black, democrats and republicans by looking at their nearest neighbours.


Words Nearest neighbours from non-abusive fastText embeddings

black (0.8524320721626282, ’blackxblack’), (0.8397740721702576, ’white’), (0.8042157888412476, ’grey-black’), (0.7941656112670898, ’Pdxblack’), (0.787697970867157, ’black-n-white’), (0.7872133255004883, ’black-vs-white’), (0.7849933505058289, ’white-black’), (0.7780194282531738, ’brown-’white-black’), (0.7761123776435852, ’not-black’), (0.7745722532272339, ’huge–black’) democrats (0.9227805733680725, ’democrats–in’), (0.9174736142158508, ’dwmocrats’), (0.9117075204849243, "democrat’s"), (0.91120445728302, ’republicans’), (0.9030013084411621, ’pro-democrats’), (0.9023128747940063, ’Democrats’), (0.9010171294212341, ’democrat’), (0.885067880153656, ’democrates’), (0.8808586001396179, ’epublicans’), (0.8779354095458984, ’republicans–or’) republicans (0.9606856107711792, ’Яepublicans’), (0.9531849026679993, ’republicans–or’), (0.9333056211471558, ’Republicans’), (0.9323707818984985, "tax’-Republicans"), (0.9239442944526672, ’megarepublicans’), (0.9237703084945679, ’republican’), (0.9183163642883301, "republican’s"), (0.9112045168876648, ’democrats’), (0.9099470973014832, ’non-republicans’), (0.9021103382110596, ’republicants’)

Table 10: Inspection of the non-abusive Reddit embeddings for the target words black, democrats and republicans by looking at their nearest neighbours.

When we take the target word ’black’, we see that in the biased abusive Reddit embeddings it relates to race and skin color, whereas in the ’normal’ generic Reddit embeddings it relates to the color black. Furthermore, for the biased Reddit embeddings, the nearest neighbors of the US political parties (Republicans and Democrats) include nicknames that are used to belittle them. More results can be found in Appendix D.
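The inspection itself can be reproduced with a few lines; the model file names below are the placeholders from the training sketch above.

# Query the nearest neighbours of a few abuse-related target words in both embedding models.
import fasttext

abusive_model = fasttext.load_model("reddit_abusive_300d.bin")
non_abusive_model = fasttext.load_model("reddit_non_abusive_300d.bin")

for word in ["black", "democrats", "republicans"]:
    print(word)
    print("  abusive:    ", abusive_model.get_nearest_neighbors(word, k=10))
    print("  non-abusive:", non_abusive_model.get_nearest_neighbors(word, k=10))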

4 METHOD

This chapter presents the methods used to answer the research questions. First, we provide information about the data preparation for the training and test data. In the second section, the models are described and more details are given about the training process and evaluation metrics of the models. At last, we describe the setup of the experiments with the aforementioned models.

4.1 Data Preparation

4.1.1 Preprocessing

In order to get as much information from the data as possible, the messages have to go through a couple of preprocessing steps. Woo et al. (2020) evaluated multiple combinations of preprocessing types in a text-processing neural network model. They conclude that building predictive models from refined data can help improve the performance of a model. They found that the use of normalization techniques such as lowering and punctuation splitting increased performance scores, and recommend using these techniques in the creation and training process of a model. In addition to these suggestions, we incorporated five other preprocessing steps for the data. Username handles, URLs and numbers are converted to the tokens @USER, <URL> and <NUMBER>, respectively. This prevents infrequent tokens, with little to no information, from occurring in the vocabulary and helps us generalize better. Reddit supports the use of emojis and markdown language. Emojis can express positive and negative emotions; by converting these emojis to their respective text form, we can use this extra information as features. Additionally, Reddit’s markdown language is removed from the messages. At last, the message is tokenized. The same preprocessing steps are applied to the data used for creating the Reddit fastText embeddings.

Pre-processing      Old Tokens                  New Tokens
Username handles    @name @name                 @USER @USER
URLs                https://reddit.com          <URL>
Numbers             10 93430                    <NUMBER> <NUMBER>
Emojis to text      ,                           :happy:
Markup cleaner      **bold text**               bold text
Lowering            Donald Trump                donald trump
Punctuation         She is smart, but ugly!!    She is smart but ugly

Table 11: Examples of the preprocessing steps taken for the creation of the Reddit embeddings and filtering of training data.
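A minimal sketch of these steps is given below; the regular expressions and the emoji conversion are approximations of the described steps, not the exact implementation used in this thesis.

# Approximate preprocessing pipeline corresponding to the steps in Table 11.
import re
import emoji  # pip install emoji

def preprocess(text):
    text = text.lower()                            # lowering
    text = re.sub(r"@\w+", "@USER", text)          # username handles
    text = re.sub(r"https?://\S+", "<URL>", text)  # URLs
    text = re.sub(r"\b\d+\b", "<NUMBER>", text)    # numbers
    text = emoji.demojize(text)                    # emojis to their :text: form
    text = re.sub(r"(\*\*|\*|__|~~|`)", "", text)  # strip simple Reddit markdown
    text = re.sub(r"[^\w<>@:\s]", " ", text)       # punctuation splitting/removal
    return text.split()                            # tokenization

print(preprocess("Check https://reddit.com **Donald Trump** got 10 votes @name"))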

4.1.2 Distant Training Sets

First, the distant data are preprocessed and training sets of different sizes are created and made ready to use as input for the models. These training sets are composed of messages and labels. Each new, larger training set contains the same messages that are in the smaller training sets, but with more data added to them. For example, we start with a distant set of 6000 messages. The next training set will contain 12000 messages: the same comments that were used in the previous distant training set of 6000, plus an additional 6000 new comments. The sizes of the various distant training sets and the models they apply to are shown in Table 12.

Models           SVM      Bi-LSTM
Training sizes   6000     12000
                 12000    48000
                 24000    96000
                 48000
                 96000

Table 12: Sizes used for the distant training sets for the SVM and Bi-LSTM models

If data are chosen randomly and put into training sets of different sizes, there is no way to tell whether the change in performance comes from the increasing amount of data or from the different composition of messages in the training set. By keeping the messages from previous smaller distant training sets in the larger training sets, we can investigate whether the models are actually performing better because of the increasing amounts of data.
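The construction of these nested sets can be sketched as follows; the comment pool and the sampling seed are placeholders.

# Build nested distant training sets: each larger set is a superset of the smaller ones.
import random

def build_nested_training_sets(comments, sizes=(6000, 12000, 24000, 48000, 96000), seed=42):
    rng = random.Random(seed)
    pool = list(comments)
    rng.shuffle(pool)
    # every set is a prefix of the same shuffled pool, so only the amount of data changes
    return {size: pool[:size] for size in sizes}

labeled_comments = [f"comment_{i}" for i in range(100000)]  # placeholder for distant data
training_sets = build_nested_training_sets(labeled_comments)
assert set(training_sets[6000]) <= set(training_sets[12000])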

4.1.3 Label Distribution

In order to determine the label distribution of our distantly supervised training data, 5-fold cross-validation was run on batches of data with two kinds of distributions which are listed in Table 13. The first training set has a balanced distribution of 50/50 in terms of abusive and non-abusive labels. The second set is slightly more biased towards the abusive labels with a 66.66/33.33 distribution.

                 Batch sizes                         Abusive              Non-abusive
                                                     Explicit   Implicit  Not
Distribution 1   6000, 12000, 24000, 48000, 96000    25%        25%       50%
Distribution 2   6000, 12000, 24000, 48000, 96000    33.33%     33.33%    33.33%

Table 13: Distributions of the labels for each batch

Three different SVM models, similar to the SVM model described in Section 4.2.1, each with its own feature extraction method, are trained using these datasets. The first model uses a TF-IDF vectorizer, the second uses embeddings from the self-made pre-trained Reddit embeddings and the last model uses embeddings generated from the pre-trained GloVe embeddings. In addition to the 5-fold cross-validation, two distantly supervised test sets, both with the same distributions as shown in Table 13, are introduced. The reasoning behind this is to investigate how the training sets perform on test sets with the same or a different distribution. These two extra test sets consist of abusive and non-abusive subreddits that were excluded from the training data. The macro-F1 score is then calculated for each test fold and for the two distantly supervised test sets. After the cross-validation, the scores of the test folds and the two test sets are averaged to get reliable results. The detailed results can be found in Appendix E.
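A condensed sketch of this cross-validation set-up, shown here for the TF-IDF variant only, could look as follows; X and y are placeholders for one distantly labeled batch.

# 5-fold cross-validation of the TF-IDF SVM on one distantly labeled batch (placeholder data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X = ["you are wonderful", "go away you idiot", "nice weather today", "what a stupid take"] * 50
y = ["NOT", "EXP", "NOT", "IMP"] * 50

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), SVC())
fold_scores = cross_val_score(model, X, y, cv=5, scoring="f1_macro")
print("mean macro-F1 over folds:", fold_scores.mean())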

The macro-F1 scores of each batch for the distantly supervised test sets keep increasing the more data we feed into the models. Based on this interesting observation, we can assert that the models are actually learning how to classify abusiveness across communities and not solely the communities that are in the training data. It is difficult to tell based on these results which label distribution would be better suited for our experiments. A common conception for datasets is that they have to try to mimic real life; since non-abusive messages occur more often than abusive messages, the more balanced dataset would then be the better choice. A study by Swamy et al. (2019) questions this notion and shows that models trained on a dataset with a distribution of more non-abusive messages than abusive messages do not generalize as well to new sentences compared to models with a higher distribution of abusive messages. They suggest that a dataset with a bias towards abusive messages is better suited for the task of abusive language detection. Therefore, this study chooses a distant training distribution of 66.66% abusive messages and 33.33% non-abusive messages for the experiments.

4.2 Models & Evaluation

This study implements a traditional and a neural network model for the detection of abusive language. The traditional model uses the Support Vector Machine (SVM) algorithm and the neural model is a Bidirectional Long Short-Term Memory network with an attention layer (Bi-LSTM).

4.2.1 Support Vector Machine model

The SVM is a specific type of supervised ML method that intends to classify data points by maximizing the margin among classes in a high-dimensional space (Pereira et al., 2009). The optimal algorithm is developed through a “training” phase, in which training data are adopted to develop an algorithm capable of discriminating between groups earlier defined by the operator, and a ’testing’ phase, in which the algorithm is adopted to blind-predict the group to which a new observation belongs (Orru et al., 2012). An SVM represents examples as points in space, mapped so that the examples of the separate classes are divided by a gap that is as wide as possible. The SVM classifies or separates the classes using the hyperplane concept (Zohra and D.R., 2019). New data is anticipated to reside in a category based on which side of the gap it falls on. The goal of an SVM is to find the best hyperplane that splits the data points into two components by maximizing the margin. This makes the SVM classifier a binary classifier, and therefore multiple SVMs are needed to perform the task of multi-class abuse detection. Methods for applying SVMs to multi-class classification problems decompose the multi-class problem into many binary-class problems and combine many binary-class SVMs (Cheong et al., 2004).

The traditional model chosen for this study is an SVM and was built using the SVC classifier from the scikit-learn framework (Pedregosa et al., 2011). We do not change any default parameters. While preprocessing the data, one way to create features is to convert text sequences into a Bag of Words (BOW) and normalize them with Term Frequency-Inverse Document Frequency (TF-IDF). We experiment with word-level features using n-grams ranging from 1 to 2. In addition, we use embeddings to retrieve features from the text. The words in a message are converted to embeddings coming from either GloVe or our own generated Reddit embeddings. The average of these embeddings functions as a representation of the message.
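A sketch of the embedding-based variant is shown below: each message is represented by the average of its word vectors and fed to an SVC with default parameters. The toy lookup table stands in for the GloVe or Reddit embeddings.

# SVM on averaged word embeddings: each message becomes the mean of its word vectors.
import numpy as np
from sklearn.svm import SVC

def average_embedding(message, vectors, dim=300):
    found = [vectors[tok] for tok in message.split() if tok in vectors]
    return np.mean(found, axis=0) if found else np.zeros(dim, dtype=np.float32)

# toy lookup table standing in for the GloVe or Reddit embeddings
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=300).astype(np.float32) for w in "you are an idiot nice day".split()}

train_messages = ["you are an idiot", "nice day", "an idiot", "nice you"]
train_labels = ["ABU", "NOT", "ABU", "NOT"]
X_train = np.vstack([average_embedding(m, vectors) for m in train_messages])

clf = SVC()  # default parameters, as in the SVM model described above
clf.fit(X_train, train_labels)
print(clf.predict(np.vstack([average_embedding("you idiot", vectors)])))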

4.2.2 Bi-LSTM model with Attention Layer

As our neural network model, we use a Bi-LSTM model in combination with an attention layer, as proposed by Yang et al. (2016). The LSTM was first introduced by Hochreiter and Schmidhuber (1997). It uses cell states to process new information. This way, an LSTM can use its input and forget gates to learn and forget selectively for shorter and longer periods of time, without modifying all the existing information. LSTM models handle input sequentially and therefore take word order into account. We combine this with a bidirectional model, which allows us to process the messages both forwards and backwards. The attention mechanism focuses on the words that have the most impact on classifying the corresponding label and gives them more weight.

In order to improve the quality of the data before feeding it into the network, some preprocessing steps are needed.

[Figure: (a) Binary Bi-LSTM model; (b) Multi-class Bi-LSTM model]

Figure 1: Visualization of a binary and multi-class Bi-LSTM model with stacked Reddit embeddings layer

The model is written with the Keras library (Chollet et al., 2015) and is shown in Figure 1. As input for our Bi-LSTM, we use a padded vector representation of the preprocessed messages based on the word indexes coming from the tokenizer. A zero-padding algorithm is applied to the vectors such that every message has a maximum length of 150 tokens. To initialize our embedding layer, we use the vocabulary of the training data, which leads to a matrix of size s * d, where s is the number of unique words in the training data and d is the embedding space dimension. We also use a zero-padding method here, to ensure that every word index in the vocabulary is assigned a vector of the same dimension. To make sure that the embedding layer does not read these zeros as a vector but as a padding value, we use the mask_zero parameter of the embedding layer. Depending on the experiment described in Section 4.3, the embedding layer of the neural model has to be initialized with the correct weights for the vocabulary. Experiments 1 and 3 use either the 300-dimensional pre-trained GloVe embeddings or the stacked 600-dimensional Reddit embeddings that consist of the 300-dimensional abusive embeddings and the 300-dimensional non-abusive embeddings. Experiment 2 only uses the 300-dimensional pre-trained GloVe embeddings and the 300-dimensional abusive Reddit embeddings. The embedding weights are not updated during training, since that might lead to overfitting (Karpathy and Fei-Fei, 2017). For each word in a message, the Bi-LSTM model combines its previous hidden state and the current word’s embedding to compute a new hidden state. Next, the information is fed to the attention layer as introduced by Yang et al. (2016). This layer puts more emphasis on the most informative words in the message and gives these more weight. This information is then fed to a dense layer of 64 neurons with ReLU activation, after which it is sent to the output layer.

The output layer’s setup is determined by the classification task. For a binary classification model the output layer is made out of a single neuron with sigmoid activation that returns a value between 0 and 1. For the multi-class model, the output layer contains 3 neurons where each neuron represents a possible class. These neurons have a softmax activation function that returns probabilities for all three classes.

The binary and multi-class models use binary cross-entropy and categorical cross-entropy loss functions, respectively. Both use Adam as the optimizer.
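As an illustration, a minimal Keras sketch of this architecture is given below. The simplified attention layer only approximates the Yang et al. (2016) mechanism, and everything not stated in the text (the number of LSTM units, the vocabulary size, the dummy all-zero embedding matrix) is an assumption.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

MAX_LEN, EMB_DIM = 150, 600  # padded message length and stacked Reddit embedding size

class Attention(layers.Layer):
    """Simplified word-level attention in the spirit of Yang et al. (2016)."""
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.supports_masking = True

    def build(self, input_shape):
        self.w = self.add_weight(name="att_w", shape=(input_shape[-1], 1),
                                 initializer="glorot_uniform")
        super().build(input_shape)

    def call(self, states, mask=None):
        scores = tf.squeeze(tf.tensordot(states, self.w, axes=1), -1)  # (batch, time)
        if mask is not None:                                           # ignore padded steps
            scores += (1.0 - tf.cast(mask, scores.dtype)) * -1e9
        weights = tf.nn.softmax(scores, axis=1)
        return tf.reduce_sum(states * tf.expand_dims(weights, -1), axis=1)

    def compute_mask(self, inputs, mask=None):
        return None  # the time dimension is collapsed after attention

def build_model(embedding_matrix, num_classes=1, lstm_units=64):
    inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
    x = layers.Embedding(input_dim=embedding_matrix.shape[0],
                         output_dim=embedding_matrix.shape[1],
                         weights=[embedding_matrix],
                         mask_zero=True, trainable=False)(inputs)  # frozen weights
    x = layers.Bidirectional(layers.LSTM(lstm_units, return_sequences=True))(x)
    x = Attention()(x)
    x = layers.Dense(64, activation="relu")(x)
    if num_classes == 1:   # binary: one sigmoid neuron, binary cross-entropy
        outputs = layers.Dense(1, activation="sigmoid")(x)
        loss = "binary_crossentropy"
    else:                  # multi-class: softmax over the classes
        outputs = layers.Dense(num_classes, activation="softmax")(x)
        loss = "categorical_crossentropy"
    model = Model(inputs, outputs)
    model.compile(optimizer="adam", loss=loss, metrics=["accuracy"])
    return model

# Dummy embedding matrix of shape (vocabulary size + 1, 600); row 0 is the padding vector.
model = build_model(np.zeros((50000 + 1, EMB_DIM)))
```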

4.2.3 Training process of the models

This study deals with a considerable number of different settings and parameters for our models. Therefore, we decided to run the code for the experiments on the Peregrine HPC cluster of the University of Groningen. The cluster can be used by scientists from the university and is useful for solving computational problems for which a single PC is not powerful enough. The machines can be used to submit a large number of jobs, where every job contains different settings for the parameters.

To decrease the time necessary for testing the models on the test sets, we save the tokenizer and a checkpoint of the models. This way, we can use the same tokenizer that is fitted on the training set for the test sets, so that the word indexes of the vocabulary do not change. The SVM model is saved after it has processed all of the training data. Checkpoints are used for saving neural networks and are a snapshot of the model architecture and all the weight values at a certain stage. For the first epoch that is trained, we save our model architecture and weights to a checkpoint. For every successive epoch, the weights in the file are updated, but only if the validation loss on the development set is lower than that of the saved epoch. Although the epoch with the lowest validation loss often does not result in the highest validation macro-F1 score, we expect that the model with the lowest loss achieves the highest score on unseen data.
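A sketch of how the tokenizer and the best checkpoint can be saved with Keras is shown below. The file names, the number of epochs and the batch size are placeholders, and `tokenizer`, `model` and the training/development arrays are assumed to come from the preprocessing and model-building steps above.

```python
import pickle
from tensorflow.keras.callbacks import ModelCheckpoint

# Persist the tokenizer fitted on the training set, so the test sets are later
# indexed with exactly the same vocabulary.
with open("tokenizer.pickle", "wb") as handle:
    pickle.dump(tokenizer, handle)

# Overwrite the checkpoint only when the validation loss on the development
# set improves on the best epoch saved so far.
checkpoint = ModelCheckpoint("bilstm_checkpoint.h5", monitor="val_loss",
                             mode="min", save_best_only=True, verbose=1)

model.fit(X_train, y_train,
          validation_data=(X_dev, y_dev),
          epochs=10, batch_size=32,
          callbacks=[checkpoint])
```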

4.2.4 Evaluation

To determine whether community-based data can be used for the detection of abusive language, we perform an extrinsic evaluation in which we compare the scores of the traditional models to those of the more complex neural models. The models are run on the different test sets, after which we evaluate their performance using the macro-F1 score, which treats all possible classes as equally important. Additionally, we look at the precision and recall scores of all labels to find out how the models perform on the abusive and non-abusive labels.
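In practice, this evaluation can be carried out with scikit-learn, for example as follows; `y_true` and `y_pred` are placeholders for the gold and predicted labels of a test set.

```python
from sklearn.metrics import classification_report, f1_score

# Per-class precision, recall and F1 for the abusive and non-abusive labels.
print(classification_report(y_true, y_pred, digits=4))

# Macro-F1 treats all classes as equally important, regardless of their size.
print("Macro-F1:", f1_score(y_true, y_pred, average="macro"))
```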

4.3 experiments

4.3.1 Using distantly supervised data

The purpose of the first experiment is to find out whether distantly supervised data from communities can be used effectively for the identification of abusive and non-abusive language.

The experiment is set up as follows: first, the messages from the distant training sets are cleaned and preprocessed before features can be generated from them. Then, three approaches are used to extract features from the words in the messages:

1. TF-IDF vectorizer

2. Word embeddings extracted from the pre-trained GloVe embeddings

3. Word embeddings extracted from the combined Reddit embeddings (abusive & non-abusive)

The Reddit embeddings are the combination of the abusive embeddings and the non-abusive embeddings, which both have 300 dimensions. When these embeddings are stacked on top of each other, they create a 600-dimensional vector for the representation of a word. After the extraction of features, the models are trained for either binary (Abusive, Non-abusive) or multi-class (Explicit, Implicit and Non-abusive) classification. By splitting the abusive class into explicit and implicit abuse, we can investigate whether the models are able to make distinctions between these types of abuse. First, an initial test is done with an SVM with a TF-IDF vectorizer to see whether the scores for binary and multi-class classification improve the more data is fed into the model. Based on these results, we choose the training size which we will use for the SVM and Bi-LSTM models that incorporate either GloVe or Reddit embeddings. These models will be evaluated on the Gold Reddit test set, which comes from the same Reddit domain. Based on the results on the Gold Reddit test set, we choose the best performing binary model that will be used to test on cross-domain test sets. The cross-domain test sets are OffensEval’19, AbusEval and OffensEval’20, which contain messages coming from the Twitter domain.

4.3.2 Testing performance of biased Reddit embeddings on gold data

The second experiment focuses on evaluating the performance of the biased abusive Reddit embeddings. In this experiment, the biased Reddit embeddings will be compared with the more global GloVe embeddings. A difference from the first experiment is that in this experiment we solely look at the Reddit embeddings coming from the abusive communities. By training the models on gold data, we reduce the chance of poor data quality being a factor for a potentially large difference in results.

The experiment is set up as follows: instead of using the distant training sets, we use the gold training sets from OffensEval’19 or AbusEval as input for our models. Since OffensEval’19 only uses two labels, the models built with this training set are only suitable for binary classification of abusive language. Messages from AbusEval are annotated with three labels and can, therefore, be used for both binary and multi-class classification.

After the messages go through the previously mentioned processing steps, features are extracted in the form of 300-dimensional word embeddings coming from either the GloVe embeddings or the abusive Reddit embeddings. These features are then fed into the SVM and the neural network for training. The evaluation is done on the test sets from OffensEval’19, AbusEval and OffensEval’20, which are all in-domain. Based on the difference in these scores, we can argue whether the use of biased embeddings is better or worse compared to the more global GloVe embeddings for the task of abusive language detection. Additionally, we test the models trained with gold data on Gold Reddit to see whether gold training data outperforms distant training data.

4.3.3 Supplementing gold data with distantly supervised data

The third experiment aims at discovering whether gold data combined with distantly supervised data can be used to increase performance compared to solely using distant training sets. The gold data is meant to guide the models in the ’right’ direction when more distant data gets added to the model. The training data is partly composed of gold data coming from AbusEval; the other part comes from distant training sets of the sizes described in section 4.1.2. The data goes through the preprocessing steps and features are extracted in the form of word embeddings with the 600-dimensional Reddit embeddings. These embeddings are fed into the following four models (a sketch of how the gold and distant data can be combined is given after the list below):

• Binary: SVM and Bi-LSTM

• Multi-class: SVM and Bi-LSTM

Testing will be done on the Gold Reddit, OffensEval’19, AbusEval and OffensEval’20 test sets.
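As referenced above, the following is a minimal sketch of how the gold AbusEval data could be supplemented with a distant Reddit sample before training. The file names, column layout and sample size are illustrative assumptions.

```python
import pandas as pd

# Hypothetical files with a 'text' and a 'label' column.
gold_df = pd.read_csv("abuseval_train.tsv", sep="\t")
distant_df = pd.read_csv("reddit_distant_train.tsv", sep="\t")

# Add a distant sample of the desired size to the gold data and shuffle,
# so that training batches mix both sources.
distant_size = 24000  # illustrative; section 4.1.2 lists the sizes actually used
combined = pd.concat([gold_df, distant_df.sample(n=distant_size, random_state=42)],
                     ignore_index=True)
combined = combined.sample(frac=1, random_state=42).reset_index(drop=True)
```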

5 results

In this chapter we will analyse the results of the experiments presented in section 4.3, using the methodology explained in sections 4.1 and 4.2.

5.1 models trained with distantly supervised data

5.1.1 Are we learning to detect abuse?

First, we ran some initial experiments with distantly supervised data. Figure 2 shows the results of training a baseline SVM with a TF-IDF vectorizer with various training sizes ranging from 6000 up to 96000. By first checking the performance on a TF-IDF vectorizer, we can determine whether models are actually learning to classify abuse from the distantly supervised data.

Figure 2: Macro-F1 scores for the binary and multi-class SVM models with TF-IDF vectorizer. The models are trained with distantly supervised training data and tested on the Gold Reddit test set.

Based on the results from Figure 2, we see that the scores improve the more data is fed into the models during the training phase. The macro-F1 score of the binary SVM model starts at a low 0.3057 when training with a distant training size of 6000, but increases to 0.5627 when trained on 96000 Reddit messages. For the multi-class model, we also see an increase from 0.4059 to 0.5042. A distant training size of 96000 Reddit messages, coming from abusive and non-abusive communities, achieved the best results in this initial experiment. Therefore, we continue to report scores using this size for the distant data. For more detailed scores of the trained models, we refer to Appendix F.1.


5.1.2 Finding the best performing model

Tables 14 and 15 show how the SVM and Bi-LSTM models with GloVe and stacked Reddit (abusive + non-abusive) embeddings are performing on the Gold Reddit test set.

Binary Model                   Class   P     R     Macro-F1
SVM + GloVe Embeddings         NOT     .91   .49   .6286
                               ABU     .47   .90
SVM + Reddit Embeddings        NOT     .89   .61   .6825
                               ABU     .52   .84
Bi-LSTM + GloVe Embeddings     NOT     .88   .52   .6271
                               ABU     .47   .85
Bi-LSTM + Reddit Embeddings    NOT     .87   .65   .6948
                               ABU     .54   .82

Table 14: Results of binary classification models trained with a distantly supervised training set of 96000 Reddit messages. Tested on the Gold Reddit test set. (P = Precision, R = Recall)

Multi-class Model              Class   P     R     Macro-F1
SVM + GloVe Embeddings         NOT     .86   .64   .5388
                               IMP     .20   .35
                               EXP     .55   .73
SVM + Reddit Embeddings        NOT     .85   .72   .5689
                               IMP     .23   .35
                               EXP     .60   .71
Bi-LSTM + GloVe Embeddings     NOT     .86   .65   .5647
                               IMP     .23   .45
                               EXP     .62   .68
Bi-LSTM + Reddit Embeddings    NOT     .84   .64   .5511
                               IMP     .22   .43
                               EXP     .59   .68

Table 15: Results of multi-class classification models trained with a distantly supervised training set of 96000 Reddit messages. Tested on the Gold Reddit test set. (P = Precision, R = Recall)

We see from Table 14 that all models achieve a high precision score for detecting non-abusive messages, but lack recall for this class. On the other hand, the models recall a high number of abusive messages, but their precision on the abusive class is low. This could be because the models end up being too sensitive towards classifying a message as abusive: when the models do classify a message as non-abusive, they make very few mistakes. Similar patterns occur for the multi-class models in Table 15. Non-abusive messages get slightly lower precision scores compared to the binary classification models; however, the recall is higher in all models. The multi-class models still seem to have trouble detecting implicit abuse, since we only get precision scores of around 0.21.

Additionally, we observe that the models with a combination of abusive and non-abusive Reddit embeddings outperform the GloVe embeddings in the binary classification task when looking at the macro-F1 scores. This also goes for the SVM model in the multi-class task. However, here the Bi-LSTM model with GloVe embeddings performs better than the model with the Reddit embeddings (0.5647 vs 0.5511).

The best performing binary model is the Bi-LSTM with Reddit embeddings, which achieves a macro-F1 score of 0.6948 on the Gold Reddit test set. Interestingly enough, the best multi-class model is the SVM with Reddit embeddings, which outperforms the more complex Bi-LSTM models with a macro-F1 score of 0.5689.


5.1.3 Testing on cross-domain test sets

Table 16 lists the precision, recall and macro-F1 results of the binary Bi-LSTM model with Reddit embeddings trained on a distant training set of 96000 messages. The model achieves a macro-F1 score of 0.5712 on OffensEval’19, 0.5299 on AbusEval and 0.4686 on OffensEval’20. When looking at the precision and recall scores, we observe a pattern across all datasets. The precision for the non-abusive label is again very high, but the model recalls less than half of the non-abusive messages. For the abusive class, it recalls a great deal of abusive messages, but the precision is quite low. From this, we can conclude that the model fed with distant training data does not perform well on cross-domain test sets.

Test Set         Class   P     R     Macro-F1
OffensEval’19    NOT     .89   .47   .5712
                 ABU     .38   .84
AbusEval         NOT     .92   .47   .5299
                 ABU     .30   .85
OffensEval’20    NOT     .74   .42   .4686
                 ABU     .54   .82

Table 16: Detailed results of the binary Bi-LSTM with Reddit embeddings trained on a distant training set of 96000 messages. Evaluated on the OffensEval’19, AbusEval and OffensEval’20 test sets. (P = Precision, R = Recall)

5.2 testing effectiveness of biased abusive embeddings on gold data

5.2.1 Binary models

For the second experiment, we wanted to find out what the effect is when we use gold data in combination with biased abusive embeddings and compare it with the more global, generic GloVe embeddings. The models are tested on the OffensEval’19, AbusEval, OffensEval’20 and Gold Reddit test sets. Tables 17 and 19 show the final results for the binary and multi-class models, respectively.

Binary Model                     Training Set    OFF’19   AE    OFF’20   GR
SVM + GloVe embeddings           OffensEval’19   .74      .68   .52      .75
                                 AbusEval        .54      .59   .47      .56
SVM + Abusive embeddings         OffensEval’19   .71      .68   .52      .72
                                 AbusEval        .51      .55   .46      .51
Bi-LSTM + GloVe embeddings       OffensEval’19   .79      .72   .52      .78
                                 AbusEval        .70      .74   .50      .72
Bi-LSTM + Abusive embeddings     OffensEval’19   .79      .73   .52      .77
                                 AbusEval        .71      .74   .51      .69

Table 17: Macro-F1 results of binary classification models trained with gold data from OffensEval’19 or AbusEval. Evaluated on the OffensEval’19, AbusEval, OffensEval’20 and Gold Reddit test sets (OFF’19 = OffensEval’19, AE = AbusEval, OFF’20 = OffensEval’20, GR = Gold Reddit).

In Table 17 we observe that binary models with biased Abusive Reddit embeddings are getting similar macro-F1 scores when compared to models with GloVe embeddings. Models trained on gold data coming from OffensEval’19 are performing better on the OffensEval’19 and OffensEval’20 test sets than models trained on AbusEval, which only perform better on the AbusEval test set. The best performing binary model is a Bi-LSTM with Abusive Reddit embeddings trained on the
