
Abusive Language Detection in Online User Content using Polarized Word Embeddings

L.R.N. (Leon) Graumans (s2548798)
Supervisor: Tommaso Caselli
Master thesis Information Science
July 30, 2019

Abstract

Anytime one engages online, there is always a serious risk that he or she may be the target of toxic and abusive speech. To combat such behaviour, many internet companies have terms of service that typically forbid hateful and harassing speech on their platforms. However, the increasing volume of online data requires that ways are found to classify online content automatically.

In this work, we present a way to automatically detect abusive language in online user content. We create polarized word embeddings from controversial social media data to better model phenomena like offensive language, hate speech and other forms of abusive language. We compare these polarized embedding representations against more standard generic embeddings, which are in principle representative of the English language.

Two machine learning models are used to measure the contribution of our polarized word embeddings to the detection of abusive language. We found that the polarized Reddit embeddings, created with the FastText algorithm, proved superior to the other source-driven representations. When applied in a bidirectional LSTM model, our polarized embeddings outperformed the pre-trained generic embeddings, even when used across multiple data sets.

The data sets used and the source code of this thesis can be found at https://github.com/ley0nn/ma_thesis.

Contents

Abstract
Preface
1 Introduction
2 Background
  2.1 A Brief History of Word Embeddings
  2.2 Related Work in Abusive Language Detection
    2.2.1 Data sets and phenomena of abusive language
    2.2.2 Work in abusive language detection
3 Data and Material
  3.1 Data for Word Representations
    3.1.1 Collection
      Twitter
      Reddit
      4chan
      8chan
      Other Sources
    3.1.2 Creation of Embeddings
      GloVe Algorithm
      FastText Algorithm
  3.2 Data Sets for Training and Testing
    3.2.1 HatEval
    3.2.2 OffensEval
    3.2.3 WaseemHovy
    3.2.4 Comparison of Data Sets
4 Method
  4.1 Best Algorithm for Polarized Embeddings
    4.1.1 Simple Model Approach
    4.1.2 Subword Information for OOV-Words
    4.1.3 Baseline Model
  4.2 Building a Better Architecture
    4.2.1 Stacking the Embeddings
  4.3 Across Data Sets with Polarized Embeddings
  4.4 Intrinsic Evaluation of Polarized Embeddings
5 Results and Discussion
  5.1 Results
    5.1.1 Best Algorithm for Polarized Embeddings
    5.1.2 Building a Better Architecture
    5.1.3 Across Data Sets with Polarized Embeddings
  5.2 Discussion
    5.2.1 Best Algorithm for Polarized Embeddings
    5.2.2 Building a Better Architecture
    5.2.3 Across Data Sets with Polarized Embeddings
6 Conclusion
7 Appendices
  7.1 List of Keywords used as Twitter Hashtag
  7.2 K Nearest Neighbours
    7.2.1 KNN - FastText Generic and Twitter Polarized

Preface

It has been a busy semester: many (group) meetings, lots of lines of code and a countless number of programming errors. Looking back, however, it has been one of the most interesting and rewarding projects I have ever undertaken.

I would like to thank my supervisor Tommaso Caselli for his guidance and support during the past half year. I would also like to thank my fellow students for their input and feedback during group meetings. Finally, I owe thanks to Roy David, Selina Postma and my parents.

Groningen, July 2019

1 Introduction

Anytime one engages online, whether on message board forums, comments, or social media, there is always a serious risk that he or she may be the target of ridicule and even harassment (Nobata et al., 2016). Online harassment has been a problem to a greater or lesser extent since the early days of the internet (Kennedy et al., 2017). To combat such behaviour, many internet companies have terms of service that typically forbid hateful and harassing speech on their platforms. They also employ human editors to catch bad language and remove offending posts. However, the increasing popularity of social media platforms has seen a well-acknowledged rise in the presence of toxic and abusive speech on these platforms (Kshirsagar et al., 2018). The increasing volume of data requires that ways are found to classify online content automatically (Merenda et al., 2018; Nobata et al., 2016; Kennedy et al., 2017).

While there has been a growing interest in automatic methods for abusive language detection in social media, there is still a lack of consensus on what abusive language actually is (Merenda et al., 2018). We therefore follow Founta et al. (2018) and consider abusive language as "[a]ny strongly impolite, rude or hurtful language using profanity, that can show a debasement of someone or something, or show intense emotion." (Founta et al., 2018).

In order to better model online abusive language, we will use word embedding representations. Word embeddings are an example of successful applications of unsupervised learning. Arguably, their main benefit is that they can be derived from large unannotated corpora that are readily available. However, one of the major downsides of using pre-trained embeddings is that the news data used for training them is often very different from the data on which we would like to use them (Ruder, 2017).

Previous research shows that "online communities tend to reinforce themselves, enhancing 'filter bubbles' effects, decreasing diversity, distorting information, and polarizing socio-political opinions" (Merenda et al., 2018). Communities thus represent different sources of data. By targeting the right communities for this task, we should be able to acquire the right amount of controversial data. Instead of targeting such communities, however, we select and scrape certain online forums on Reddit, 4chan and 8chan, on the assumption that we will find controversial data within these forums. In addition, we target controversial Twitter messages directly by looking for specific hashtags. From this controversial data, we create source-driven representations from which we generate polarized word embeddings using the GloVe and FastText algorithms. We use these embedding representations in a variety of models, using three data sets from the literature that cover different dimensions of abusive language to train and test these models. We then compare the contribution of our (potentially) polarized representations to more standard generic embeddings and to self-created generic embeddings from these domains.

Our formalized research questions read:

1 How can we create polarized embeddings to better model abusive language detection?

1.1 How do we collect polarized data?

1.2 How can we evaluate the "goodness" of our polarized word representations? 1.3 Do the sources used to generate polarized embeddings influence performance?


In the next chapter, word embeddings and abusive language detection in general will be discussed to gain a better understanding of the work that was previously done. Further, we will discuss the data we used: i.) to generate polarized word embeddings; and ii.) as training and test data for our machine learning classifiers. We will then motivate our methodology, as well as the intrinsic and extrinsic evaluation of the word embeddings. Next, we show and discuss our results. Finally, based on the results, we state our conclusions.

2 Background

2.1 A Brief History of Word Embeddings

Within this chapter, we explain previous research in the field of our research question. Before we can start looking at other approaches to detecting online abusive language, we have to dive into the history of word embeddings, to learn and understand how they work and what they were initially used for.

Word embeddings aim to capture syntactic and semantic word similarities based on their distributional properties in large samples of language data. The term was coined in 2003 by Bengio et al. (2003), who trained word embeddings in a neural language model consisting of a one-hidden-layer feed-forward neural network that predicted the next word in a sequence. Fifteen years later, the general building blocks of their model are still found in all current neural language and word embedding models (Ruder, 2016). These building blocks consist of: i.) the Embedding Layer, which generates word embeddings by multiplying an index vector with a word embedding matrix; ii.) the Intermediate Layer(s), which produce some intermediate representation of the input; and iii.) the Softmax Layer, which produces a probability distribution over the words in the vocabulary. Computing the softmax came with high computational costs. Owing to the lack of computing power at the time, research in word embeddings stagnated (Ruder, 2016).

In 2008, Collobert and Weston (2008) were the first to show the advantages of pre-trained word embeddings. By leveraging a large amount of unlabeled text data, they induced word embeddings, carrying syntactic and semantic meaning, which were shown to boost generalisation performance on downstream tasks. To avoid computing probabilities using the expensive softmax, they simply produce a score for the next word. Apart from removing the softmax layer, the architecture of Collobert and Weston (2008) remains the same as that of Bengio et al. (2003).

Five years later, Mikolov et al. (2013) introduced Word2Vec, arguably the most popular word embedding model and the one that led to the popularisation of word embeddings. In their paper, Mikolov et al. (2013) propose two novel model architectures for computing continuous vector representations of words from very large data sets. These architectures eliminate the non-linear hidden layer and thus minimise computational complexity, which is arguably one of their main benefits over the models of Collobert and Weston (2008) and Bengio et al. (2003).

The first proposed architecture is the Continuous Bag-of-Words (CBOW) model. Instead of predicting the next word by looking at words in the past, this model uses a window of n words around the target word (i.e. words in the history and the future of the target word are used to make a prediction). The second architecture, called the skip-gram model, is similar to CBOW, but instead of predicting the current word based on the context, it uses each current word as an input to predict surrounding words within a certain range. They found that increasing the range improved the quality of the resulting word vectors, but also led to an increase in computational complexity (Mikolov et al., 2013).

Before Mikolov et al. (2013), word vectors were often evaluated using simple similarity comparisons, for example the word France being similar to Italy and perhaps some other countries. Mikolov et al. (2013) subjected their vectors to a more complex similarity task, looking at different types of similarities between words. For example, to find a word that is similar to small in the same sense as biggest is similar to big, they compute a vector by subtracting the vector of big from biggest and adding it to the vector of small. They then search the vector space for the word closest to the resulting vector and use it as the answer. Some examples of such relationships between words can be found in Table 1.


vec(biggest) − vec(big) + vec(small) ≈ vec(smallest)
vec(king) − vec(man) + vec(woman) ≈ vec(queen)
vec(France) − vec(Paris) + vec(Germany) ≈ vec(Berlin)
vec(Feyenoord) − vec(Rotterdam) + vec(Amsterdam) ≈ vec(AJAX)

Table 1: Examples of Word2Vec analogy relationships.

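Such analogy queries can be reproduced with the gensim library, as in the minimal sketch below. The path to the pre-trained vectors is a placeholder, and the exact results depend on the embeddings that are loaded.

```python
from gensim.models import KeyedVectors

# Placeholder path to any vectors stored in Word2Vec format.
vectors = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# vec(biggest) - vec(big) + vec(small): the top result should approximate vec(smallest).
print(vectors.most_similar(positive=["biggest", "small"], negative=["big"], topn=1))
# vec(king) - vec(man) + vec(woman) should approximate vec(queen).
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
```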

A year later, in 2014, Pennington et al. (2014) released GloVe (https://nlp.stanford.edu/projects/glove), a competitive set of pre-trained word embeddings that takes a different approach. Where both the CBOW and skip-gram models only take local contexts into account, GloVe takes advantage of global context. The model learns vectors of words from their co-occurrence information, i.e. how frequently they appear together in a large text corpus. At the time, GloVe outperformed other models on word analogy, word similarity and named entity recognition tasks (Pennington et al., 2014).

One of the main problems with using pre-trained word embeddings is that they are unable to deal with out-of-vocabulary (OOV) words (Ruder, 2017). Typically, OOV words are omitted or assigned a placeholder or an n-dimensional vector of zeroes. All of these methods are ineffective if the number of OOV words is large. One way to mitigate this issue is the use of subword-level embeddings. Word embeddings have been augmented with subword-level information for many applications such as named entity recognition, part-of-speech tagging, dependency parsing, and language modelling (Ruder, 2017). For incorporating character information into pre-trained embeddings, however, character n-gram features have been shown to be very powerful. In their work, Bojanowski et al. (2017) propose a new approach based on the skip-gram model of Mikolov et al. (2013), where each word is represented as a bag of character n-grams, words being represented as the sum of these representations. One advantage is that words which did not appear in the training data can still be provided with a word representation, resulting in better coverage and state-of-the-art performance on word similarity and analogy tasks.
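The bag-of-character-n-grams idea can be illustrated with a small helper function of our own (not part of any library), following the description in Bojanowski et al. (2017), where "<" and ">" mark word boundaries:

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of a word, FastText-style, with < and > marking word boundaries."""
    token = f"<{word}>"
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(token[i:i + n] for i in range(len(token) - n + 1))
    return grams

# For n = 3 only: ['<wh', 'whe', 'her', 'ere', 're>']
print(char_ngrams("where", n_min=3, n_max=3))
```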

2.2 Related Work in Abusive Language Detection

2.2.1 Data sets and phenomena of abusive language

The evaluation of pre-trained word embeddings is often performed on intrinsic tasks such as word similarity, for comparison against previous embedding approaches. It has been questioned whether such intrinsic evaluation can predict the merits of the representations for downstream tasks (Chiu et al., 2016). The RepEval Workshop at ACL 2016 exclusively focused on better ways to evaluate pre-trained embeddings. What emerged from this workshop is that the best way to evaluate embeddings is extrinsic evaluation on downstream tasks, which in our case means detecting abusive language in online content.

When speaking of abusive language, there are different phenomena. Such language could, for example, contain swearing, threats, hate and inappropriate terms. Many efforts have been made to classify forms of abusive language, for example hate speech, using scraped data from online message forums and popular social media sites (e.g. Twitter and Facebook). Waseem and Hovy (2016) provide a list of criteria founded in critical race theory, and use them to annotate a corpus of more than sixteen thousand tweets for hate speech, divided into tweets annotated for racist content, sexist content, or neither.


We re-use their corpus in this work. After the annotation process, Waseem and Hovy (2016) investigated which features best assist the detection of hate speech by building upon this data set. They show that only gender plays an important role, while geographic and word-length distributions are typically ineffective.

Davidson et al. (2017) provide a data set that has also been studied widely. The focus of this work was mainly to distinguish between hateful and offensive language. They found that offensive language contains offensive terms which are not necessarily inappropriate, while hate speech expresses hatred towards a targeted group or is intended to be derogatory, to humiliate, or to insult the members of the group (Davidson et al., 2017). They find that certain terms are particularly useful for distinguishing between hate speech and offensive language. Many of the most hateful tweets contain multiple homophobic and/or racial slurs, meaning it is easy to misclassify hate speech if a given tweet does not contain any curse words or offensive terms.

As a consequence of the growing interest in automatic methods for abusive language detection in social media, shared evaluation exercises have been organised, such as the GermEval 2018 Shared Task (https://bit.ly/31Ss5P6), the SemEval 2019 Task 5: HatEval (https://bit.ly/2EEC7Me) and Task 6: OffensEval (https://bit.ly/2P7pTQ9), and the EVALITA 2018 Hate Speech Detection task (https://bit.ly/2XAzeE8).

The SemEval 2019 Task 5: HatEval is about the detection of hate speech against immigrants and women in Spanish and English messages extracted from Twitter (Basile et al., 2019). In this work, we re-use the HatEval data set for further research. The task is divided into two related subtasks for both languages: a basic task about hate speech, and a second one in which fine-grained features of hateful content are investigated in order to understand how existing approaches may deal with the identification of especially dangerous forms of hate, for example those where the incitement is against an individual rather than against a group of people, and where aggressive behaviour of the author can be identified as a prominent feature of the expression of hate (Zhang et al., 2019).

We also re-use the data set provided with SemEval 2019 Task 6: OffensEval. This shared exercise is on identifying and categorising offensive language in social media, featuring three sub-tasks. In sub-task A, the goal was to discriminate between offensive and non-offensive posts. In sub-task B, the focus was on the type of offensive content in the post. Finally, in sub-task C, systems had to detect the target of the offensive posts (Zampieri et al., 2019b).

2.2.2 Work in abusive language detection

Additional projects, including this work, have built upon these data sets in order to detect these different phenomena of abusive language.

One example is the work of Malmasi and Zampieri (2017), who examine methods to detect two phenomena of abusive language in social media, namely hate speech and offensive language. In their work they use the hate speech data set by Davidson et al. (2017), where messages were annotated as 'hate', 'offensive' or 'not hateful or offensive'. They applied a character 4-gram model using a linear SVM classifier, resulting in an accuracy of 78% and showing that distinguishing profanity from hate speech is a very challenging task.

More work concerning hate and abusive speech detection is that of Lee et al. (2018). They conducted a comparative study of various learning models on the 'Hate and Abusive Speech on Twitter' data set by Founta et al. (2018), and discuss the possibility of using additional features and context data for improvements. This data set contains 100K tweets which are classified into four labels: 'normal', 'spam', 'hateful' and 'abusive'.


They replaced IDs, URLs and frequently used emojis with a special placeholder. As hashtags tend to have a high correlation with the content of the tweet, they used a segmentation library for hashtags to extract more information. For example, the hashtag '#makeamericagreatagain' would be processed into 'make', 'america', 'great', and 'again'. Lee et al. (2018) implemented both more traditional machine learning models, for example tf-idf features used in a Logistic Regression model, and neural network based models like a bidirectional GRU network. The latter proved to be most accurate in this task. However, results were not satisfactory, due to "the lack of labeled data as well as the significant imbalance among the different labels" (Lee et al., 2018).

In this work, we focus on the contribution of word embeddings to these downstream tasks. Word embeddings are an example of successful applications of unsupervised learning, and have been widely used in the field of Natural Language Processing, including for detecting forms of abusive language.

For example, looking at the best performing teams at the aforementioned shared evaluation exercises, we observe that most of these teams employ (pre-trained) word embeddings. At the SemEval 2019 Task 5: HatEval (Basile et al., 2019), the team of Indurthi et al. (2019), with the highest macro-averaged F1-score, employed an SVM model with an RBF kernel, exploiting sentence embeddings from Google's Universal Sentence Encoder. The third, fourth and fifth teams exploited FastText, BERT (Bidirectional Encoder Representations from Transformers) (Devlin et al., 2018) and GloVe word embeddings, respectively.

Among the top-10 teams at the SemEval 2019 Task 6: OffensEval, seven teams exploited the BERT (Devlin et al., 2018) embeddings with variations in the parameters and in the preprocessing steps. "The top non-BERT model employed Twitter Word2Vec embeddings and token/hashtag normalization" (Zampieri et al., 2019b).

Another use of word embeddings for detecting hate speech is that of Kshirsagar et al. (2018), who use the pre-trained 300-dimensional GloVe Common Crawl embeddings (840B tokens) (Pennington et al., 2014) for classifying online hate speech in general, as well as racist and sexist speech in particular. They used three data sets to train and evaluate their classifier: the Sexist/Racist data set by Waseem and Hovy (2016), the Hate data set by Davidson et al. (2017) and the Harassing data set by Golbeck et al. (2017).

Kshirsagar et al. (2018) propose a Transformed Word Embedding Model (TWEM), which relies on a simple word embedding (SWEM)-based architecture by Shen et al. (2018). Kshirsagar et al. (2018) modify the SWEM-concat architecture to allow better handling of infrequent and unknown words and to capture non-linear word combinations. First, to better handle rare or unseen tokens in the training data, they transform each word embedding by applying a Multi-Layer Perceptron with a Rectified Linear Unit (ReLU) activation to form an updated embedding space. They use two pooling methods on the updated embedding space: max pooling to capture salient word features, and average pooling to capture the overall meaning of the sentence. Kshirsagar et al. (2018) then concatenate both vectors to form a document representation. Despite minimal tuning of hyper-parameters, fewer parameters, minimal text preprocessing and no additional metadata, their model performs well on standard hate speech data sets.

Instead of using pre-trained word embeddings, Merenda et al. (2018) based their approach on previous research showing that "online communities tend to reinforce themselves, enhancing 'filter bubbles' effects, decreasing diversity, distorting information, and polarizing socio-political opinions". Communities then represent different sources of data, and thus, by targeting the right community (e.g. white supremacist groups), it seems easier to extract data that contains a stronger bias and polarisation than by using controversial topics (Merenda et al., 2018).


Building on this principle, they scraped data from selected social media communities on Facebook which may promote or be the target of hate speech, such as pages known for promoting nationalism, controversies, hate against migrants and other minorities, or support for women and LGBT rights. From this data, they generated polarized, hate-rich word embeddings using the Word2Vec skip-gram model (Mikolov et al., 2013). These hate-rich embeddings were used in models for hate speech detection, comparing them to larger, pre-trained generic embeddings trained on the Italian Wikipedia using GloVe (Berardi et al., 2015). Despite their embeddings being almost 25 times smaller than the generic ones, they yield substantially better performance for classifying hate speech. Their work proves the utility of polarized representations for such tasks. Our approach relies on that of Merenda et al. (2018), but instead of targeting online communities, we collect our controversial data by targeting hand-picked keywords and specific online discussion boards. In addition to Twitter, we will use multiple online sources such as Reddit, 4chan and 8chan, and evaluate our polarized word embeddings on different data sets with Twitter as their domain, thus using our models on both in-domain and cross-domain data.

Domain adaptation can also be found in the work of Karan and Šnajder (2018). They investigate to what extent models trained to detect general abusive language can generalise between different data sets, each labeled with different abusive language types. They compare the cross-domain performance of simple classification models on nine different data sets, including the data from Waseem and Hovy (2016), and find that the models fail to generalise to different-domain data sets, even when trained on much larger out-of-domain data. This indicates that having in-domain data, even if not much of it, is crucial for achieving good performance on this task (Karan and Šnajder, 2018).

3 Data and Material

During this study, data is used in two ways in the context of abusive language detection, namely to generate polarized word embeddings and as training and test data for the extrinsic evaluation of these embeddings.

3.1 Data for Word Representations

One of the major downsides of using pre-trained word embeddings is that the data used to generate them is often very different from the data on which we would like to use them (Ruder, 2017). We therefore create polarized embeddings built on a corpus that is not randomly representative of the English language, but rather collected with a specific bias.

In this section, we will answer subquestion 1.1: How do we collect polarized data? We describe which controversial data we use to generate our embeddings, how we collect it and how we process it for further usage.

3.1.1 Collection

In their work, Merenda et al. (2018) selected a set of publicly available Facebook pages to target certain communities which may promote or be the target of hate speech. Our approach relies on this idea, but in contrast, we target different sources for retrieving controversial data. Instead of targeting an online community, we target the social networking service Twitter and the online discussion boards Reddit, 4chan and 8chan. From these boards, we select multiple sub-boards. We do not know whether these boards contain abusive speech, but we assume they contain discussions with a lot of controversy. In addition, to collect data outside online communities and discussion boards, we target abusive language directly by scraping Twitter comments containing specific hashtags.

Using multiple data sources results in multiple domain representations, which we later use for cross-testing between different sources. We aim to collect at least 100 million tokens per source, so that each of our word embeddings ends up with a decent vocabulary size.

Twitter

Instead of scraping Twitter data ourselves, we used an already available Twitter data set collected by the University of Groningen. This collection consists of 1,385,132,999 tweets which were scraped over a period of exactly one year, namely 2016.

Rather than targeting a specific community, we filter the Twitter data using certain keywords. We compiled a list using two websites: the ProCon debate topics page (https://www.procon.org/debate-topics.php) and the Wikipedia List of controversial issues (https://en.wikipedia.org/wiki/Wikipedia:List_of_controversial_issues). From these websites, we manually selected 310 keywords; a copy of this list can be found in Appendix 7.1.

Using these keywords, we filtered the Twitter data set, only keeping the comments which contain at least one of the keywords as a hashtag. Out of 1,385,132,999 tweets, we ended up with a collection of 6,207,344 polarized tweets containing a total of 132,978,104 tokens (Table 3). The ten most frequent hashtags can be found in Table 2.


count      %      hashtag
801,176    12.9   #trump
607,496    9.8    #maga
471,564    7.6    #brexit
410,401    6.6    #tcot
318,673    5.1    #blacklivesmatter
252,528    4.1    #auspol
249,163    4.0    #feelthebern
243,496    3.9    #nevertrump
189,127    3.1    #makeamericagreatagain
166,494    2.7    #trump2016

Table 2: The ten most frequent hashtags in our selected tweets, sorted by the number of tweet occurrences, with their percentage of the selected collection.

For this source, we do not scrape "general", non-polarized data, since we want to compare the embeddings created from these tweets against the already available generic GloVe Twitter embeddings.
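The hashtag filter itself can be sketched as follows. The file names and the assumption that each line of the collection is a JSON object with a "text" field are hypothetical; the keyword list is the one in Appendix 7.1.

```python
import json

# Hypothetical file names.
with open("keywords.txt", encoding="utf-8") as f:
    keywords = {line.strip().lower() for line in f if line.strip()}

def has_polarized_hashtag(tweet_text):
    """True if the tweet contains at least one keyword used as a hashtag."""
    hashtags = {tok[1:].lower() for tok in tweet_text.split() if tok.startswith("#")}
    return bool(hashtags & keywords)

with open("tweets_2016.jsonl", encoding="utf-8") as infile, \
        open("tweets_polarized.txt", "w", encoding="utf-8") as outfile:
    for line in infile:
        text = json.loads(line).get("text", "")
        if has_polarized_hashtag(text):
            outfile.write(text.replace("\n", " ") + "\n")
```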

Reddit

The second source we scrape is Reddit, an online forum where people can start conversations and discussions. These discussions are divided into subreddits. Each subreddit focuses on a particular subject, and any user (called redditor) can create one; subreddit creators automatically become moderators on those subreddits. For example, there are subreddits dedicated to soccer, origami or more educational topics like Natural Language Processing. But, since there are boards for all topics you can think of, there is a variety of subreddits dedicated to controversial topics, like white supremacy, Islamophobia and anti-feminism.

Polarized Reddit data: For the collection of controversial Reddit content, we used a list found on RationalWiki (https://rationalwiki.org/wiki/Reddit). This page provides a list of 48 subreddits containing a lot of abusive speech, such as hate and racism. One example is /r/conspiracy: "The front page of paranoia on the Internet. Run by a white supremacist moderator who worships Timothy McVeigh, they made Hitler Did Nothing Wrong their official documentary long before Trump ever considered running for president. Theories discussed here run the gamut from 9/11 to chemtrails to lizard people." (RationalWiki, 2018).

Generic Reddit data: Unlike for Twitter, there are no generic word embeddings created from Reddit data publicly available. In order to collect non-polarized comments, we used a list of the 31 most subscribed subreddits (https://medium.com/@davis1/the-31-biggest-subreddits-f95c1f1f5e97), i.e. the most popular subreddits.

Scraping Reddit data: For scraping the Reddit comments we use PRAW, the Python Reddit API Wrapper (https://praw.readthedocs.io/). With PRAW, we can access our list of subreddits and their comments. Some of the 48 subreddits are quarantined, meaning they are not publicly visible. However, using the PRAW tool, we are still able to access these quarantined forums. We were able to scrape 2,763,575 polarized comments (containing 96,093,957 tokens) and 4,273,990 more general messages (containing 148,613,530 tokens), as shown in Table 3.


4chan

The third source we scrape is 4chan. 4chan is an image board, originally meant for people to post and share images. One could compare it to Reddit, but the key difference is that it is anonymous. People can post anything without it being associated with them, which is why there is a lot of cursing, insults, racism, sexism and other vulgar behaviour. This makes 4chan a perfect source for scraping controversial data.

Polarized 4chan data: Just like Reddit, 4chan is divided into subparts, which are called boards. 4chan has 70 boards, each dedicated to a different topic. A few of those 70 are well known for their vulgarity, for example /pol/, the Politically Incorrect board, which is home to a great deal of racism and politically incorrect ideas. Obviously not all of its users fit that mold, but enough of them do. 4chan has more than one board which could contain hateful, controversial content, but since /pol/ is the most popular one, we only use this board.

Generic 4chan data: While 4chan is home to some of the most vulgar discussions, there are other boards that are actually fairly tame, like /v/ for video games and /mu/ for music. Just like for Reddit, we scrape a few of these "neutral" boards for the creation of generic 4chan embeddings. Examples of these boards are /a/ (Anime & Manga), /g/ (Technology) and /p/ (Photography). While we assume these boards contain less profanity, we have to keep in mind that they still come from 4chan, and thus tend to be controversial.

Scraping 4chan data: For scraping 4chan comments we use BASC-py4chan (https://basc-py4chan.readthedocs.io/), a Python library that gives access to the 4chan API and an object-oriented way to browse and get board and thread information quickly and easily. However, there is one limitation: this tool can only scrape threads (discussions) which are not archived yet. Per board, the 150 most active threads stay visible on the homepage, and can thus be accessed by users. If a thread is not being used anymore, it gets discarded and ends up in the archive. Expired threads within this archive stay visible for three days before they disappear from the website.

Using BeautifulSoup 4 (https://pypi.org/project/beautifulsoup4/), a Python package for parsing HTML and XML documents, we therefore scrape all boards and their archives every three days, until we reach the desired number of tokens. Using this method, we managed to scrape 3,800,235 polarized 4chan comments (containing 96,074,523 tokens) and 3,124,981 general messages (containing 98,703,507 tokens) (Table 3).

8chan

The fourth source we scrape is 8chan. 8chan is the same kind of image board as 4chan, but it is even a bit more extreme. For example, on March 15, 2019, there were two consecutive terrorist attacks at mosques in Christchurch, New Zealand. The shooter shared links to the live stream video on 8chan and on Facebook only minutes before the attack. Some members of 8chan re-shared it and applauded the violent murders.

Scraping 8chan data: On 8chan, there is a lot of hatred, racism and sexism, making it an interesting source for collecting controversial content. Just like 4chan, 8chan is a collection of boards, which can be scraped with a Python library called py8chan (https://pypi.org/project/py8chan/). Since there is hardly any "neutral" data to be found on this website, we only scrape data from the most controversial boards, from which we later create polarized 8chan embeddings. We ended up with a collection of 473,267 comments containing 18,500,194 tokens (Table 3).


collection           comments     tokens
Twitter polarized    6,207,344    132,978,104
Reddit polarized     2,763,575    96,093,957
Reddit generic       4,273,990    148,613,530
4chan polarized      3,800,235    96,074,523
4chan generic        3,124,981    98,703,507
8chan polarized      473,267      18,500,194

Table 3: Distribution of the scraped data from Twitter, Reddit, 4chan and 8chan.

Other Sources

Just like Merenda et al. (2018), we would have liked to target specific communities for collecting polarized source-driven representations; we intended to use Facebook and GAB. Unfortunately, the Graph API (https://developers.facebook.com/docs/graph-api/), which is normally used for scraping Facebook pages, was not working during the period in which we scraped our data. We therefore omit Facebook. GAB (https://gab.com/) is an English-language social media website, known for its mainly far-right user base. The site has been described as "extremist friendly" or a "safe haven" for neo-nazis, white supremacists, and the alt-right. Scraping GAB could have led to some interesting results. However, to access the GAB website, one needs to be invited. We therefore omit GAB as well.

3.1.2 Creation of Embeddings

Over the acquired data (Table 3) we build multiple distributed representations using the GloVe and FastText algorithms. Within this subsection we describe each algorithm, give the parameters and preprocessing steps used, and report the final models.

GloVe Algorithm

One of the algorithms we use for embedding creation is GloVe by Pennington et al. (2014). GloVe takes advantage of global context. The model learns vectors of words from their co-occurrence information, i.e. how frequently they appear together in our acquired data. Before we can generate the GloVe embeddings, we employ the same preprocessing steps as Pennington et al. (2014). In their paper, they mention which preprocessing steps they take before generating the Common Crawl word embeddings. However, they used a different approach for the GloVe Twitter embeddings, found in their Ruby script (https://nlp.stanford.edu/projects/glove/preprocess-twitter.rb), which we re-use. Here, they use a few regular expressions to replace text with placeholders. The preprocessing steps can be found in Table 4; for example, a tweet is transformed into output of the following form:

for tonight's <hashtag> dem debate, be sure to follow <user>, <user>, <user>, <user>, <user>, and <user> for rapid response, fact checking, and the truth! <hashtag> maga <allcaps> <hashtag> kag <allcaps> sorry, i'm on air force one, off to save the free world!


placeholder      input                output
<URL>            https://www.rug.nl   <URL>
<USER>           @realDonaldTrump     <USER>
<HASHTAG>        #blacklivesmatter    <HASHTAG> blacklivesmatter
<NUMBER>         2                    <NUMBER>
<SMILE>          :)                   <SMILE>
<LOLFACE>        :-p                  <LOLFACE>
<SADFACE>        :(                   <SADFACE>
<NEUTRALFACE>    :-/                  <NEUTRALFACE>
<HEART>          <3                   <HEART>
<ALLCAPS>        #MAGA                <HASHTAG> maga <ALLCAPS>
<REPEAT>         !!!                  ! <REPEAT>
<ELONG>          wayyyy               way <ELONG>

Table 4: Preprocessing steps of the GloVe embeddings by Pennington et al. (2014).

GloVe embedding      minimal frequency    context window    vocabulary size
Twitter generic      5                    10                1,200,000
Twitter polarized    5                    10                265,459
Reddit generic       5                    10                320,058
Reddit polarized     5                    10                378,986
4chan generic        5                    10                357,136
4chan polarized      5                    10                285,024
8chan polarized      5                    10                93,934
Twitter polarized    1                    5                 1,242,949
Reddit generic       1                    5                 3,462,839
Reddit polarized     1                    5                 2,242,769
4chan generic        1                    5                 4,420,517
4chan polarized      1                    5                 3,437,450
8chan polarized      1                    5                 673,583

Table 5: Change in vocabulary size when changing the minimal frequency and context window for the GloVe algorithm.

In order to compare our embeddings with the generic GloVe Twitter embeddings, we used the same parameters. Embeddings were generated using 200 dimensions, an x_max of 100 and an α of 3/4. Pennington et al. (2014) use a minimal word frequency of 5 and a context window of 10; the latter refers to the number of surrounding words used to determine the context of each word. The original GloVe Twitter embeddings are generated using 27 billion tokens, resulting in a vocabulary of 1.2 million words. Using the same parameters, we obtain a fairly small vocabulary from our data collections, leading to smaller word coverage in further experiments. We therefore choose a lower minimal frequency of 1 and a context window of 5; the differences in vocabulary size are shown in Table 5. For further usage in this work, we omit the six embeddings with a minimal frequency of 5 and a context window of 10, and use the embeddings with the bigger vocabulary.

FastText Algorithm

The second algorithm for generating word embeddings is FastText, an approach by Bojanowski et al. (2017) based on the skip-gram model of Mikolov et al. (2013), where each word is represented as a bag of character n-grams, words being represented as the sum of these representations. Bojanowski et al. (2017) mention as one advantage of FastText that it can provide a word with a representation even if that specific word does not appear in the training data, resulting in better word coverage.

For our FastText models we adopt the parameters used by Bojanowski et al. (2017): the word vectors have 300 dimensions, we use a context window of size 5 and, when building the word dictionary, we keep the words that appear at least 5 times in the training set.


FastText embedding    minimal frequency    context window    vocabulary size
Wiki News generic     5                    5                 1,000,000
Twitter polarized     5                    5                 283,360
Reddit generic        5                    5                 295,028
Reddit polarized      5                    5                 219,673
4chan generic         5                    5                 350,680
4chan polarized       5                    5                 268,583

Table 6: Final word embeddings using the FastText algorithm.

The final models and their corresponding vocabulary sizes are shown in Table 6. We omit 8chan as a source for FastText, since its low token count resulted in unreliable embeddings.

We compare our models to FastText's 1 million word vectors trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news data set (16B tokens). Bojanowski et al. (2017) do not mention any preprocessing steps in their paper. Since most of the usernames and hyperlinks occur only once, we replace them with a placeholder: <USER> for usernames, <HASHTAG> plus the actual hashtag text for hashtags and <URL> for hyperlinks, just as we do for the GloVe embeddings.
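Training such an embedding model with the parameters listed above can be sketched with gensim's FastText implementation as follows; the corpus path is a placeholder, and the original fastText command-line tool may have been used instead, so this is only an approximation.

```python
from gensim.models import FastText

# Placeholder corpus path: one preprocessed comment per line.
corpus_file = "reddit_polarized.txt"

model = FastText(
    corpus_file=corpus_file,
    vector_size=300,   # 300-dimensional vectors ("size" in gensim < 4.0)
    window=5,          # context window of 5
    min_count=5,       # keep words occurring at least 5 times
    sg=1,              # skip-gram, as in Bojanowski et al. (2017)
    epochs=5,
)
model.wv.save("fasttext_reddit_polarized.kv")
```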

3.2 Data Sets for Training and Testing

Within this section we describe each data set we use during the extrinsic evaluation part of this study. The three data sets we use all address a different phenomenon of abusive language. Two of them address the category of hate speech, while the other addresses the category of offensive language. The domain, however, is the same for all three data sets, which is Twitter. At the end of this section we provide the label distribution of each data set and use a tool to measure the (dis)similarity between them.

3.2.1 HatEval

Basile et al. (2019) employed different approaches to collect tweets: (1) monitoring potential victims of hate accounts, (2) downloading the history of identified haters and (3) filtering Twitter streams with keywords, i.e. words, hashtags and stems. The data have been collected in the time span from July to September 2018, with the exception of the data with target "woman", which for the most part has been derived from an earlier collection for the EVALITA 2018 Hate Speech Detection task (https://bit.ly/2XAzeE8). The entire HatEval data set is composed of 19,600 tweets, of which 13,000 are English.

The data set is labeled in three categories: (1) hate speech, a binary value indicating whether hate speech is occurring against one of the given targets (women or immigrants), (2) target range, whether the hate speech is directed towards a generic group of people or a specific individual, and (3) aggressiveness, whether the tweeter is aggressive or not. Since we are only detecting abusive language, we omit the latter two labels.

The annotation process involved untrained contributors on the crowd-sourcing platform Figure Eight (https://www.figure-eight.com/). The organisers gave the annotators a series of guidelines, including the definition of hate speech against the two targets considered, the definition of aggressiveness and a list of examples (https://github.com/msang/hateval/blob/master/annotation_guidelines.md). Figure Eight reported an annotator agreement of 0.83 on the field we are interested in. Crowd-sourcing has already been used successfully in several tasks, including hate speech detection, for example for the Davidson et al. (2017) data set.


However, Basile et al. (2019) decided to add two more expert annotators to all the crowd-annotated data, to maximise consistency with the guidelines and to provide a solid evaluation benchmark for this task (Basile et al., 2019). They assigned the final label for this data based on majority voting over the Figure Eight contributors, expert 1 and expert 2.

However, we question the quality of the annotations when looking at some random samples from the data. For example, the sentence "Money can't buy you happiness but it can buy you pussy" is labeled as hate speech, while "Real shit like stfu and eat my pussy since u care so much" is not. Also, "FUCK YOU TWICE YOU PUSSY ASS NIGGAS @AtlantaFalcons", "@loserscas morrissey is a fat skank bitch" and "asher is an idiot.. if 4 men take a woman upstairs of course they're gonna rape her u fucking cunt" are not labeled as hate speech. We believe that these examples are not properly labeled, which could indicate poor annotation quality for the whole data set.

3.2.2 OffensEval

The training and testing material for OffensEval is the Offensive Language Identification Data set (OLID), which was built specifically for this task (Zampieri et al., 2019b). OLID is a large collection of 14,100 English tweets annotated using a hierarchical three-layer annotation model. The three categories are: (1) offensive, a binary value indicating whether a post contains any form of non-acceptable language or a targeted offense, which can be veiled or direct; this includes insults, threats, and posts containing profane language or swear words; (2) categorisation of offensive language, whether the post is a targeted or untargeted insult; and (3) target identification, whether the post targets an individual, a group of people considered as a unity due to the same ethnicity, gender or sexual orientation, or whether the target of the offensive post does not belong to either of the previous two categories (e.g., an organization, a situation, an event, or an issue) (Zampieri et al., 2019a). We only use the first label, which indicates whether a post contains offensive language or not, and therefore omit the other two categories.

Just like for the SemEval 2019 Task 5, the annotation process involved contributors on the crowd-sourcing platform Figure Eight. In contrast, however, Zampieri et al. (2019b) only hired experienced annotators on the platform to ensure the quality of the annotation. All the tweets were annotated by two people. In case of disagreement, a third annotation was requested, and ultimately a majority vote was used.

3.2.3 WaseemHovy

In their work, Waseem and Hovy (2016) provide a data set which they collected over the course of two months. The collection was done by performing an initial manual search of common slurs and terms pertaining to religious, sexual, gender, and ethnic minorities (Waseem and Hovy, 2016). In the results, they identified frequently occurring terms that contain hate speech and a small number of prolific users from these searches. Based on this sample, Waseem and Hovy (2016) used the public Twitter search API to collect the entire corpus, filtering out tweets not written in English.

They ended up with 16,914 tweets, which they then manually annotated for hate speech. To identify hate speech, Waseem and Hovy (2016) proposed the following list:

A tweet is offensive if it:

1. uses a sexist or racial slur.

2. attacks a minority.

3. seeks to silence a minority.


4. criticizes a minority (without a well-founded argument).

5. promotes, but does not directly use, hate speech or violent crime.

6. criticizes a minority and uses a straw man argument.

7. blatantly misrepresents truth or seeks to distort views on a minority with unfounded claims.

8. shows support of problematic hash tags. E.g. “#BanIslam”, “#whoriental”, “#whitegenocide”

9. negatively stereotypes a minority.

10. defends xenophobia or sexism.

11. contains a screen name that is offensive, as per the previous criteria, the tweet is ambiguous (at best), and the tweet is on a topic that satisfies any of the above criteria.

Using this list, Waseem and Hovy (2016) annotated 3,383 tweets as sexist content, 1,972 as racist and 11,559 as neither sexist nor racist. The inter-annotator agreement is 0.84. In most of the disagreement cases, they found that the disagreement relies on context or the lack thereof.

         OffensEval           WaseemHovy           HatEval
         NOT       ABU        NOT       ABU        NOT       ABU
train    8,840     4,400      11,559    5,346      5,217     3,783
dev      n/a       n/a        n/a       n/a        427       573
test     620       240        n/a       n/a        1,719     1,252
total    14,100               16,905               12,971

Table 7: Label distribution of each data set.

3.2.4 Comparison of Data Sets

As the three data sets all target a different aspect of abusive language, we collapse their labels into the category of 'abusive language'. We use two macro-categories: abusive (ABU) vs. non-abusive (NOT). Table 7 shows, for each data set, the distribution of the labels in training, dev(elopment), and test.

Originally, each data set was collected and labeled following a different strategy, thus targeting a different phenomenon of abusive language. Although we collapse the labels, it is still interesting to measure the (dis)similarity between these data sets, thereby capturing (dis)similarities between abusive language types.

We do so by using the Jensen-Shannon divergence (JSD), a method of measuring the similarity between two probability distributions, where a higher score means more similar data sets. The results are shown in Table 8; note that we did not perform any preprocessing on these data sets. The OffensEval data set has a higher JSD score with respect to the HatEval and WaseemHovy data sets than the similarity score between these latter two sets. This is a bit odd: both HatEval and WaseemHovy are labeled for hate speech, so one would expect a higher similarity score between these two.
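One way such a comparison can be computed is sketched below, assuming unigram (word-frequency) distributions over the two corpora; the file names are placeholders. Note that scipy's jensenshannon returns a distance (higher means more different), so the similarity-style scores in Table 8, where higher means more similar, presumably apply an additional transformation that is not reproduced here.

```python
from collections import Counter
from scipy.spatial.distance import jensenshannon

def unigram_distributions(text_a, text_b):
    """Word-frequency distributions of two corpora over their joint vocabulary."""
    counts_a, counts_b = Counter(text_a.split()), Counter(text_b.split())
    vocab = sorted(set(counts_a) | set(counts_b))
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    p = [counts_a[w] / total_a for w in vocab]
    q = [counts_b[w] / total_b for w in vocab]
    return p, q

# Placeholder file names, one tweet per line.
p, q = unigram_distributions(open("hateval_train.txt", encoding="utf-8").read(),
                             open("offenseval_train.txt", encoding="utf-8").read())

# Jensen-Shannon distance: 0 = identical distributions, 1 = maximally different (base 2).
distance = jensenshannon(p, q, base=2)
print(distance, 1 - distance)
```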

In Section 4.3, we will test whether our models trained to detect a specific phenomenon of abusive language can generalise to other types of such language. As the OffensEval and WaseemHovy data sets are most similar, we expect better performance when training on one of these sets and testing on the other, and vice versa.


              HatEval    OffensEval    WaseemHovy
HatEval       x          .674          .647
OffensEval    .674       x             .679
WaseemHovy    .647       .679          x

Table 8: Jensen-Shannon similarity scores between the three data sets.

4 Method

Within this chapter, we describe how we evaluate our polarized word embeddings. First, the best way to evaluate our word embeddings is by performing an extrinsic evaluation: we measure the contribution of our word embeddings to a specific task, which in our case is detecting abusive language. Finally, we employ an intrinsic evaluation to measure the quality of our word embeddings. Here, we will answer subquestion 1.2: How can we evaluate the "goodness" of our polarized word representations?

4.1 Best Algorithm for Polarized Embeddings

4.1.1 Simple Model Approach

To determine which word embedding performs best, we start by using them in simple machine learning models. We set up our experiments as follows: for each data set, we train and test on the same data distribution using a Linear Support Vector Machine (SVM), from the LinearSVC scikit-learn implementation (Pedregosa et al., 2011). Tweets from the training and testing data are tokenized using the TweetTokenizer from the NLTK library (Bird and Loper, 2004) and stopwords are removed.

Each model uses only information from a word embedding as its feature. We apply max pooling over these word embeddings, w, to obtain a 200- or 300-dimensional representation of the full tweet instance, i, for GloVe and FastText embeddings respectively. Words not covered by the embeddings are ignored.
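This setup can be sketched as follows; the embedding path is a placeholder and the training data shown is toy data standing in for one of the data sets described in Section 3.2.

```python
import numpy as np
from gensim.models import KeyedVectors
from nltk.corpus import stopwords
from nltk.tokenize import TweetTokenizer
from sklearn.svm import LinearSVC

# Placeholder embedding path; requires nltk.download("stopwords") beforehand.
vectors = KeyedVectors.load("fasttext_reddit_polarized.kv")
tokenizer = TweetTokenizer()
stop_words = set(stopwords.words("english"))
dim = vectors.vector_size

def tweet_vector(tweet):
    """Max pooling over the embeddings of all in-vocabulary, non-stopword tokens."""
    tokens = [t for t in tokenizer.tokenize(tweet.lower())
              if t not in stop_words and t in vectors]
    if not tokens:                      # tweet fully out of vocabulary
        return np.zeros(dim)
    return np.max(np.stack([vectors[t] for t in tokens]), axis=0)

train_tweets = ["you are all wonderful people", "go back to where you came from"]  # toy data
train_labels = ["NOT", "ABU"]
X_train = np.stack([tweet_vector(t) for t in train_tweets])
clf = LinearSVC().fit(X_train, train_labels)
```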

Since our training and testing data consist of only tweets, we start by using our model with the embeddings created from Twitter data. Here, we compare our self-created polarized Twitter embeddings to the generic GloVe Twitter embeddings (Pennington et al., 2014) and the generic FastText embeddings (Bojanowski et al., 2017).

In the previous section we explained that each data set was collected and labeled following a different strategy, and thus focuses on a slightly different aspect of abusive language. What does connect the three data sets is the domain, namely tweet messages. To test whether domain plays an important role in the performance of our polarized word embeddings, we also train our simple models on the polarized word embeddings created from Reddit, 4chan and 8chan, on the assumption that perhaps the 8chan embeddings better capture racism, while the polarized Twitter embeddings could be better at detecting online hate. In addition, we train our simple models on the generic embeddings from Reddit and 4chan, to test whether there is a difference in performance between our polarized and generic representations. This way, we can determine whether collecting controversial, polarized data for source-driven representations does result in better performance on downstream tasks.

4.1.2 Subword Information for OOV-Words

In contrast with ignoring out-of-vocabulary words, i.e. words not covered by the embeddings, we also employ an experimental model in which we look for selected subwords when a word is not found in the embedding vocabulary. For example, suppose the word should is not found in one of the embeddings; the model will then do a look-up for the first and last character tri-grams sho and uld. The sum of these vectors then represents the meaning of the unfound word.
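A minimal sketch of this fallback is shown below; the lookup function is our own illustration, assuming a dictionary-like vector store that also contains character tri-gram entries.

```python
import numpy as np

def lookup(word, vectors, dim):
    """Return the word's vector, or the sum of its first and last character
    tri-gram vectors if the word itself is out of vocabulary."""
    if word in vectors:
        return vectors[word]
    first, last = word[:3], word[-3:]          # e.g. "should" -> "sho", "uld"
    parts = [vectors[g] for g in (first, last) if g in vectors]
    return np.sum(parts, axis=0) if parts else np.zeros(dim)
```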

4.1.3 Baseline Model

In order to compare the performance of our word embeddings, we use a baseline model. This model is an n-gram model, specifically using a CountVectorizer from the scikit-learn library (Pedregosa et al., 2011) with 2-4 word and 5-7 character n-grams.

4.2 Building a Better Architecture

As shown in Table 12 and Table 13 in Chapter 5, the embeddings created with the FastText algorithm outperform the GloVe representations. We therefore omit the latter, and continue to build a stronger architecture using the FastText word embeddings instead. Here, our goal is to boost performance and find out to what extent we can outperform our previous models.

We introduce our bidirectional LSTM model, adopted from Zhang et al. (2019). LSTM models are recurrent neural networks (RNNs) which handle input sequentially and can therefore take word order into account. We combine this with a bidirectional setup, which allows us to process the tweets both forwards and backwards before passing on to the next layer. For each word in the tweets, the LSTM model combines its previous hidden state and the current word's embedding to compute a new hidden state. After using dropout to deactivate a percentage of the model's neurons, we feed the information to the attention mechanism. This mechanism emphasises the most informative words in the document and gives these more weight.

Our final model uses 512 units in the hidden layer of the BiLSTM, a batch size of 64, the Adam optimizer in combination with the default learning rate of 0.001 and a dropout of 0.4. We trained our model for 50 epochs, saving the model with the lowest validation loss (Zhang et al., 2019).
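A Keras sketch of this architecture is shown below. The hyperparameters follow the values above, but the attention layer is a generic additive formulation and not necessarily identical to the implementation of Zhang et al. (2019); the vocabulary size, sequence length and embedding matrix are placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_bilstm(vocab_size, emb_dim, emb_matrix, max_len):
    """BiLSTM with a simple additive attention layer, roughly following Section 4.2."""
    inputs = layers.Input(shape=(max_len,))
    x = layers.Embedding(vocab_size, emb_dim, weights=[emb_matrix], trainable=False)(inputs)
    x = layers.Bidirectional(layers.LSTM(512, return_sequences=True))(x)
    x = layers.Dropout(0.4)(x)

    # Additive attention: score each time step, normalise, take the weighted sum.
    scores = layers.Dense(1, activation="tanh")(x)
    weights = layers.Softmax(axis=1)(scores)
    context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, weights])

    outputs = layers.Dense(1, activation="sigmoid")(context)
    model = Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# model = build_bilstm(vocab_size=30000, emb_dim=300, emb_matrix=emb_matrix, max_len=50)
# model.fit(X_train, y_train, batch_size=64, epochs=50, validation_data=(X_dev, y_dev),
#           callbacks=[tf.keras.callbacks.ModelCheckpoint("best.h5", save_best_only=True)])
```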

4.2.1 Stacking the Embeddings

In order to boost our stronger architecture even more, we adopt an approach by Akbik et al. (2018), who present experiments in which they compare embeddings and their stacked combinations in downstream tasks. They concatenate pre-trained GloVe embeddings with task-trained (i.e. not pre-trained) character embeddings, allowing them to report new state-of-the-art F1-scores on the CoNLL03 shared task in named entity recognition (NER).

We rely on this principle by stacking our generic embeddings with the polarized variants, in order to find out whether adding polarized data leads to better performance in abusive language detection. In this case, the final word representation is given by the concatenation

w_i = [ w_i^polarized ; w_i^generic ]

where ";" denotes stacking the two vectors on top of each other.

Here, w_i^generic is either the FastText Wiki News 300d embedding or one of the task-trained generic Reddit and 4chan embeddings, and w_i^polarized is one of the task-trained polarized representations.
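The stacked lookup can be sketched per word as follows; how out-of-vocabulary words are handled (zero-padding here) is an assumption for illustration.

```python
import numpy as np

def stacked_vector(word, polarized, generic, dim_polarized=300, dim_generic=300):
    """w_i = [w_i^polarized ; w_i^generic]: concatenate both vectors, zero-padding
    whichever side does not cover the word."""
    w_pol = polarized[word] if word in polarized else np.zeros(dim_polarized)
    w_gen = generic[word] if word in generic else np.zeros(dim_generic)
    return np.concatenate([w_pol, w_gen])
```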

4.3 Across Data Sets with Polarized Embeddings

Once we have found our best model, we want to assess how well models trained on a particular data set of abusive language perform on a different data set, just like Karan and Šnajder (2018) did on nine different data sets. We investigate to what extent the models trained to detect a specific phenomenon of abusive language generalise between different data sets labeled with different abusive language types. Karan and Šnajder (2018) state that the difference in performance can be traced back to two factors: (1) the difference in the types of abusive language that the data set was labeled with and (2) the differences in data set sizes.

Within this experiment, we train a BiLSTM model on the training set (and, if available, the development set) of data set X, and then use the model to label the test set of data set Y. Here, we apply our polarized word embedding representations as a way to improve the portability of trained models across data sets labelled with different abusive language phenomena.

4.4 Intrinsic Evaluation of Polarized Embeddings

In order to evaluate the "goodness" of our polarized word representations, we should not only look at their contribution to downstream tasks, i.e. detecting abusive language. Schnabel et al. (2015) introduced a comparative intrinsic evaluation for word embeddings. They trained several models on the same corpus, and polled for the nearest neighbours of words from a test set. For each word, human raters chose the most “similar” answer, and the model that got the most votes was deemed the best. We rely on this idea by comparing the ten nearest neighbours of twenty hand-picked keywords, which are:

("woman", "homosexual", "black", "gay", "man", "immigrant", "immigrants", "mi-grant", "migrants", "trans", "gun", "afroamerican", "feminism", "feminist", "abortion", "religion", "god", "trump", "islam", "muslim")

We inspect a few of these keywords within this section. The full list of keywords and their k nearest neighbours can be found in Section 7.2 in the Appendix chapter. As can be seen in Table 9, there are some interesting dissimilarities between generic and polarized word representations. For example, the keyword gay has more sexual orientation-like neighbours in the generic embeddings, while the polarized embeddings show a more pornographic connection. Also, looking at the word gun, the generic embeddings list synonyms, while our representations lean towards topics like gun sales and gun control. Last, in our polarized Twitter embeddings, the term religion has some neighbours which seem to question the existence of religion, such as endreligion, noreligion, taxreligion, stopreligion, creedreligion and fakereligion.
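The probe itself is straightforward; a minimal sketch, assuming the embeddings are available in word2vec text format and loaded with gensim (the file name is hypothetical):

# Sketch of the nearest-neighbour probe for a few of the hand-picked keywords.
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("twitter_polarized_fasttext.vec")  # hypothetical path

for keyword in ["gay", "gun", "religion"]:
    if keyword in kv:
        # ten nearest neighbours, ranked by cosine similarity
        for neighbour, score in kv.most_similar(keyword, topn=10):
            print(keyword, neighbour, round(score, 4))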

As can be derived from the results in Table 9, embeddings created with polarized data contain different information than generic ones, which for the most part simply represent the English language. Based on this intrinsic evaluation, it therefore seems that polarized embeddings indeed contain more relevant information, and thus probably perform better at recognising hateful and offensive words. In the next chapter we show our results on downstream tasks and discuss whether our polarized embeddings, based on extrinsic evaluation, prove superior to generic ones.


keyword    FastText generic             FastText polarized (Twitter)

gay        homosexual      0.8233       gay[1]          0.7827
           lesbian         0.7950       gayjizz         0.7732
           heterosexual    0.7645       gaygay          0.7669
           queer           0.7477       gay[2]          0.7582
           non-gay         0.7415       gay3sum         0.7580
           gays            0.7365       gayporn         0.7544
           bisexual        0.7213       boygay          0.7542
           LGBT            0.7188       gaynude         0.7527
           Gay             0.7100       gayebony        0.7509
           homophobic      0.7055       gaytube         0.7491

gun        guns            0.8100       gunz            0.6913
           handgun         0.7543       re:gun          0.6749
           firearm         0.7443       gun/guns        0.6681
           pistol          0.7144       gunz!           0.6669
           sub-machine     0.7045       ogun            0.6585
           firearms        0.6988       guns’[3]        0.6473
           Gun             0.6907       guns=           0.6468
           submachine      0.6835       gunsales        0.6279
           weapon          0.6731       guncontrol      0.6272
           rifle           0.6660       guncontrol”     0.6226

religion   religions       0.7804       [4]religion     0.9079
           Religion        0.7552       [5]religion     0.8623
           relgion         0.7289       endreligion     0.8517
           religon         0.7224       noreligion      0.8354
           non-religion    0.7131       taxreligion     0.8318
           religious       0.7100       religion[4]     0.8267
           spirituality    0.7070       stopreligion    0.8211
           Christianity    0.7061       religion[6]     0.8159
           faiths          0.6834       creedreligion   0.8124
           Islam           0.6719       fakereligion    0.8062

Table 9: The k nearest neighbours of gay, gun and religion within the generic FastText Wiki News embeddings and our polarized FastText embeddings, created with Twitter data. The top ten neighbours are listed with their cosine similarity score, in descending order. The placeholders represent the following emoji: [1]: rainbow, [2]: two men holding hands, [3]: flushed face, [4]: islam symbol, [5]: warning symbol, [6]: big red cross.


algorithm   embedding            HE 1    HE 2    OE 1    OE 2    WH      vocab
GloVe       GloVe generic        78.3    83.4    81.1    89.0    86.0    1,200,000
            Twitter polarized    84.4    89.0    86.1    91.3    88.1    1,242,949
            Reddit generic       76.4    83.3    80.2    85.9    84.3    3,462,839
            Reddit polarized     72.5    80.3    75.8    83.4    80.4    2,242,769
            4chan generic        75.8    82.0    78.3    84.9    84.3    4,420,417
            4chan polarized      80.1    85.0    82.3    87.1    86.3    3,437,450
            8chan polarized      14.2    26.3    13.4    34.4    17.7    673,583
FastText    FastText generic     76.7    81.8    79.9    85.6    85.0    1,000,000
            Twitter polarized    78.1    84.9    80.1    87.7    82.4    283,360
            Reddit generic       75.8    82.8    79.9    85.6    84.0    295,028
            Reddit polarized     74.7    81.8    78.6    84.9    82.9    219,673
            4chan generic        75.3    81.6    78.1    84.7    84.1    350,680
            4chan polarized      75.8    82.0    78.6    84.8    83.2    268,583

Table 10: Coverage of data set vocabulary per word embedding (percentage of data set words found in the embedding) and the vocabulary size of that word embedding. Best coverage per data set is in bold. Due to the lack of horizontal space, data sets are abbreviated to [HE 1]: HatEval train, [HE 2]: HatEval test, [OE 1]: OffensEval train, [OE 2]: OffensEval test, [WH]: WaseemHovy.

Table 10 shows the coverage of each word embedding, i.e. how many words of each data set vocabulary exist in the word embeddings. As can be observed, the polarized GloVe embeddings, with the sole exception of the 8chan representations, have a higher coverage than the polarized FastText embeddings. This is due to the lower minimum frequency that was used when generating them. In addition, the polarized Twitter embeddings have the best coverage for both algorithms. This is due to the source of these embeddings, which is the same as the source of the data sets, namely tweets. Elements such as mentions, hashtags and typical Twitter slang can be found in both the embeddings and the data sets, resulting in a higher coverage.
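For reference, a minimal sketch of how this coverage statistic can be computed; the whitespace tokenisation and the way the embedding vocabulary is obtained are assumptions made for illustration.

# Coverage: the share of data set vocabulary items present in the embedding vocabulary.
def coverage(dataset_vocab, embedding_vocab):
    found = sum(1 for word in dataset_vocab if word in embedding_vocab)
    return 100.0 * found / len(dataset_vocab)

# hypothetical usage: build the data set vocabulary from whitespace-tokenised messages
# dataset_vocab = {token for text in hateval_train_texts for token in text.split()}
# print(coverage(dataset_vocab, embedding_vocab))  # embedding_vocab: set of embedding keys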

Interestingly, our polarized Twitter embeddings created with FastText have better coverage on the HatEval and OffensEval data sets than the generic FastText embeddings, while the vocabulary of our embeddings is almost four times smaller. This is a clear example of the usability of such polarized representations.

Another thing worth mentioning is the difference between the generic and polarized embeddings for Reddit and 4chan under both algorithms. The generic Reddit embeddings have a higher coverage than the polarized variants of that domain, for both GloVe and FastText. This is due to the size of their vocabulary, which is much bigger. However, this explanation does not apply to the 4chan embeddings. Using the GloVe algorithm, the polarized 4chan embeddings outperform the generic variants, even though their vocabulary has roughly 1 million fewer tokens. This effect also applies to the FastText representations, with the exception of the WaseemHovy data set. On the HatEval and OffensEval data sets, the polarized 4chan embeddings have slightly better coverage despite their smaller vocabulary. This indicates the degree of polarization and thus the strength of our polarized word embeddings.

Whether a high coverage correlates with better performance on downstream tasks is examined in the next chapter, where we show and discuss the results of the extrinsic evaluation of our polarized word embeddings.


5 Results and Discussion

In this chapter we show and discuss the results of the methods and experiments described in Chapter 4. The results are part of the extrinsic evaluation of our word embeddings. We can therefore use this chapter to answer our subquestion 1.3: Do the sources used to generate polarized embeddings influence performance? As mentioned in Chapter 3, we collapsed all the labels into two macro-categories: abusive (ABU) vs. non-abusive (NOT).

5.1 Results

5.1.1 Best Algorithm for Polarized Embeddings

Table 11 shows the results of our baseline system, where we use an SVM model with a CountVectorizer for 2-4 word and 5-7 character n-grams. Note that the scores for WaseemHovy in Table 11 are based on 10-fold cross-validation, following the original experiment setting. The same table also contains the scores of the pre-trained generic GloVe and FastText embeddings on all three data sets. The results in this table confirm that n-gram based models provide a very robust baseline, and that, at the same time, generic word embeddings, even when from the same domain as the training data, fail to detect both abusive and non-abusive messages.

We also trained the SVM models with our generic Reddit and 4chan embeddings, generated with the GloVe and FastText algorithms. Table 11 shows the results when applying these embeddings. Here, one can already observe the better performance of the FastText algorithm and the robustness of the generic Reddit embeddings. The latter is probably due to the larger coverage of these embeddings, as discussed in Section 4.4. Better coverage means fewer OOV words, which can lead to better results.

We then re-trained our SVM models using the newly generated polarized embedding representations. Table 12 lists the scores of our self-created polarized word embeddings generated with GloVe, while Table 13 shows the results of these source-driven embeddings generated with the FastText algorithm. As shown in these tables, the embeddings generated with the FastText algorithm prove superior to the GloVe word embeddings. We therefore omitted the latter and continued building a stronger architecture using the FastText word embeddings instead.

What can also be derived from Tables 11 and 13 is that the polarized Twitter embeddings used in an SVM model yield the same, or worse, macro F1-scores with respect to the models using pre-trained generic word embeddings. The polarized Reddit and 4chan embeddings, however, do seem to perform better than the generic variants from the same domain when applied to the HatEval and WaseemHovy data sets. Here, we also see the same robustness of the polarized Reddit embeddings as we observed with the generic Reddit embeddings, which confirms the strength of this domain.

To improve the performance of the polarized FastText embeddings, we did a subword look-up for OOV words, i.e. words which do not occur in the embedding vocabulary. We expected an improvement of the results, as more information per sentence would be captured. However, as can be seen in Table 14, where we list the results of the subword look-up for the polarized Twitter embeddings, the macro F1-scores decreased or remained the same; we therefore did not use this method in further experiments.
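A sketch of such a look-up, assuming the full FastText binary model is loaded with gensim (the file name is hypothetical and gensim 4-style attributes are assumed):

# With the full FastText model loaded, out-of-vocabulary words still receive a
# vector composed from their character n-grams.
from gensim.models.fasttext import load_facebook_model

ft = load_facebook_model("twitter_polarized.bin")   # hypothetical path to the full model
wv = ft.wv

word = "gunzzz"                      # made-up out-of-vocabulary spelling
print(word in wv.key_to_index)       # False: not in the trained vocabulary
print(wv[word][:5])                  # still defined, built from subword n-gram vectors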


Model                          Data set     Class   P     R     F1 (macro)
SVM n-grams                    HatEval      NOT    .82   .18   .46
                                            ABU    .46   .94
                               OffensEval   NOT    .83   .83   .70
                                            ABU    .56   .55
                               WaseemHovy   NOT    .84   .81   .74
                                            ABU    .62   .68
SVM GloVe generic              HatEval      NOT    .72   .18   .45
                                            ABU    .45   .90
                               OffensEval   NOT    .83   .71   .64
                                            ABU    .45   .62
                               WaseemHovy   NOT    .82   .79   .70
                                            ABU    .58   .63
SVM GloVe reddit_generic       HatEval      NOT    .62   .24   .45
                                            ABU    .43   .80
                               OffensEval   NOT    .79   .73   .61
                                            ABU    .38   .54
                               WaseemHovy   NOT    .80   .75   .66
                                            ABU    .53   .61
SVM GloVe 4chan_generic        HatEval      NOT    .66   .16   .42
                                            ABU    .44   .89
                               OffensEval   NOT    .79   .77   .62
                                            ABU    .45   .47
                               WaseemHovy   NOT    .80   .77   .68
                                            ABU    .55   .60
SVM FastText generic           HatEval      NOT    .74   .38   .56
                                            ABU    .49   .82
                               OffensEval   NOT    .83   .80   .69
                                            ABU    .53   .57
                               WaseemHovy   NOT    .83   .81   .72
                                            ABU    .61   .64
SVM FastText reddit_generic    HatEval      NOT    .73   .32   .52
                                            ABU    .47   .84
                               OffensEval   NOT    .84   .78   .68
                                            ABU    .52   .61
                               WaseemHovy   NOT    .85   .75   .71
                                            ABU    .58   .71
SVM FastText 4chan_generic     HatEval      NOT    .76   .29   .51
                                            ABU    .47   .88
                               OffensEval   NOT    .81   .77   .65
                                            ABU    .48   .54
                               WaseemHovy   NOT    .85   .74   .71
                                            ABU    .57   .72

Table 11: SVM model results: n-grams vs. generic GloVe embeddings vs. generic FastText embeddings. Best scores, based on macro F1, are in bold.


5.1.2 Building a Better Architecture

Within this subsection we list the results of our bidirectional LSTM neural network on the three data sets, using the pre-trained generic FastText embeddings and our self-created generic and polarized embedding representations as its input. As was shown in the previous subsection, our polarized Twitter embeddings seemed to fail when used with a simple SVM model. However, when using our polarized Reddit and 4chan embeddings in those models, we obtained better scores on the HatEval and WaseemHovy data sets.


Model                            Data set     Class   P     R     F1 (macro)
SVM GloVe twitter_polarized      HatEval      NOT    .68   .24   .47
                                              ABU    .45   .84
                                 OffensEval   NOT    .79   .65   .58
                                              ABU    .38   .54
                                 WaseemHovy   NOT    .83   .70   .68
                                              ABU    .52   .69
SVM GloVe reddit_polarized       HatEval      NOT    .61   .27   .47
                                              ABU    .43   .76
                                 OffensEval   NOT    .81   .71   .63
                                              ABU    .43   .57
                                 WaseemHovy   NOT    .81   .76   .67
                                              ABU    .54   .60
SVM GloVe 4chan_polarized        HatEval      NOT    .62   .17   .42
                                              ABU    .43   .86
                                 OffensEval   NOT    .79   .75   .61
                                              ABU    .42   .47
                                 WaseemHovy   NOT    .81   .72   .67
                                              ABU    .52   .64
SVM GloVe 8chan_polarized        HatEval      NOT    .65   .19   .44
                                              ABU    .44   .86
                                 OffensEval   NOT    .78   .67   .58
                                              ABU    .37   .51
                                 WaseemHovy   NOT    .79   .72   .65
                                              ABU    .49   .59

Table 12: SVM model results: polarized GloVe embeddings, generated from Twitter, Reddit, 4chan and 8chan. Best scores, based on macro F1, are in bold.

Model                             Data set     Class   P     R     F1 (macro)
SVM FastText twitter_polarized    HatEval      NOT    .75   .38   .56
                                               ABU    .49   .83
                                  OffensEval   NOT    .82   .73   .64
                                               ABU    .45   .57
                                  WaseemHovy   NOT    .83   .77   .70
                                               ABU    .57   .66
SVM FastText reddit_polarized     HatEval      NOT    .76   .38   .56
                                               ABU    .49   .84
                                  OffensEval   NOT    .82   .78   .66
                                               ABU    .49   .56
                                  WaseemHovy   NOT    .84   .78   .72
                                               ABU    .59   .68
SVM FastText 4chan_polarized      HatEval      NOT    .77   .31   .54
                                               ABU    .48   .87
                                  OffensEval   NOT    .82   .74   .65
                                               ABU    .47   .59
                                  WaseemHovy   NOT    .84   .77   .72
                                               ABU    .58   .68

Table 13: SVM model results: polarized FastText embeddings, generated from Twitter, Reddit and 4chan. Best scores, based on macro F1, are in bold. Here, we can clearly observe that the Reddit embeddings outperform the other two domains.

Things are even more interesting when we apply our embeddings in a more complex network. As is shown in Table 15, the BiLSTM models using the polarized Twitter and Reddit embeddings actually perform better than the generic embeddings of that domain, on all three data sets. The polarized 4chan embeddings, however, perform worse than the
