Politically Oriented Embeddings for Abusive Language Detection

Roy David (S2764989)
Master thesis Information Science
August 27, 2019


ABSTRACT

This thesis focuses on the use of politically biased embeddings for abusive language detection. We created polarized embeddings based on politically oriented messages from social media to compete with more generic embeddings such as GloVe trained on Twitter data and fastText trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset.

The challenges for this thesis, for the most part, have to do with limitations on data collection from social media. Facebook can no longer be scraped, scraping Twitter for free is limited and gab.ai cannot be scraped for free. Because of these limitations the amount of available data is restricted to an English Twitter corpus from 2016 provided by the University of Groningen1. The data limitation also resulted in a size limitation for the creation of our polarized embeddings (in terms of vocabulary and tokens) compared to the generic embeddings. Therefore we needed methods to maximize the performance of our polarized embeddings.

To maximize the amount of data that was available for this task, we tried two methods for data collection: collecting data based on user meta-data and collecting data based on handpicked keywords. To maximize the performance of our polarized embeddings we evaluated two embedding libraries, GloVe and fastText, on three in-domain data sets: HatEval (Basile et al., 2019), OffensEval (Zampieri et al., 2019a) and Waseem & Hovy (Waseem and Hovy, 2016), and one out-of-domain data set: StackOverflow. Furthermore we experimented with the subword information available within the fastText library to try and maximize the performance. To maximize the use of our embeddings we experimented with two classification models: a simple robust model consisting of a Support Vector Machine (SVM) and a deep-learning model consisting of a Bidirectional Long Short Term Memory (BiLSTM) classifier. We also tried improving our best model by stacking the generic and polarized embeddings or increasing the epochs and patience for the BiLSTM. This allows us to compare our best model to the current state-of-the-art in abusive language detection, comparing our performance on both the HatEval and OffensEval data sets to the performance of the teams participating in SemEval 2019 Task 5 & Task 6 (Basile et al., 2019; Zampieri et al., 2019b).

In the end our politically polarized embeddings created with the fastText library proved to be reliable, outperforming the pre-trained generic embeddings overall and showing better portability across abusive data sets. The source code of this thesis can be obtained from https://github.com/roydavid957/ma-thesis.

1 http://bit.ly/2XkU8ai


CONTENTS

Abstract
Preface
1 introduction
2 background
  2.1 Abusive Language Detection in General
  2.2 Word Embeddings for Abusive Language
  2.3 Linear vs. Neural Models for Abusive Language Detection Tasks
  2.4 Key Findings
3 data and material
  3.1 Collection
  3.2 Annotation
  3.3 Processing
4 method
  4.1 Contribution of Embeddings
    4.1.1 Pre-trained generic embeddings vs. n-gram-based models
    4.1.2 Polarized embeddings vs. pre-trained generic embeddings
  4.2 Contribution of the Models
    4.2.1 Simple linear model vs. deep-learning model
    4.2.2 Improving the models
5 results and discussion
  5.1 Results
    5.1.1 Same-distribution experiments
    5.1.2 Cross-data experiments
    5.1.3 Improved model experiments
  5.2 Discussion
6 conclusion
  6.1 Limitations
  6.2 Future Work
7 appendices
  7.1 Top 10 Nearest Neighbors: Polarized & Generic Embeddings
    7.1.1 Top 10 nearest neighbors: polarized fastText embeddings
    7.1.2 Top 10 nearest neighbors: pre-trained generic fastText embeddings
    7.1.3 Top 10 nearest neighbors: generic fastText embeddings with subword information
    7.1.4 Top 10 nearest neighbors: pre-trained generic GloVe embeddings
  7.2 Top 10 Nearest Neighbors: Alternate Versions of the Polarized Embeddings
    7.2.1 Top 10 nearest neighbors: polarized fastText embeddings, shuffled version
    7.2.2 Top 10 nearest neighbors: polarized fastText embeddings, randomly initialized version 2
    7.2.3 Top 10 nearest neighbors: polarized fastText embeddings, randomly initialized version 3


PREFACE

Since the master thesis is my final course before completing my master's in Information Science at the University of Groningen, I want to dedicate this preface to all the staff and teachers of the University of Groningen, who have helped me complete my master's degree over the past year. In particular I would like to thank Tommaso Caselli for his help and guidance during my master thesis. I also want to thank my fellow students Leon Graumans, Inga Kartoziya and Balint Hompot for their input and help where needed. I would like to thank the Center for Information Technology of the University of Groningen for their support and for providing access to the Peregrine high performance computing cluster.

1 INTRODUCTION

The increasing popularity of social media platforms such as Twitter for both personal and political communication has seen a well-acknowledged rise in the presence of toxic and abusive speech on these platforms (Kshirsagar et al., 2018). Although the terms of service on these platforms typically forbid hateful and harassing speech, the volume of data requires that ways are found to classify online content automatically (Merenda et al., 2018b; Kennedy et al., 2017). The problem of detecting, and therefore possibly limiting, the diffusion of hate speech is becoming fundamental (Nobata et al., 2016). This rise of interest has resulted in data sets in multiple languages1 and in shared evaluation exercises, such as the GermEval 2018 Shared Task (Wiegand et al., 2018), the SemEval 2019 Task 5: HatEval2 and Task 6: OffensEval3, and the EVALITA 2018 Hate Speech Detection task (Bosco et al., 2018). This has led to different terminologies, perspectives, and understandings of the phenomenon (Kumar et al., 2018; Waseem et al., 2017). Because of this, there is still a lack of focus on the re-usability of data sets, as well as on shared definitions. Despite the lack of consensus on what abusive language currently is, we follow Founta et al. (2018) and consider abusive language as "[a]ny strongly impolite, rude or hurtful language using profanity, that can show a debasement of someone or something, or show intense emotion." (Founta et al., 2018, 495).4

This thesis aims at generating biased, polarized word embeddings for abusive language detection, trained on politically oriented messages and comments collected from social media. The idea behind using politically oriented data for the embeddings is that discussing politics often leads to controversy, where people's opinions and emotions can lead to heated discussions or even abusive language. The goal of this study is to find out whether polarized embeddings trained on politically oriented messages and/or comments collected from social media can compete with more generic embeddings like GloVe's Twitter-trained embeddings (Pennington et al., 2014) and fastText's embeddings trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset (Bojanowski et al., 2016; Mikolov et al., 2018) in abusive language detection. One of the major downsides of using these pre-trained embeddings is that the data used for training them is often very different from the data on which they may be applied (Ruder, 2017). This study will focus on English data, since most of the currently obtainable abusive data are in English. This thesis is based on a main research question and a sub-research question inherent to it:

1. Can politically oriented messages and comments collected from social media be used to generate reliable word embeddings for abusive language detection tasks?

(a) What is/are the best method(s) to derive these embeddings?

The hypothesis is that the polarized embeddings, although smaller than the generic embeddings, are more compatible with the abusive phenomena inside the abusive data, and therefore score higher on at least the abusive parts of the data. To do this we need to move to biased data with a controversial/heated perspective and find such data to create models. Strictly related to the research questions is therefore the problem of how much data can be collected from social media. This is a (potential) problem since Facebook can no longer be scraped, scraping Twitter is possible but restricted, scraping gab.ai is not available for free, and this thesis specifically targets comments from politically oriented sources on social media.

1 http://bit.ly/2RZUlKH
2 http://bit.ly/2EEC7Me
3 http://bit.ly/2P7pTQ9
4 "Twitter-based Polarised Embeddings for Abusive Language Detection", an ACII 2019 submission, by Tommaso Caselli, Leon Graumans and Roy David.

In this work, we present a methodology for generating biased word embedding representations from social media, based on politically oriented messages, as a way to improve the portability of trained models across data sets labelled with different abusive language phenomena. We thus compare and investigate the performance of the polarized embeddings against pre-trained generic embeddings across three related in-domain data sets and one out-of-domain data set for abusive language detection. Furthermore we explore the in-data and cross-data use of a linear and a deep-learning model, two embedding libraries and several methods for data collection.

2 BACKGROUND

In this chapter we discuss some background regarding abusive language detection, covering recent related shared evaluation tasks, previous work and the state-of-the-art. First we go over recent and previous work regarding abusive language, discussing earlier and more recent findings and the problems within automatic abusive language detection. We then discuss the contribution of word embeddings to automatic abusive language detection, but also the problems regarding the reliability and stability of these word embeddings. After that we discuss the use of more robust linear models vs. the highly expressive deep neural models. In the end we list some key findings, based on this chapter, with respect to this thesis.

2.1 Abusive Language Detection in General

Previous work concerning hate speech against immigrants and women, such as Olteanu et al. (2018), observed that extremist violence tends to lead to an increase in online hate speech, particularly in messages directly advocating violence. Anzovino et al. (2018) also contributed to the research field by (1) making a corpus of misogynous tweets, labelled from different perspectives, and (2) conducting an exploratory investigation of NLP features and ML models for detecting and classifying misogynistic language. Earlier, Waseem and Hovy (2016) also contributed by providing a data set of 16k tweets annotated for hate speech and a list of eleven rules which they used to classify whether a tweet is hate speech or not. In their research they found that geographic and word-length features, despite presumed differences in distribution, have little to no positive effect on performance and rarely improve over character-level features, the only exception being gender.

Furthermore, the survey paper by Schmidt and Wiegand (2017) gives a summary of findings in previous papers relating to hate speech detection. Hate speech and sentiment analysis are closely related, and therefore lexical resources containing polarity information are commonly used for hate speech recognition. Also, determining whether a message is hateful or benign can be highly dependent on world knowledge. According to Xu et al. (2012), the use of linguistic features like POS information does not seem to significantly improve classification performance. According to Nobata et al. (2016), combining n-gram features with a large selection of other features does improve performance. Mehdad and Tetreault (2016) found that character n-grams proved to be more predictive than token n-grams, confirming Waseem and Hovy (2016). Word generalization and word embeddings are used to tackle the data sparsity problem. These techniques represent a word or a set of words as a cluster or vector. Such vectors can be used as classification features, instead of binary features which indicate the presence or frequency of particular words (Schmidt and Wiegand, 2017).

However, Davidson et al. (2017) state that a key challenge for automatic hate-speech detection on social media is the separation of hate speech from other instances of offensive language. Lexical detection methods tend to have low precision because they classify all messages containing particular terms as hate speech, and previous work using supervised learning has failed to distinguish between the two categories. They used a crowd-sourced hate speech lexicon to collect tweets containing hate speech keywords and used crowd-sourcing to label a sample of these tweets into three categories: those containing hate speech, those containing only offensive language, and those with neither. They train a multi-class classifier to distinguish between these different categories. They find that racist and homophobic tweets are more likely to be classified as hate speech but that sexist tweets are generally classified as offensive. Tweets without explicit hate keywords are also more difficult to classify. Given the similarity of hate speech to offensive speech, Malmasi and Zampieri (2017) examined methods to detect hate speech in social media. They used a data set containing 14,509 English tweets that were labelled by at least 3 annotators. The tweets were categorized into one of the following three classes: Hate, Offensive and Ok. They used a linear Support Vector Machine and three groups of features to train their system: surface n-grams, which consisted of both word n-grams and character n-grams; word skip-grams, which were used to collect word bigrams to check for longer-distance dependencies; and Brown clusters to put words that are semantically related or appear often together in the same context into clusters. Their results showed hate speech and offensive speech to be similar, and therefore their system had the most trouble differentiating them.

2.2 Word Embeddings for Abusive Language

Word embeddings have been widely used in modern Natural Language Processing applications as they provide vector representations of words. They capture the semantic properties of words and the linguistic relationships between them. These word embeddings have improved the performance of many downstream tasks across many domains (Camacho-Collados and Pilehvar, 2018). Multiple ways of generating word embeddings exist, such as the Neural Probabilistic Language Model (Bengio et al., 2003), Word2Vec (Mikolov et al., 2013), GloVe (Pennington et al., 2014), and more recently fastText (Bojanowski et al., 2016) and ELMo (Peters et al., 2018).

Recent work regarding the use of word embeddings for hate speech detection, such as Kshirsagar et al. (2018), suggests that they are an important feature for hate speech detection. They present a neural classification system that uses minimal preprocessing. Their approach relies on a simple word embedding (SWEM)-based architecture. Their results indicate that the majority of recent deep learning models in hate speech detection may rely on word embeddings for the bulk of their predictive power, and that the addition of sequence-based parameters provides minimal utility. Furthermore on word embeddings, Merenda et al. (2018a) took a special take on distant supervision for collecting data to generate reliable (polarized) word embeddings. Instead of using the content on online social media itself, they use the sources where the content is published as proxies for labeled data. Their approach was based on findings in previous studies on online communities (Pariser, 2011; Seargeant and Tagg, 2018). In short, these studies show that communities tend to draw towards each other, enhancing filter bubble effects (decreasing diversity, distorting information, and polarizing socio-political opinions). Therefore Merenda et al. (2018a) concluded that "each community in the social media sphere represents a somewhat different source of data". Based on these findings it is interesting to see the use of sources to generate reliable word embeddings for specific tasks involving polarized language (e.g. hate speech detection, toxicity classification, offensive language detection).

According to Indurthi et al. (2019), word embeddings rely on the distributional linguistic hypothesis. They differ in the way they capture the meaning of words or the way they are trained. Each word embedding captures a different set of semantic attributes which may or may not be captured by other word embeddings. In general, it is difficult to predict the relative performance of these word embeddings on downstream tasks. The choice of which word embeddings should be used for a given downstream task depends on experimentation and evaluation. While word embeddings can produce representations for words which capture the linguistic properties and the semantics of the words, the idea of representing sentences as vectors is an important and open research problem (Conneau et al., 2017). Ranking first on the evaluation test set of the recent SemEval 2019 Task 5 (HatEval) subtask A for English, Indurthi et al. (2019) evaluate the quality of multiple sentence embeddings and explore multiple training models to evaluate the performance of simple yet effective embedding-ML combination algorithms. Their models use pretrained Universal Sentence Encoder embeddings for transforming the input and an SVM (with RBF kernel) for classification, making sentence embeddings in combination with simple embedding-ML algorithms a valuable option for abusive language detection tasks.

Although word embeddings are increasingly being used as a tool to study word associations in specific corpora, it is unclear whether such embeddings reflect enduring properties of language or if they are sensitive to inconsequential variations in the source documents (Antoniak and Mimno, 2018). Antoniak and Mimno (2018) find that nearest-neighbor distances are highly sensitive to small changes in the training corpus for a variety of algorithms. For all methods, including specific documents in the training set can result in substantial variations. They show that these effects are more prominent for smaller training corpora. They recommend that users never rely on single embedding models for distance calculations, but rather average over multiple bootstrap samples, especially for small corpora. Furthermore they find that there are several sources of variability in cosine similarities between word embedding vectors: the size of the corpus, the length of individual documents, and the presence or absence of specific documents can all affect the resulting embeddings. While differences in word association are measurable and often significant, small differences in cosine similarity are not reliable, especially for small corpora. Wendlandt et al. (2018) consider one aspect of embedding spaces, namely their stability. They show that even relatively high frequency words (100-200 occurrences) are often unstable, provide empirical evidence for how various factors contribute to the stability of word embeddings, and analyze the effects of stability on downstream tasks. Moreover, they define stability as the percent overlap between nearest neighbors in an embedding space. Following these works by Antoniak and Mimno (2018) and Wendlandt et al. (2018), in order to claim the reliability and stability of our soon to be created and discussed polarized embeddings, it is important to test the random initialization factor within the creation of the embeddings and its influence on nearest-neighbor distances.

2.3 Linear vs. Neural Models for Abusive Language Detection Tasks

Ranking first and fourth on the HatEval Spanish sub-tasks ES-A and ES-B respectively, Pérez and Luque (2019) found that linear models can be a match for neural models. Moreover, they believe that, for this kind of challenge with small-sized datasets, preprocessing techniques, data normalization and robustness play a stronger role than model design and hyperparameter tuning. Deep neural models, on the other hand, are highly expressive and prone to overfitting, which requires being extremely careful with regularization. They trained linear classifiers and Recurrent Neural Networks, using classic features such as words, bag-of-characters and word embeddings, and also recent techniques such as contextualized word representations. In particular, they trained robust task-oriented subword-aware embeddings and computed tweet representations using a weighted-averaging strategy. To represent tweets, they experimented with a mixed approach of bag-of-words, bag-of-characters and tweet embeddings, which were calculated from word vectors using different averaging schemes. They used fastText (Bojanowski et al., 2016) to get subword-aware representations specifically trained for sentiment analysis tasks.

On the other hand, Badjatiya et al. (2017) explored various tweet semantic embeddings like char n-grams, word Term Frequency Inverse Document Frequency (TF-IDF) values, Bag of Words Vectors (BoWV) over Global Vectors for Word Representation (GloVe) (Pennington et al., 2014), and task-specific embeddings learned using fastText (Joulin et al., 2016; Bojanowski et al., 2016), CNNs (Convolutional Neural Networks) (Kim, 2014) and LSTMs (Long Short-Term Memory Networks). They also explored various classification systems such as Logistic Regression, Random Forest, SVMs (Support Vector Machines), Gradient Boosted Decision Trees (GBDTs) and Deep Neural Networks (DNNs) such as CNN, LSTM and FastText. For each of the three DNN methods, they initialize the word embeddings with either random embeddings or GloVe embeddings. They leverage CNNs for hate speech detection, using the same settings for the CNN as described in Kim (2014). Recurrent neural networks like LSTMs can use their internal memory to process arbitrary sequences of inputs; hence, they use LSTMs to capture long range dependencies in tweets, which may play a role in hate speech detection. FastText allows updates of word vectors through back-propagation during training, as opposed to the static word representation in the BoWV model, allowing the model to learn task-specific word embeddings tuned towards the hate speech labels. They use 'adam' for CNN and LSTM, and 'RMS-Prop' for FastText as their optimizer. They perform training in batches of size 128 for CNN & LSTM and 64 for FastText. Their task was to classify a tweet as racist, sexist or neither. Their experiments were split into three parts. Part A: the baseline part using semantic embeddings. Part B: methods using only neural networks. Part C: using the average of word embeddings learned by DNNs as features for GBDTs. The methods used in part B outperformed part A, and part C outperformed part B. Therefore they claim that deep neural network architectures can significantly outperform the existing character and token n-gram based methods.

2.4 Key Findings

Some key findings from this chapter with regard to this thesis: a key challenge within abusive language detection is the separation of hate speech from other instances of offensive language; word embeddings improve the performance of many downstream tasks across many domains, however the initialization of embedding models is random and may therefore generate differences, hence we have to perform extra tests regarding reliability and stability; linear models can be a match for deep neural models, and linear models seem to be more robust.


3 DATA AND MATERIAL

3.1 Collection

For this thesis several data sets were needed: data sets for training and testing, and a data set for the creation of the polarized embeddings. We have used three data sets from the literature that cover different dimensions of abusive language to train and test our models. Different strategies have been used to collect and label the data. Two data sets explicitly address the category of hate speech, while the other addresses the category of offensive language. These data sets are all in the same language and domain, namely English tweet messages. We used an additional data set for out-of-domain experiments. Table 1 illustrates the properties of the data sets and the distributions of the labels in training, dev(elopment), and test.

        OffensEval        WH                HatEval           StackOverflow
        NOT      ABU      NOT      ABU      NOT      ABU      NOT      ABU
train   8,840    4,400    11,559   5,346    5,217    3,783    na.      na.
dev     na.      na.      na.      na.      427      573      na.      na.
test    620      240      na.      na.      1,719    1,252    na.      53,978
total   14,100            16,905            12,971            53,978

Table 1: Data distribution of each data set.

Since all of the training and most of the testing data sets contain English tweet messages, we stick to the same domain for the collection of the data set for the creation of the polarized embeddings. Unfortunately, scraping tweets for free using the Twitter API1 is restricted. Therefore the Twitter data set(s) provided by the University of Groningen2, available for RUG users, are used.

During the collection of the data set for the creation of the polarized embeddings we tried several different methods for retrieving politically oriented tweets. We tried using user meta-data by searching for national flags in users' full names as well as their bios. We also tried retrieving politically oriented tweets by searching for keywords based on politically oriented hashtags. The idea behind using user meta-data is that symbols used in user meta-data (full name, bio) can show or signal someone's identity, political opinion or affiliation (i.e. national flags might indicate nationalism or patriotism)3. Based on the idea that national flags can indicate nationalism or patriotism, and in order to stick with English tweets, a select group of national flags, where the native language is English, was chosen: Canada, United Kingdom and United States. An additional search to find politically controversial flags was done, which resulted in the addition of the rainbow flag. For the definition of what the rainbow flag represents we resort to Haldeman and Buhrke (2003), who define the rainbow flag as being "symbolic of the diverse contributions that lesbian, gay, or bisexual, and transgender people bring to our culture, and of the diversity within the domain of sexual orientation" (Haldeman and Buhrke, 2003, 145). With gay rights often being a controversial topic within politics (i.e. gay marriage), and the rainbow flag representing these communities, we decided to include this flag. The users' tweets would then be retrieved and these users would be used to find communities (Mishra et al., 2018) for the retrieval of more data. However, due to the lack of hits we had to drop retrieving tweets by the use of user meta-data. This could be because either in this particular data set users were not using these national flags in their full names or bios, or using national flags in full names or bios was not (yet) a trend in 2016.

Therefore, for the retrieval of the data for the creation of the polarized embeddings we resorted to the use of keywords based on politically oriented hashtags. These hashtags were selected by hand and based on the US elections in 2016. The available data set, consisting of English tweets from 2016, turned out to be a great fit for the purpose of collecting data for the creation of the politically polarized embeddings. Not only was this a US election year, meaning more political discussion on Twitter, this was also the year Donald Trump was elected, who himself is a controversial person with controversial political opinions. Table 2 shows the full list of keywords used as well as their frequency and percentage based on the occurrence of a keyword per tweet (i.e. the keyword #trump occurred in 1,856,243 tweets). This method of retrieving data resulted in a total of 3,716,462 tweets containing 93,381,338 tokens.

1 http://bit.ly/2MRaH9T
2 http://bit.ly/2XkU8ai
3 http://bit.ly/2WODfAC

keyword                  frequency    %
trump                    1,856,243    49.95
maga                     729,900      19.64
hillary                  634,169      17.06
trump2016                620,050      16.68
trumptrain               380,486      10.24
makeamericagreatagain    222,201      5.98
neverhillary             221,877      5.97
hillaryclinton           217,156      5.84
crookedhillary           177,130      4.77
donaldtrump              163,693      4.41
draintheswamp            121,007      3.26
votetrump                115,725      3.11
americafirst             102,248      2.75
hillary2016              62,142       1.67
pizzagate                59,688       1.61
realdonaldtrump          53,151       1.43
dumptrump                44,291       1.19
hillaryforprison         32,089       0.86
womenfortrump            17,617       0.47
buildthewall             16,615       0.45
trumpsarmy               13,658       0.37
womenwhovotetrump        13,678       0.37
women4trump              9,820        0.26
fucktrump                8,426        0.23
clintonemails            8,147        0.22
clintoncrimefamily       6,030        0.16
blacks4trump             5,378        0.15
trumplies                4,905        0.13
trumpstrong              4,071        0.11
christiansfortrump       301          0.01
trumpforprison           315          0.01
criminalclinton          179          0.01
total tweets             3,716,462

Table 2: Distribution of each keyword: frequency based on the occurrence of a keyword per tweet (i.e. the keyword #trump occurred in 1,856,243 tweets).
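To make the keyword-based collection step concrete, the sketch below filters a tweet dump down to messages containing any of the hand-picked hashtags. The file format (JSON lines with a text field) and the shown keyword subset are assumptions for illustration, not the exact pipeline used for this thesis.

```python
import json

# A small subset of the hand-picked keywords from Table 2 (assumption: the full
# list would be used in practice).
KEYWORDS = {"trump", "maga", "hillary", "trump2016", "makeamericagreatagain"}

def is_political(text: str) -> bool:
    """True if the tweet mentions any of the politically oriented keywords/hashtags."""
    tokens = {tok.lstrip("#").lower() for tok in text.split()}
    return bool(tokens & KEYWORDS)

def collect(path_in: str, path_out: str) -> None:
    """Filter a JSON-lines Twitter dump down to politically oriented tweets."""
    with open(path_in, encoding="utf-8") as fin, open(path_out, "w", encoding="utf-8") as fout:
        for line in fin:
            tweet = json.loads(line)
            if is_political(tweet.get("text", "")):
                fout.write(line)
```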

3.2 Annotation

As stated in the Introduction (chapter 1), we follow Founta et al. (2018) for the definition of abusive language; therefore, to better compare our results, we collapsed all the labels into two macro-categories: abusive (ABU) vs. non-abusive (NOT). The Waseem and Hovy (WH) data set (Waseem and Hovy, 2016) contains Twitter messages manually annotated for racism and sexism. The authors used a list of 12 criteria to manually label the data. The HatEval (Basile et al., 2019) and OffensEval (Zampieri et al., 2019a) data sets were independently developed for SemEval 2019, Task 5 and 6 respectively. They have both been annotated by means of crowdsourcing. Regarding HatEval, Basile et al. (2019) proposed a shared task on the Multilingual Detection of Hate, where participants had to detect hate speech against immigrants and women on Twitter, in a multilingual perspective, for English and Spanish. The task is divided into two related subtasks for both languages: (A) a basic task about hate speech, and (B) another one where fine-grained features of hateful contents are investigated in order to understand how existing approaches may deal with the identification of especially dangerous forms of hate, for example those where the incitement is against an individual rather than against a group of people, and where an aggressive behavior of the author can be identified as a prominent feature of the expression of hate. We focused on subtask (A) and used only the English section. The OffensEval data set targets offensive language; a message was labeled as offensive if it contains any form of non-acceptable language (profanity) or a targeted offense, which can be veiled or direct. We consider the Waseem and Hovy and the HatEval data sets as addressing hate speech rather than just abusive language, due to their focus on targeting certain groups or individuals.4 We consider the OffensEval data set as addressing just abusive language, rather than hate speech, due to the aim of the task and the annotation of the messages not necessarily targeting offense towards certain groups or individuals. We used StackOverflow Offensive Comments, i.e. online questions and answers, as an additional data set for out-of-domain experiments. The data set contains over 50M comments from the StackOverflow website, collected over 4 years (between 2015 and 2019). The comments are flagged by users of StackOverflow in a multi-step process and have a near-zero false positive rate. We focused on the "Rude or Offensive" subsets of comments, so that these comments are compatible with the abusive language phenomena present in the in-domain data sets. This resulted in a total of 53,978 offensive comments.

To determine the similarities between data sets we use the Jensen-Shannon divergence score across the data sets, which shows the data sets to be distinct from each other (Table 3). However, the OffensEval, Waseem & Hovy and HatEval data sets seem to share more similarities than the others, signalling some probable similarities in the type of abusive comments or phenomena present in these data sets, especially the OffensEval and the HatEval data sets, which share the highest similarity score overall.

dataset      OffensEval   WH     HatEval   SO
OffensEval   -            0.69   0.73      0.68
WH           0.69         -      0.70      0.67
HatEval      0.73         0.70   -         0.66
SO           0.68         0.67   0.66      -

Table 3: Jensen-Shannon divergence scores across the data sets. Highest scores, based on similarity, are in bold.
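As a rough illustration of how such scores can be computed, the sketch below calculates the Jensen-Shannon divergence between the unigram (token) frequency distributions of two corpora. The thesis does not spell out its exact setup (which distributions, smoothing, or whether a similarity transformation is reported), so treat this as an assumption-laden sketch rather than the method actually used for Table 3.

```python
from collections import Counter
import math

def jensen_shannon(tokens_a, tokens_b):
    """Jensen-Shannon divergence (log base 2, so bounded by 1) between the
    unigram distributions of two tokenized corpora."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    na, nb = sum(ca.values()), sum(cb.values())
    jsd = 0.0
    for w in set(ca) | set(cb):
        p, q = ca[w] / na, cb[w] / nb
        m = 0.5 * (p + q)
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd
```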

4 http://bit.ly/2YNLrD4

3.3 Processing

Following the basic and sentiment-oriented pre-processing approaches of Luque and Pérez (2018), the pre-processing of the data for the creation of the polarized embeddings consisted of: stripping newlines, lower casing, replacing URLs with <url>, replacing user mentions with <user>, replacing hashtags with <hashtag>, deleting the last token(s) of a tweet message if the last token(s) was/were a user, URL or hashtag, tokenizing the tweet with the TweetTokenizer from nltk and then joining the tweet together with white spaces (Table 4). This mix of largely basic pre-processing with some additions of the suggested sentiment-oriented pre-processing is based on the claim of Bojanowski et al. (2016) that for their fastText library no specific type of pre-processing is required, whereas GloVe (Pennington et al., 2014) does use some (basic) pre-processing for their pre-trained embeddings. After pre-processing the data for the creation of the polarized embeddings we ended up with 3,715,999 tweets and 63,420,295 tokens.

Pre-processing                     Tweet message
strip newlines                     RT @USER: what's the difference? #DonaldTrump URL
lowercase                          rt @USER: what's the difference? #donaldtrump URL
replace url                        rt @USER: what's the difference? #donaldtrump <url>
replace usernames                  rt <user> what's the difference? #donaldtrump <url>
replace hashtag                    rt <user> what's the difference? <hashtag> <url>
delete last user/url/hashtag(s)    rt <user> what's the difference?
nltk TweetTokenizer                rt <user> what's the difference ?

Table 4: Example of the pre-processing process for the normalization of the tweets for the creation of the polarized embeddings. Username and link have been made unknown (USER, URL).
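A minimal sketch of this normalization pipeline is shown below. The exact replacement rules are an assumption and the ordering is compressed: here the placeholder mapping is applied per token after TweetTokenizer, rather than before tokenization as in Table 4.

```python
from nltk.tokenize import TweetTokenizer

_tok = TweetTokenizer()

def preprocess(tweet: str) -> str:
    text = tweet.replace("\n", " ").lower()   # strip newlines, lowercase
    tokens = _tok.tokenize(text)              # TweetTokenizer keeps @user, #tag and urls intact
    mapped = []
    for t in tokens:
        if t.startswith(("http", "www.")):
            mapped.append("<url>")
        elif t.startswith("@") and len(t) > 1:
            mapped.append("<user>")
        elif t.startswith("#") and len(t) > 1:
            mapped.append("<hashtag>")
        else:
            mapped.append(t)
    # drop trailing user/url/hashtag placeholders
    while mapped and mapped[-1] in ("<user>", "<url>", "<hashtag>"):
        mapped.pop()
    return " ".join(mapped)
```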

4 METHOD

Our method consisted of four different types of experiment runs, each one aimed at assessing a different contribution.

4.1 Contribution of Embeddings

For the contribution of embeddings we need to determine: the contribution of pre-trained generic embeddings vs. n-gram-based models, for the contribution of embeddings in general; the contribution of one embedding library vs. the other, to determine their contribution towards the creation of the polarized embeddings; and the contribution of polarized embeddings vs. pre-trained generic embeddings, to determine the contribution of polarized embeddings.

4.1.1 Pre-trained generic embeddings vs. n-gram-based models

The first run of experiments aimed at assessing the contribution of generic word embeddings with respect to n-gram-based models. We set up our experiments as follows: for each data set, we trained and tested on the same data distribution using two Linear Support Vector Machines (SVM), using the LinearSVC scikit-learn implementation (Pedregosa et al., 2011). The first model is an n-gram-based model using a CountVectorizer; specifically, we use 1-2 word and 3-7 character n-grams. The second model uses only information from the embeddings.
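A sketch of the n-gram baseline is given below: a LinearSVC over the union of word (1-2) and character (3-7) n-gram counts. Default hyperparameters are an assumption; the thesis only fixes the n-gram ranges and the LinearSVC implementation.

```python
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

ngram_svm = Pipeline([
    ("features", FeatureUnion([
        ("word_ngrams", CountVectorizer(analyzer="word", ngram_range=(1, 2))),
        ("char_ngrams", CountVectorizer(analyzer="char", ngram_range=(3, 7))),
    ])),
    ("clf", LinearSVC()),
])

# ngram_svm.fit(train_texts, train_labels)
# predictions = ngram_svm.predict(test_texts)
```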

For the generic embeddings we chose GloVe and fastText. GloVe (Pennington et al., 2014) uses a global log-bilinear regression model, combining the advantages of two major model families: global matrix factorization (i.e. latent semantic analysis (LSA) (Deerwester et al., 1990)) and local context window methods (i.e. the skip-gram model by Mikolov et al. (2013)). fastText is a subword-aware embeddings library (Bojanowski et al., 2016), based on the skipgram model (Mikolov et al., 2013), where each word is represented as a bag of character n-grams. A vector representation is associated with each character n-gram, and words are represented as the sum of these representations. This method allows computing word representations for words that did not appear in the training data.
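In formula form (notation ours, not taken from the thesis), the fastText vector of a word w is the sum of the vectors of its character n-grams:

```latex
u_w = \sum_{g \in \mathcal{G}_w} \mathbf{z}_g
\qquad \text{where } \mathcal{G}_w \text{ is the set of character } n\text{-grams of } w \text{ (plus } w \text{ itself).}
```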

To determine which embedding library is superior with respect to the other, experiments were done comparing the 200-dimensional GloVe embeddings trained on 27 billion Twitter tokens (Pennington et al., 2014) and the 300-dimensional fastText embeddings trained on Wikipedia 2017, the UMBC webbase corpus and the statmt.org news dataset, with and without subword information (Mikolov et al., 2018). We applied max pooling over these word embeddings, w, to obtain a 200-dimensional (for GloVe) or 300-dimensional (for fastText) representation of the full tweet instance, i. Words not covered by the GloVe embeddings are ignored. Since fastText also supports subword information, several methods were used to exploit this extra information for out-of-vocabulary words. These were set up as follows: one experiment ignored words not covered by the fastText embeddings, another tried to exploit the subword information of the first and last three characters of an out-of-vocabulary word, and another tried to exploit the subword information of all characters in an out-of-vocabulary word ranging from 3-6 character n-grams. After this, the trained models were applied on the test sets. The scores for OffensEval and HatEval are based on same-distribution train (+dev if available) and test data sets. The scores for Waseem & Hovy (WH) are based on 10-fold cross-validation, following the original experiment setting (Waseem and Hovy, 2016). We used the overall best performing embeddings library, based on macro F1 score, for the creation of the polarized embeddings.
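A sketch of the max-pooled tweet representation is shown below: an element-wise maximum over the vectors of in-vocabulary tokens. The dict-like embeddings lookup and the zero-vector fallback for tweets with no known words are assumptions for illustration.

```python
import numpy as np

def tweet_vector(tokens, embeddings, dim):
    """Max pooling over word vectors; out-of-vocabulary tokens are ignored.
    `embeddings` maps token -> 1-D numpy array of length `dim`."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return np.zeros(dim)            # fallback for tweets without known words
    return np.max(np.stack(vecs), axis=0)
```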

4.1.2 Polarized embeddings vs. pre-trained generic embeddings

The second run of experiments aimed at assessing the contribution of polarized embeddings with respect to pre-trained generic embeddings. The experiments were set up by creating polarized embeddings, trained on the politically biased Twitter data described in chapter 3, using the overall best performing embeddings library, based on macro F1 score, resulting from the previous GloVe vs. fastText experiments. The polarized embeddings were set up with the default hyperparameters, changing only the minimum frequency (vocabulary count) to 1, the number of dimensions to 300 and the context window to 5. We used these values for minimum frequency and context window based on tests with different hyperparameters on small corpora for polarized embeddings, carried out in Leon Graumans' experiments (Table 5). Our polarized embeddings ended up with a total of 63,420,295 tokens and a vocabulary of 203,700 words.

Model                                Data set   F1 (macro)
GloVe: Twitter freq=5 window=10      HatEval    .450
GloVe: Twitter freq=1 window=5       HatEval    .470
fastText: Twitter freq=5 window=5    HatEval    .485
fastText: Twitter freq=1 window=5    HatEval    .525

Table 5: Embedding parameter tests for polarized embeddings: higher minimum frequency (freq) and context window (window) vs. lower minimum frequency and context window. A model consists of the embedding library used (i.e. GloVe), the source of the polarized data (i.e. Twitter) and the hyperparameters (i.e. freq=5 window=10). Best scores, based on macro F1, are in bold.
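The sketch below shows how such embeddings could be trained with the reported hyperparameters (300 dimensions, context window 5, minimum frequency 1). The use of gensim's FastText wrapper, the skipgram setting and the one-tweet-per-line corpus format are assumptions; the thesis only states that the fastText library was used with otherwise default settings.

```python
from gensim.models import FastText
from gensim.models.word2vec import LineSentence

def train_polarized(corpus_path: str) -> FastText:
    """corpus_path: pre-processed tweets, one whitespace-tokenized tweet per line."""
    sentences = LineSentence(corpus_path)           # restartable iterator over token lists
    return FastText(sentences=sentences, vector_size=300, window=5, min_count=1, sg=1)

# model = train_polarized("polarized_tweets.txt")
# model.wv.most_similar("trump", topn=10)
```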

We then applied the trained models to the test sets, comparing the performance, based on macro F1 score, of the polarized embeddings against the pre-trained generic embeddings, using both models (SVM and BiLSTM), in same-distribution experiments as well as cross-data experiments. The latter was done to evaluate the portability of the polarized embeddings compared to the pre-trained generic embeddings in a cross-test scenario. Table 6 gives a quick insight into how the vocabulary of the polarized embeddings compares against the vocabularies of the much larger generic embeddings. From Table 6 we can see that our polarized embeddings can compete, in terms of vocabulary overlap, with the generic embeddings with respect to these data sets: our polarized embeddings score higher on all three data sets compared to the generic GloVe embeddings and higher on two out of three data sets compared to the generic fastText embeddings.

embeddings/dataset   OffensEval   WH       HatEval
polarized            86.92%       88.06%   82.94%
GloVe                83.58%       88.07%   80.67%
fastText             84.66%       89.83%   81.01%

Table 6: Coverage of the embeddings' vocabulary per in-domain data set vocabulary. Highest scores are in bold.

To validate our embeddings, and as an intrinsic evaluation, we check the nearest neighbors of some keywords for each of the embeddings. We set this up by using the k nearest neighbor method of finding similar words, based on cosine similarity, with k = 10. We handpicked a total of 20 keywords2; the keyword afroamerican was left out in the end because it did not have any similar words in any of the embeddings. Table 7 and Table 8 show a couple of interesting similar words, which also show the differences between the generic and polarized embeddings. In Table 7 you can see the difference between the embeddings by looking at the most similar words to the keyword trump: the generic fastText embeddings lean towards associating the keyword with the verb trump, while the polarized fastText embeddings lean towards associating the keyword with the person Donald Trump. The difference between the generic GloVe and the polarized fastText embeddings is best shown by the keyword trans (Table 8): the generic GloVe embeddings lean towards associating the keyword with Indonesian free-to-air television channels, while the polarized fastText embeddings lean towards associating this keyword with transgender.3

word/nn: trump
  fastText polarized:        ('donald', 0.87), ('rt', 0.69), ('<user>', 0.69), ('fdonald', 0.68), ('donald.be', 0.68), ('trumpdonald', 0.66), ('vox-donald', 0.66), ('bsdonald', 0.64), ('<hashtag>', 0.64), ('dumdonald', 0.64)
  fastText generic:          ('trumps', 0.82), ('trumping', 0.72), ('trumped', 0.68), ('supercede', 0.63), ('override', 0.62), ('supersede', 0.62), ('overrule', 0.61), ('outweigh', 0.60), ('ruffing', 0.59), ('outrank', 0.56)
  fastText generic subword:  ('trumps', 0.82), ('trumping', 0.72), ('trumped', 0.68), ('supercede', 0.63), ('override', 0.62), ('supersede', 0.62), ('overrule', 0.61), ('outweigh', 0.60), ('ruffing', 0.59), ('outrank', 0.56)

Table 7: Some similar words between the generic and polarized versions created with the fastText library.

word/nn: trans
  GloVe generic:        ('transtv', 0.62), ('antv', 0.58), ('rcti', 0.58), ('sctv', 0.54), ('fanacc', 0.54), ('mubank', 0.54), ('tv', 0.54), ('indosiar', 0.54), ('siwon', 0.53), ('jkt', 0.53)
  fastText polarized:   ('vetrans', 0.74), ('intrans', 0.73), ('transwomen', 0.69), ('transg', 0.68), ('transes', 0.68), ('transvestites', 0.67), ('trans-bathroom', 0.67), ('transverse', 0.67), ('trans-bathrooms', 0.67), ('trans-gender', 0.66)

Table 8: Some similar words between the generic GloVe and the polarized fastText embeddings.

In addition we perform a stability and a reliability check on our polarized embeddings. Following Wendlandt et al. (2018), we set this up by creating two more versions of the polarized embeddings with the same parameters. We then measure the overlap by comparing the top 10 nearest neighbors for a given word. Following Antoniak and Mimno (2018), for our reliability check we randomly shuffle our data and again create another version of the polarized embeddings, using the same hyperparameters. We then again measure the overlap by comparing the top 10 nearest neighbors for a given word. From these stability and reliability experiments only the first created version of the polarized embeddings, trained on the politically biased Twitter data set, was used for the abusive language detection experiments.

2 See any of the complete nearest neighbor tables in Appendix 7 for the list of used keywords.
3 For the entire table per embedding containing all the keywords, nearest neighbors and scores see Appendix 7.1: Table 24 (fastText polarized), Table 25 (fastText generic), Table 26 (fastText generic with subword information), Table 27 (GloVe generic).
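The overlap measure can be sketched as below: the percentage of shared words in the top-10 nearest-neighbor lists of two embedding models for the same keyword, which is then averaged over probe words (and over model pairs). Gensim-style wv.most_similar access is an assumption.

```python
def top10_overlap(model_a, model_b, word, k=10):
    """Percentage overlap between the top-k nearest neighbors of `word` in two models."""
    nn_a = {w for w, _ in model_a.wv.most_similar(word, topn=k)}
    nn_b = {w for w, _ in model_b.wv.most_similar(word, topn=k)}
    return 100.0 * len(nn_a & nn_b) / k

# e.g. averaging top10_overlap over the probe keywords and over every pair of the
# three randomly initialized models gives the stability scores reported in chapter 5.
```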

4.2 Contribution of the Models

For the contribution of the models we need to determine: the contribution of a simple linear model vs. a deep-learning model, both in same-distribution experiments and in cross-data experiments; and the contribution of methods for improving the models, based on macro F1 score.

4.2.1 Simple linear model vs. deep-learning model

Our third run of experiments aimed at assessing the contribution of a deep-learning model compared to a simple model (SVM). We set this up by running cross-data experiments using both models (Table 15, Table 16, Table 17). Our simple model consisted of the two LinearSVC models (using the LinearSVC scikit-learn implementation (Pedregosa et al., 2011)). The first model is an n-gram-based model using a CountVectorizer, with 1-2 word and 3-7 character n-grams. The second model uses only information from the embeddings. The deep-learning model consisted of a modified version of a Bidirectional Long Short-Term Memory (BiLSTM) classifier using Keras, provided by Zhang et al. (2019). In combination with the BiLSTM model we used an attention mechanism, enabling it to attend differentially to more and less important content when constructing the document representation, as proposed by Yang et al. (2016). LSTM models can handle input sequentially and can therefore take word order into account. We combine this with a bidirectional model, which allows us to process the tweets both forwards and backwards. For each word in the train instances, the LSTM model combines its previous hidden state and the current word's embedding weight to compute a new hidden state. After using dropout to shut down a percentage of neurons of the model, we feed the information to the attention mechanism. This mechanism emphasizes the most informative words in the document and gives these more weight. Our final model uses 512 units in the hidden layer of the BiLSTM, a batch size of 64, the Adam optimizer in combination with the default learning rate of 0.001 and a dropout of 0.4. We trained our model for 10 epochs, with a patience of 5, and saved the model with the lowest validation loss. Since deep neural models are highly expressive and prone to over-fitting, we kept the number of epochs low for regularization, following the findings of Pérez and Luque (2019). After these experiments we run an extra set of experiments on the StackOverflow data set, to further evaluate the portability of the trained models, and especially of our polarized embeddings, in an out-of-domain scenario.
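A compact sketch of such a classifier is given below (512 BiLSTM units, dropout 0.4, Adam with learning rate 0.001, batch size 64, 10 epochs with patience 5). It is not the exact model of Zhang et al. (2019): the attention layer is replaced here by global max pooling to keep the sketch short, and the embedding matrix, padded input sequences and binary output are assumptions.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

def build_bilstm(embedding_matrix: np.ndarray) -> keras.Model:
    """BiLSTM over frozen pre-trained embeddings for binary (ABU vs. NOT) classification."""
    vocab_size, emb_dim = embedding_matrix.shape
    model = keras.Sequential([
        layers.Embedding(vocab_size, emb_dim, trainable=False,
                         embeddings_initializer=keras.initializers.Constant(embedding_matrix)),
        layers.Bidirectional(layers.LSTM(512, return_sequences=True)),
        layers.Dropout(0.4),
        layers.GlobalMaxPooling1D(),   # stand-in for the attention mechanism of Yang et al. (2016)
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                           restore_best_weights=True)
# model = build_bilstm(embedding_matrix)
# model.fit(X_train, y_train, validation_data=(X_dev, y_dev),
#           epochs=10, batch_size=64, callbacks=[early_stop])
```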

4.2.2 Improving the models

The last and final run of experiments aimed at improving our best model. In our first setup we tried to improve our model by stacking our generic and polarized embeddings (Akbik et al., 2018). The idea behind stacking these embeddings is that both embeddings (generic, polarized) cover different aspects of the data, so one could benefit from the other. We expect the pre-trained generic embeddings to perform better on the negative class (NOT), while we expect the polarized embeddings to perform better on the positive class (ABU). Therefore, when stacking the embeddings we expect to see at least an increase in performance on the positive class, compared to the performance of the generic embeddings on the positive class, and an increase on the negative class, compared to the performance of the polarized embeddings on the negative class, and at most a performance increase overall. We set this up by combining the dimensions of both embeddings, which in total gave us 600 dimensions (generic: 300 dimensions, polarized: 300 dimensions): we take the 300-dimension vectors per word per embedding and combine them into one 600-dimension vector. For words that only appeared in one of the embeddings, the 300-dimension vector was combined with a 300-dimension vector of zeros. In our second setup we chose a simple way of trying to increase the performance: increasing the number of epochs from 10 to 50 and the patience from 5 to 25.
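The stacking step can be sketched as a plain concatenation, padding with zeros when a word occurs in only one of the two vocabularies; the dict-like lookups are assumptions for illustration.

```python
import numpy as np

def stack_vectors(word, generic, polarized, dim=300):
    """Concatenate the generic and polarized vectors of `word` into one 600-d vector.
    `generic` / `polarized` map token -> 1-D numpy array of length `dim`."""
    g = generic.get(word, np.zeros(dim))    # zeros if the word is missing from one vocabulary
    p = polarized.get(word, np.zeros(dim))
    return np.concatenate([g, p])
```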

After the experiments, we compare our performance on HatEval and OffensEval to the results from SemEval 2019 Task 5 (HatEval) (Basile et al., 2019) and SemEval 2019 Task 6 (OffensEval) (Zampieri et al., 2019b) to see how our best models compare to the current state-of-the-art.

5 RESULTS AND DISCUSSION

In this chapter the results of the methods and experiments discussed in chapter 4 are presented and discussed, with the goal of answering our research questions in chapter 6. In the first section we go over the results; in the second section we discuss them.

5.1 Results

The results are split up into several subsections. We first go over the results of the same-distribution experiments, then the cross-data experiments and lastly the improved models.

5.1.1 Same-distribution experiments

Table 9 shows the results of the first part of our first run of experiments. This table reports the results of the different methods for using the subword information, available in the fastText embeddings, for out-of-vocabulary words. From the results we can conclude that for the pre-trained generic fastText embeddings with subword information, the best use of the subword information with respect to out-of-vocabulary words is to ignore them. This method outperforms the other methods, based on macro F1, overall, and especially on the HatEval data set.

Model                           Data set     Class   P     R     F1 (macro)
SVM fastText subword none       OffensEval   NOT     .80   .67   .60
                                             ABU     .40   .56
                                WH           NOT     .82   .78   .70
                                             ABU     .57   .63
                                HatEval      NOT     .69   .37   .54
                                             ABU     .47   .78
SVM fastText subword (3 - -3)   OffensEval   NOT     .77   .61   .55
                                             ABU     .35   .53
                                WH           NOT     .81   .75   .68
                                             ABU     .53   .61
                                HatEval      NOT     .71   .29   .50
                                             ABU     .46   .84
SVM fastText subword (3 - 6)    OffensEval   NOT     .81   .65   .60
                                             ABU     .40   .60
                                WH           NOT     .81   .75   .68
                                             ABU     .54   .63
                                HatEval      NOT     .69   .14   .41
                                             ABU     .44   .91

Table 9: SVM fastText generic subword model results using different methods for out-of-vocabulary (oov) words: none vs. first and last 3 n-grams (3 - -3) vs. n-grams in range 3-6 (3 - 6). Best scores, based on macro F1, are in bold.

The second part of our first run of experiments shows that the pre-trained fastText embeddings are superior to the pre-trained GloVe embeddings with respect to these data sets (Table 10). Table 10 reports the results of the n-gram vs. embedding-based model experiments. The pre-trained fastText embeddings outperform the pre-trained GloVe embeddings on all data sets, especially on the HatEval data set. Therefore, for the creation of the polarized embeddings we will be using the fastText embeddings library. The results in Table 10 confirm that n-gram based models provide a really robust baseline, outperforming both the pre-trained generic GloVe word embeddings and the pre-trained generic fastText embeddings with subword information overall. Furthermore, the pre-trained generic fastText embeddings outperform the pre-trained generic fastText embeddings with subword information overall, based on macro F1 score, apart from the HatEval data set. Therefore those are the overall best performing pre-trained generic fastText embeddings, based on macro F1 score, and we do not include the pre-trained generic fastText embeddings with subword information in further experiments.

Model                  Data set     Class   P     R     F1 (macro)
SVM n-grams            OffensEval   NOT     .80   .91   .69
                                    ABU     .65   .43
                       WH           NOT     .82   .76   .70
                                    ABU     .55   .65
                       HatEval      NOT     .81   .28   .52
                                    ABU     .48   .91
SVM GloVe              OffensEval   NOT     .83   .74   .66
                                    ABU     .47   .61
                       WH           NOT     .83   .79   .71
                                    ABU     .59   .65
                       HatEval      NOT     .75   .13   .41
                                    ABU     .44   .94
SVM fastText           OffensEval   NOT     .85   .81   .71
                                    ABU     .56   .64
                       WH           NOT     .84   .82   .73
                                    ABU     .62   .65
                       HatEval      NOT     .75   .32   .53
                                    ABU     .48   .85
SVM fastText subword   OffensEval   NOT     .80   .67   .60
                                    ABU     .40   .56
                       WH           NOT     .82   .78   .70
                                    ABU     .57   .63
                       HatEval      NOT     .69   .37   .54
                                    ABU     .47   .78

Table 10: SVM model results: n-grams vs. generic GloVe embeddings vs. generic fastText embeddings. Best scores, based on macro F1, are in bold.

The second run of experiments shows that the pre-trained generic fastText embeddings outperform the polarized fastText embeddings, based on macro F1 score, on all data sets apart from HatEval (Table 11). Table 11 reports the results of the polarized vs. pre-trained generic embeddings experiments for the SVM model. We see a shift in performance when we switch from our (simple) linear model (SVM) to our deep-learning model (BiLSTM) (Table 12). Table 12 reports the results of the polarized vs. pre-trained generic embeddings experiments for the BiLSTM model. In this setup the polarized embeddings outperform the pre-trained generic fastText embeddings, based on macro F1 score, overall, apart from the OffensEval data set. Even though the polarized embeddings have been shown to outperform the pre-trained generic embeddings in same-distribution experiments, based on the best performing model and macro F1 score, we will keep using the pre-trained generic embeddings for cross-data experiments. This is because (i) cross-data experiments appeal to the portability of the model and the data set, as well as the embeddings, so we cannot assume the same result in a cross-test scenario; and (ii) the focus of this thesis is to create reliable polarized embeddings, evaluating them by comparing their performance to the performance of pre-trained generic embeddings, based on macro F1 score. Therefore, even if we could assume the same result in a cross-test scenario, we need the performance scores of the pre-trained generic embeddings (precision and recall per class and macro F1) as a comparison. Furthermore we can see that the BiLSTM models trained with embeddings outperform the SVM models, based on macro F1 score, overall, apart from the HatEval data set, showing their contribution in combination with embeddings for same-distribution experiments. Although the BiLSTM model has been shown to overall outperform the SVM model, based on macro F1 score, in same-distribution experiments, we keep the SVM model for cross-data experiments since linear models seem to be more robust than deep-learning models (Pérez and Luque, 2019). Therefore, when switching from same-distribution experiments to cross-data experiments, we expect the SVM models to outperform the BiLSTM models.

Model                    Data set     Class   P     R     F1 (macro)
SVM fastText polarized   OffensEval   NOT     .83   .72   .65
                                      ABU     .46   .62
                         WH           NOT     .84   .74   .70
                                      ABU     .55   .70
                         HatEval      NOT     .73   .36   .54
                                      ABU     .48   .82
SVM fastText             OffensEval   NOT     .85   .81   .71
                                      ABU     .56   .64
                         WH           NOT     .84   .82   .73
                                      ABU     .62   .65
                         HatEval      NOT     .75   .32   .53
                                      ABU     .48   .85

Table 11: SVM model results: polarized fastText embeddings vs. generic fastText embeddings. Best scores, based on macro F1, are in bold.

Model                       Data set     Class   P     R     F1 (macro)
BiLSTM fastText polarized   OffensEval   NOT     .85   .94   .77
                                         ABU     .77   .56
                            WH           NOT     .82   .93   .76
                                         ABU     .78   .56
                            HatEval      NOT     .76   .31   .53
                                         ABU     .48   .86
BiLSTM fastText             OffensEval   NOT     .85   .95   .78
                                         ABU     .81   .57
                            WH           NOT     .82   .92   .76
                                         ABU     .77   .56
                            HatEval      NOT     .76   .24   .49
                                         ABU     .46   .90

Table 12: BiLSTM model results: polarized fastText embeddings vs. generic fastText embeddings. Best scores, based on macro F1, are in bold.

As examples for the additional stability and reliability experiments on our po-larized embeddings, we use Table 13and Table 14 to show the calculation of the stability and reliability of the embeddings, respectively. Table 13, as an example, shows the top 10 nearest neighbors for the keyword muslim across three randomly initializedfastTextmodels trained on our politically biased Twitter data set (chap-ter3), followingWendlandt et al.(2018). Table14, as an example, shows the top 10 nearest neighbors for the keyword muslim across two randomly initializedfastText models, where fT POL v1 was trained on our politically biased Twitter data set as



From Table 13 we can conclude that, on average, each pair of models has nine out of ten words in common, so the stability of muslim across these three models is 90% (Table 13). To generalize this score we randomly select and analyze two more keywords (woman, homosexual) in exactly the same way. The stability of those keywords, averaged over each pair of models, is 90% and 100%, respectively, making the overall stability score 93.33% for these three keywords across three models.

muslim

fT POL v1 fT POL v2 fT POL v3

3muslim 3muslim 3muslim

o’muslim o’muslim o’muslim

omuslim omuslim omuslim

900muslim 900muslim 900muslim

musliim promuslim promuslim

anti_muslim anti_muslim anti_muslim

promuslim musliim musliim

muslimly muslimly muslimly

x-muslim avemuslim muslime

muslim.fuck x-muslim x-muslim

Table 13: Top ten most similar words for the word muslim in three randomly initialized fastText models trained on our politically biased Twitter data set. Words in only one list are in bold; words in only two of the lists (if any) are italicized.

From Table 14 we can conclude that, on average, the pair of models has eight out of ten words in common, so the reliability of muslim across these two models is 80%. We use the same method to generalize this score, using the same keywords. The reliability of those keywords across the two models is 70% (woman) and 100% (homosexual), making the overall reliability score 83.33% for these three keywords across two models.[1]

muslim

fT POL v1      fT POL vShuffle
3muslim        3muslim
o’muslim       o’muslim
omuslim        omuslim
900muslim      musliim
musliim        900muslim
anti_muslim    muslimly
promuslim      anti_muslim
muslimly       musli
x-muslim       muslime
muslim.fuck    promuslim

Table 14: Top ten most similar words for the word muslim in two randomly initialized fastText models trained on our politically biased Twitter data set. Words in only one list are in bold.

From these stability and reliability experiments, only fT POL v1, the first version of the polarized embeddings trained on the politically biased Twitter data set, was used for the abusive language detection experiments.

[1] Appendix 7.2: All 19 keywords and scores for the top 10 nearest neighbors for the alternate versions of the polarized embeddings (Tables 28-30).
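The overlap computation used for these scores is straightforward to reproduce. The snippet below is a minimal sketch rather than the exact thesis code: it assumes the official fasttext Python bindings, hypothetical model file names, and the three example keywords, and it scores stability (or, when applied to fT POL v1 and fT POL vShuffle, reliability) as the average pairwise top-10 nearest-neighbor overlap, following Wendlandt et al. (2018).

from itertools import combinations

import fasttext

# Hypothetical file names for three identically trained runs of the polarized
# embeddings; KEYWORDS lists the example query words used above.
MODEL_PATHS = ["fT_POL_v1.bin", "fT_POL_v2.bin", "fT_POL_v3.bin"]
KEYWORDS = ["muslim", "woman", "homosexual"]

models = [fasttext.load_model(path) for path in MODEL_PATHS]

def top_k(model, word, k=10):
    # get_nearest_neighbors returns (similarity, word) pairs; keep the words only
    return {neighbor for _, neighbor in model.get_nearest_neighbors(word, k=k)}

def stability(models, word, k=10):
    # Average pairwise overlap (in [0, 1]) of the top-k neighbor sets
    overlaps = [len(top_k(m1, word, k) & top_k(m2, word, k)) / k
                for m1, m2 in combinations(models, 2)]
    return sum(overlaps) / len(overlaps)

for word in KEYWORDS:
    print(f"{word}: {stability(models, word):.0%}")

# Averaging the per-keyword scores gives the overall score, e.g.
# (0.90 + 0.90 + 1.00) / 3 = 93.33% for the stability example above; the same
# function applied to [fT POL v1, fT POL vShuffle] yields the reliability score.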



5.1.2 Cross-data experiments

For our cross-data experiments using OffensEval as the train set we can see that our polarized embeddings outperform the pre-trained generic embeddings, based on macro F1 score, as well as the n-gram-based model overall (Table 15). Furthermore, our best SVM model outperforms the best BiLSTM model, based on macro F1 score, overall, having a better performance on the Waseem & Hovy data set.
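To make the cross-data protocol concrete, the sketch below trains on one data set and evaluates on the other two. It is an illustration only: the data loader and toy messages are placeholders, and it assumes mean-pooled fastText word vectors as SVM features, which may differ from the exact feature construction used in the thesis.

import numpy as np
import fasttext
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

ft = fasttext.load_model("fT_POL_v1.bin")  # hypothetical path to the polarized model

# Tiny toy stand-ins; the real loaders read the annotated OffensEval,
# Waseem & Hovy (WH) and HatEval corpora described in chapter 3.
TOY = {
    "OffensEval": (["you are awful", "have a nice day"], ["ABU", "NOT"]),
    "WH": (["go back home", "lovely weather today"], ["ABU", "NOT"]),
    "HatEval": (["what an idiot", "good morning all"], ["ABU", "NOT"]),
}

def load_dataset(name):
    return TOY[name]

def embed(messages):
    # Represent each message as the mean of its fastText word vectors
    return np.array([
        np.mean([ft.get_word_vector(tok) for tok in msg.split()]
                or [np.zeros(ft.get_dimension())], axis=0)
        for msg in messages
    ])

X_train, y_train = load_dataset("OffensEval")
clf = LinearSVC().fit(embed(X_train), y_train)

for test_name in ("WH", "HatEval"):  # never the training distribution
    X_test, y_test = load_dataset(test_name)
    predictions = clf.predict(embed(X_test))
    print(test_name, f1_score(y_test, predictions, average="macro"))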

Model                                  Data set    Class    P    R    F1 (macro)
SVM OffensEval: n-grams                WH          NOT     .73  .86   .59
                                                   ABU     .50  .31
                                       HatEval     NOT     .63  .41   .52
                                                   ABU     .45  .67
SVM OffensEval: fastText               WH          NOT     .74  .77   .59
                                                   ABU     .45  .40
                                       HatEval     NOT     .68  .37   .53
                                                   ABU     .47  .76
SVM OffensEval: fastText polarized     WH          NOT     .75  .71   .59
                                                   ABU     .44  .49
                                       HatEval     NOT     .68  .40   .54
                                                   ABU     .47  .74
BiLSTM OffensEval: fastText            WH          NOT     .73  .84   .58
                                                   ABU     .48  .31
                                       HatEval     NOT     .63  .48   .54
                                                   ABU     .46  .62
BiLSTM OffensEval: fastText polarized  WH          NOT     .72  .85   .58
                                                   ABU     .48  .30
                                       HatEval     NOT     .64  .46   .54
                                                   ABU     .46  .64

Table 15: OffensEval model cross test distribution results: n-grams vs. embeddings. Best scores, based on macro F1, are in bold.

For our cross-data experiments using HatEval as the train set we can see that the pre-trained generic embeddings outperform our polarized embeddings, as well as the n-gram model, based on macro F1 score, on both data sets (Table 16). Again, the best SVM model outperforms the best BiLSTM model, based on macro F1 score, in this case on both data sets. For our cross-data experiments using the Waseem and Hovy data set as the train set we can see that the polarized embeddings outperform the pre-trained generic embeddings, based on macro F1 score, on both data sets, as well as the n-gram model overall, having a better performance on the HatEval test set (Table 17). We see the same trend again, where the best SVM model outperforms the best BiLSTM model, based on macro F1 score, on both data sets.



Model                               Data set    Class    P    R    F1 (macro)
SVM HatEval: n-grams                OffensEval  NOT     .73  .59   .51
                                                ABU     .30  .45
                                    WH          NOT     .72  .69   .55
                                                ABU     .38  .42
SVM HatEval: fastText               OffensEval  NOT     .78  .75   .60
                                                ABU     .42  .46
                                    WH          NOT     .72  .78   .56
                                                ABU     .41  .33
SVM HatEval: fastText polarized     OffensEval  NOT     .77  .70   .57
                                                ABU     .37  .45
                                    WH          NOT     .72  .64   .55
                                                ABU     .38  .47
BiLSTM HatEval: fastText            OffensEval  NOT     .74  .90   .54
                                                ABU     .41  .19
                                    WH          NOT     .69  .71   .51
                                                ABU     .34  .32
BiLSTM HatEval: fastText polarized  OffensEval  NOT     .74  .88   .53
                                                ABU     .37  .19
                                    WH          NOT     .70  .77   .53
                                                ABU     .36  .28

Table 16: HatEval model cross test distribution results: n-grams vs. embeddings. Best scores, based on macro F1, are in bold.

Model                          Data set    Class    P    R    F1 (macro)
SVM WH: n-grams                OffensEval  NOT     .75  .85   .57
                                           ABU     .41  .28
                               HatEval     NOT     .63  .57   .55
                                           ABU     .47  .53
SVM WH: fastText               OffensEval  NOT     .76  .73   .56
                                           ABU     .36  .40
                               HatEval     NOT     .68  .45   .56
                                           ABU     .49  .71
SVM WH: fastText polarized     OffensEval  NOT     .76  .72   .57
                                           ABU     .37  .41
                               HatEval     NOT     .68  .56   .59
                                           ABU     .52  .64
BiLSTM WH: fastText            OffensEval  NOT     .74  .90   .52
                                           ABU     .40  .17
                               HatEval     NOT     .63  .81   .56
                                           ABU     .56  .34
BiLSTM WH: fastText polarized  OffensEval  NOT     .73  .93   .51
                                           ABU     .42  .13
                               HatEval     NOT     .72  .46   .58
                                           ABU     .50  .76

Table 17: Waseem and Hovy model cross test distribution results: n-grams vs. embeddings. Best scores, based on macro F1, are in bold.

5.1.3 Improved model experiments

From our last run of experiments we can see that our improved models by way of stacking the embeddings did not lead to an improvement in performance compared to our best model (Table 11, Table 12), based on macro F1 score (Table 18, Table 19).



The scores of the models improved by stacking the embeddings seem to be determined mostly by the dominant embeddings, i.e. the best performing individual embeddings, since the stacked models obtain the same scores. The models improved by increasing the epochs and patience for the best BiLSTM + embeddings model did lead to a slight improvement on the OffensEval data set (Table 18), based on macro F1 score, but did not lead to an improvement on the HatEval data set (Table 19).
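For reference, a minimal sketch of the two "improved model" variants follows. It assumes that stacking means per-token concatenation of the generic and polarized embedding matrices over a shared vocabulary index, and it uses a standard Keras BiLSTM; the vocabulary handling, layer sizes and training data are placeholders rather than the exact configuration of the thesis models.

import numpy as np
from tensorflow.keras import callbacks, layers, models

VOCAB_SIZE, DIM = 20000, 300  # placeholder sizes
# Stand-ins for the pre-trained generic and polarized matrices, both indexed by
# the same vocabulary so that row i of each matrix describes the same word.
emb_generic = np.random.rand(VOCAB_SIZE, DIM).astype("float32")
emb_polarized = np.random.rand(VOCAB_SIZE, DIM).astype("float32")
emb_stacked = np.concatenate([emb_generic, emb_polarized], axis=1)  # (VOCAB_SIZE, 2 * DIM)

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, emb_stacked.shape[1],
                     weights=[emb_stacked], trainable=False),
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dense(1, activation="sigmoid"),  # NOT vs. ABU
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

# The second variant keeps a single (non-stacked) embedding matrix but trains
# longer, relying on early stopping: 50 epochs with a patience of 25.
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=25,
                                     restore_best_weights=True)
# model.fit(X_train, y_train, validation_data=(X_dev, y_dev),
#           epochs=50, callbacks=[early_stop])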

Model                                    Data set    Class    P    R    F1 (macro)
BiLSTM Stacked: fT gen + fT pol          OffensEval  NOT     .85  .95   .77
                                                     ABU     .80  .55
BiLSTM: fT gen, 50 epochs & 25 patience  OffensEval  NOT     .85  .95   .79
                                                     ABU     .81  .58
BiLSTM: fT gen (Table 12)                OffensEval  NOT     .85  .95   .78
                                                     ABU     .81  .57

Table 18: OffensEval: Improved models vs. best performing model with respect to the data set. Best scores, based on macro F1, are in bold. (fT gen: generic fastText embeddings, fT pol: polarized fastText embeddings)

Model                                    Data set   Class    P    R    F1 (macro)
BiLSTM Stacked: fT pol + fT gen          HatEval    NOT     .75  .31   .53
                                                    ABU     .48  .86
BiLSTM: fT pol, 50 epochs & 25 patience  HatEval    NOT     .76  .31   .53
                                                    ABU     .48  .86
SVM: fT pol (Table 11)                   HatEval    NOT     .73  .36   .54
                                                    ABU     .48  .82

Table 19: HatEval: Improved models vs. best performing model with respect to the data set. Best scores, based on macro F1, are in bold. (fT gen: generic fastText embeddings, fT pol: polarized fastText embeddings)

Although the goal of this thesis is not to outperform the state-of-the-art, in addition we compare the performance of our best models on SemEval 2019 Task 5 & 6 to the teams participating in these tasks, to see how our best models compare to the current state-of-the-art (Table 20, Table 21). Our best models do not outperform the current state-of-the-art, based on macro F1 score. Our best model for the HatEval data set places fourth using the same-distribution data set. We do manage to outperform the ensemble model, Grunn2019, from previous work (Zhang et al., 2019), on the SemEval 2019 Task 5 (HatEval) English Task A (Table 20). Interestingly, looking at our cross-data results we see that our best performing model on the HatEval test set, SVM WH: fastText polarized, scores .59 (Table 17), outperforming our best performing model on HatEval using the same-distribution data set, based on macro F1 score. This score would place us second compared to the scores in Table 20.

HatEval English Task A macro F1

Fermi .651

Panaetius .571

YNU_DYX .546

SVM: fT pol .540

Grunn2019 .378

Table 20: Scores, based on macro F1, of our best model (SVM: fT pol) and our previous participation (Grunn2019 (Zhang et al., 2019)) on English subtask A from the SemEval Task 5 (HatEval) shared task, compared to the top three systems in that subtask (Indurthi et al., 2019; Ding et al., 2019).



Comparing our best model for the OffensEval test set to the teams that competed in the SemEval 2019 Task 6 (OffensEval), we place 13th.

OffensEval Sub-task A                      macro F1
1                                          .829
2                                          .815
3                                          .814
4                                          .808
5                                          .807
6                                          .806
7                                          .804
8                                          .803
9                                          .802
CNN                                        .800
10                                         .798
11-12                                      .793-.794
BiLSTM: fT gen, 50 epochs & 25 patience    .790
13-23                                      .782-.789
24-27                                      .772-.779

Table 21: Scores, based on macro F1, of our best model (BiLSTM: fT gen, 50 epochs & 25 patience) on English subtask A from the SemEval Task 6 (OffensEval) shared task, compared to the top 27 systems in that subtask, including the CNN by Zampieri et al. (2019a).

Table 22 reports the results of the out-of-domain experiments on the StackOverflow offensive comments data set. Only the results of the SVM models trained with generic and polarized fastText embeddings are reported, as they have been shown to perform better across different data distributions than the n-gram and the BiLSTM models. Precision scores are omitted as they always correspond to 1.0; since the StackOverflow test set only contains positive (ABU) instances, there are no false positives for the ABU class. Although the data set belongs to a different domain, which already makes it difficult to obtain high scores, the polarized embeddings perform much better than the pre-trained generic embeddings in all settings. The models trained on OffensEval perform better than the others, signalling a higher compatibility with respect to the abusive language phenomenon targeted in the StackOverflow offensive messages.

Model                Class    R    F1 (macro)
OffensEval: fT GEN   ABU     .24   .38
OffensEval: fT POL   ABU     .36   .53
WH: fT GEN           ABU     .14   .25
WH: fT POL           ABU     .17   .29
HatEval: fT GEN      ABU     .18   .31
HatEval: fT POL      ABU     .27   .43

Table 22: Cross domain results on the StackOverflow positive class (ABU) instances with pre-trained generic fastText (fT GEN) and polarized fastText (fT POL) embeddings. Best scores in bold.
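As a sanity check on these numbers (an illustration, not an additional experiment): because precision is fixed at 1.0, the reported F1 follows directly from the recall. Taking the reported F1 as the ABU-class F1, the OffensEval-trained model with polarized embeddings gives, for instance,

F1 = 2PR / (P + R) = (2 · 1.0 · 0.36) / (1.0 + 0.36) ≈ 0.53,

and the remaining rows of Table 22 are consistent with the same relation up to rounding.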

5.2 Discussion

Table 9 showed unexpected results regarding the use of subword information, namely that ignoring out of vocabulary words gives the best performance overall, based on macro F1 score, rather than assigning subword-based vectors to them. Either our methods or the subword information within the embeddings are not effective enough to deal with out of vocabulary words within these data sets. Furthermore, the difference in performance between the pre-trained generic fastText embeddings and the pre-trained generic fastText embeddings with subword information is interesting.
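To make the two out of vocabulary strategies explicit, a minimal sketch is given below. It assumes the official fasttext Python bindings and mean-pooled message vectors, which may differ from the exact feature construction used in the thesis; the model path is a placeholder. Here model.words is the training vocabulary, while get_word_vector() composes a vector from character n-grams (subwords) for words outside it.

import numpy as np
import fasttext

model = fasttext.load_model("fT_POL_v1.bin")  # hypothetical path
vocab = set(model.words)  # words seen during training

def message_vector(tokens, use_subwords):
    if use_subwords:
        kept = tokens  # OOV tokens get a subword-composed vector
    else:
        kept = [tok for tok in tokens if tok in vocab]  # OOV tokens are ignored
    if not kept:
        return np.zeros(model.get_dimension())
    return np.mean([model.get_word_vector(tok) for tok in kept], axis=0)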



Since the embeddings do not differ in size (vocabulary, tokens) or source, the performance difference seems to be due to the subword information. As for the contribution of embeddings vs. n-grams, only the generic fastText embeddings without subword information and the use of a BiLSTM model in combination with embeddings showed better performance than the n-gram based models, showing the robustness of n-gram based models (Table 10, Table 12) for the same-distribution data experiments. However, things are different when we apply the trained models across data sets (Table 15, Table 16, Table 17). In this scenario, the n-gram models fail to generalize, showing a strict dependence on their respective training data. On the other hand, the SVM models trained with specific embedding representations actually perform better, although they all show a consistent drop compared to the same-distribution test data, apart from the HatEval data set. For the HatEval data set we do not see a drop in performance when changing the same-distribution training data set to any of the other-distribution training data sets; when trained on the Waseem & Hovy data the performance even increases, signalling a higher compatibility with respect to the HatEval test data set. This could be due to similarities within the abusive phenomena of these data sets, based on their Jensen-Shannon divergence (JSD) score of .70 (Table 3). But, based on those scores we would expect the OffensEval data set to perform even better on the HatEval test set than the Waseem and Hovy data set, since its similarity score is even higher (.73). A possible explanation is that the abusive phenomena present in the Waseem and Hovy data set are more similar to the abusive phenomena present in the HatEval data set than those present in the OffensEval data set are. After all, the Waseem and Hovy data set is considered as addressing hate, the same as the HatEval data set. To check this we performed additional JSD tests only on the messages labeled as abusive, compared to the HatEval data set (Table 23). From Table 23 we see that, again, the abusive OffensEval instances have a higher similarity score than the abusive Waseem and Hovy instances. These scores thus do not answer the question of why the Waseem and Hovy data set performs better than the OffensEval data set in a cross-test scenario on the HatEval test set.

Data set         HatEval
ABU OffensEval   0.73
ABU WH           0.67

Table 23: Jensen-Shannon divergence scores across the data sets, only for the abusive messages of the OffensEval and Waseem & Hovy data sets, compared to the HatEval data set. Highest score, based on similarity score, is in bold.
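For reference, a minimal sketch of how such a label-filtered comparison can be computed is given below. It assumes whitespace tokenisation and unigram distributions over the union vocabulary; the corpus variables are placeholders, and the mapping from the raw divergence to the similarity scores reported in Table 3 and Table 23 is not reproduced here.

from collections import Counter

import numpy as np
from scipy.stats import entropy

def unigram_distribution(messages, vocab):
    counts = Counter(token for message in messages for token in message.split())
    frequencies = np.array([counts[word] for word in vocab], dtype=float)
    return frequencies / frequencies.sum()

def jensen_shannon_divergence(corpus_a, corpus_b):
    vocab = sorted({token for message in corpus_a + corpus_b for token in message.split()})
    p = unigram_distribution(corpus_a, vocab)
    q = unigram_distribution(corpus_b, vocab)
    m = 0.5 * (p + q)
    # JSD is the mean of the two KL divergences against the mixture m
    return 0.5 * entropy(p, m, base=2) + 0.5 * entropy(q, m, base=2)

# e.g. jensen_shannon_divergence(offenseval_abusive_messages, hateval_messages)
# for the ABU-only comparison of Table 23 (the corpus variables are placeholders).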

From the cross-data experiment results we see that the model trained on the Waseem and Hovy data set performs better overall because its precision and recall on both labels are closer to each other (Table 17), whereas for the model trained on the OffensEval data set precision and recall are more unevenly divided across the two labels (Table 15). Another explanation could be that, because the Waseem and Hovy data set is bigger than the OffensEval part used in the cross-data experiments (16,905 and 13,240 messages, respectively), the model is able to learn more.

As for our simple model vs. deep-learning model performance, our simple model proved to be more robust, showing less of a drop when applying the trained model across data sets, confirming Pérez and Luque (2019). Our deep-learning model showed its contribution on the same-distribution data set experiments, consistently outperforming the SVM models, based on macro F1 score, apart from the HatEval data set. For the BiLSTM model this thus shows a strict dependence on its respective training data, which, apart from the increase in performance across data sets on HatEval, was as expected. Regarding Tables 20 and 21, the goal for this thesis was not to outperform the state-of-the-art in abusive language detection,
