
Cross-Domain Authorship Attribution

Gijs Wiltvank (s2749645)
Master thesis Information Science
February 5, 2020


Abstract

In this paper I present two different systems for cross-domain authorship attribution. More specifically, this paper concerns authorship identification. I present an adaptation of the Groningen Lightweight Authorship Detection (GLAD) system (Hürlimann et al., 2015), developed in Groningen in 2015. This system was originally developed for authorship verification and is therefore based on a collection of similarity and dissimilarity features. Furthermore, I propose a completely new system, the Gijs Authorship Identification Method (GAIM), which is based on character n-grams. Both systems are run on the data of the 2018 PAN shared task, which comprises cross-domain authorship attribution of fan-fiction. The best performing GLAD model achieves a MACRO-F1 score of 0.651 on the evaluation data. In comparison, the best performing GAIM model achieves a MACRO-F1 score of 0.71 on the evaluation data. This result is competitive with those of state-of-the-art systems that participated in the shared task and would have resulted in a third place out of thirteen. In comparison, GLAD's results lag behind and would have resulted in a seventh place.


Contents

Abstract
Preface
1 Introduction
2 Background
  2.1 In-domain Authorship Attribution
  2.2 Cross-domain Authorship Attribution
  2.3 Cross-topic Authorship Attribution
  2.4 GLAD
3 Data
  3.1 Collection
  3.2 Annotation
  3.3 Processing
4 Method
  4.1 Problem Framing
  4.2 GLAD
    4.2.1 Feature extraction
    4.2.2 Verification
    4.2.3 Classification
  4.3 GAIM
    4.3.1 General system
    4.3.2 Classification
    4.3.3 Verification
  4.4 Combining Classification and Verification
    4.4.1 Meta Classifier
    4.4.2 Summing
5 Results & Discussion
  5.1 GLAD
  5.2 GAIM
6 Conclusion
  6.1 GLAD
  6.2 GAIM
7 Bibliography


Preface

It feels weird writing a preface for my master thesis, and I do not quite know what to write. I therefore simply want to thank my supervisor Malvina Nissim for her guidance during my thesis. I would also like to thank my family, especially my sister, for proofreading and supporting me during my thesis.

Enjoy the read.

Gijs Wiltvank


1 Introduction

In recent years, the authenticity of online information has attracted much attention. Currently, various media put a lot of emphasis on the source and authenticity of information. In the case of written documents, an important aspect of this scrutiny relates to the authorship of these documents. Therefore, assessing the authenticity of written documents crucially relates to identifying the original author(s) of these documents. Stamatatos (2009) defines the following authorship analysis tasks: author verification, plagiarism detection, author profiling or characterization, and detection of stylistic inconsistencies. Furthermore, he provides a broad overview of the history of the field of authorship attribution (AA) and describes what an authorship identification (AID) problem looks like.

In every AID problem there is a set of candidate authors, a set of text samples for which the authors are known (training data) and a set of text samples for which the authors are unknown (test data). The task then consists of attributing one of the candidate authors to each of the texts from the test data. Consequently, the development of computational AID systems that can assist humans in various tasks in this domain (journalism, law enforcement, content moderation, etc.) carries great significance (Kestemont et al., 2018). The research at hand investigates cross-domain AID applied to fan-fiction, in line with the PAN shared task of 2018. PAN is a series of scientific events and shared tasks on digital text forensics and stylometry that started in 2009. These shared tasks and events have significantly contributed to attracting the attention of the research community to well-defined digital text forensics tasks. For these tasks, multiple benchmark datasets have been developed to assess state-of-the-art performance in a wide range of tasks.

As a result, the PAN shared task of 2018 is ideal for the research at hand because it allows for easy comparison with the state-of-the-art AID systems that were developed for this task. Therefore, the current research has two main goals:

• To examine to what extent the Groningen Lightweight Authorship Detection (GLAD) system (Hürlimann et al., 2015), developed in 2015, can be adapted to perform cross-domain AID and achieve competitive results when compared to state-of-the-art AID systems.

• To develop a new system more in line with state-of-the-art AID systems, compare its performance with those systems and the adapted GLAD system, and see if its results are competitive.

The structure of my research is as follows: Chapter 2 gives a broad overview of the field of AA. In chapter 3, the data used in the current research is explained. Chapter 4 describes the method used in this research in detail. In chapter 5, I show and discuss the results obtained during this study. Finally, in chapter 6 I come back to the main goals stated above and draw my conclusions with regard to these goals.


2 Background

The main idea behind AA is that by measuring certain textual features, it is possible to distinguish between texts written by different authors. As mentioned in the introduction, Stamatatos (2009) describes multiple AA tasks. In author verification, the task is to decide whether a given text is written by a certain author or not. Author profiling revolves around extracting information about the author from a given text, for example age, education and sex. Furthermore, he mentions plagiarism detection, which is about finding similarities between two texts. The detection of stylistic inconsistencies concerns differences in writing style within a given text, which may happen in collaborative writing. Finally, AID is described as explained earlier in the introduction.

2.1 In-domain Authorship Attribution

Stamatatos (2009) distinguishes two approaches to the AID problem: profile-based and instance-based approaches. In the profile-based approach, all available training texts per author are concatenated into one single file from which the properties of the author's style are extracted. In comparison, the instance-based approach considers each training text sample as a unit that contributes separately to the attribution model. The latter approach is used by the majority of modern authorship identification systems. This becomes evident from the submissions to the 2018 PAN shared task, the most significant shared task in the field of AA (Kestemont et al., 2018), which will be discussed later.

In-domain AA is the simplest form of AA, where the training data and test data are both from the same domain. For this task, many high-scoring models already exist, based on support vector machines (SVMs) and logistic regression. This is underlined by the fact that in the PAN shared task of 2017 (Pardo et al., 2017), which focused on an in-domain task, SVMs and logistic regression performed best. In comparison to the state-of-the-art SVM and logistic regression models, Shrestha et al. (2017) did some promising research in the field of in-domain AA using Convolutional Neural Networks (CNNs) to identify the authors of tweets. Furthermore, they were the first to present a CNN model based on character n-grams for AA of short texts.

There has been an earlier attempt by Rhodes (2015), but his research focused on longer texts. The CNN by Shrestha et al. (2017) captures local interactions at the character level, which are then aggregated to learn high-level patterns for modeling the style of an author. They compared their results with other common AA techniques using the Schwartz et al. (2013) data set, containing approximately 9,000 Twitter users with up to 1,000 tweets each, using the same train/test splits and normalized URLs, usernames, and numbers. Other techniques included in the comparison were logistic regression based on character bi-grams, Long Short-Term Memory networks (LSTMs) and a CNN trained on word embeddings. Their CNN trained on character bi-grams outperformed all other methods. Also, in line with most other recent studies, their research was instance-based. However, it has to be noted that this research focused on in-domain AA, so we do not know how it would have performed on a cross-domain task, which is more problematic.


2.2 Cross-domain Authorship Attribution

As mentioned earlier, cross-domain AA is more problematic than in-domain AA. In cross-domain AA the training data is taken from a different domain than the test data. This can be problematic because different domains can influence the way a certain author writes. For instance, the 140-character limit on Twitter, as opposed to an unrestricted blog, can have a massive influence on the writing style of an author. Overdorf and Greenstadt (2016) focused on cross-domain AA. The domains on which they focused were blogs, Tweets and Reddit comments. They found that at the time, state-of-the-art in-domain logistic regression methods did not perform well at all in cross-domain situations. Moreover, they demonstrated that feature selection can dramatically improve the accuracy in cross-domain situations. They demonstrate the following ways to improve the accuracy of cross-domain AA:

• Making changes that reweigh the features to reduce distortion

• Using ensemble learning to improve results by combining multiple classifiers with uncorrelated errors

• Using a mixed training method, i.e. adding data from the target domain that may not be relevant to the current problem to the training pool to increase accuracy

Using their cross-domain ensemble models based on logistic regression, they achieved accuracies of around 0.60 for their Twitter-blog data set and around 0.80 for their Twitter-Reddit data set. These are good results, but it has to be said that the domains used were quite similar. If the domains were more diverse, or a more distinct domain were added to the data sets, the accuracies might drop.

In 2018, the PAN shared task (Kestemont et al., 2018) consisted of two sub-tasks, namely Cross-domain Authorship Attribution and Style Change Detection. I will focus on the first because cross-domain AA is the focus of my own research. As mentioned earlier, instance-based approaches are currently the most common. This is underlined by the fact that for the AA sub-task there were eleven submissions, of which ten were instance-based. Kestemont et al. (2018) state that the relatively small size of the candidate author set, as well as the balanced distribution of training texts over the candidate authors, have positively affected this preference for instance-based methods. The AA task can be defined as closed-set cross-fandom attribution in fan-fiction. In other words, the task is to determine the most likely author of a previously unseen document of unknown authorship across different domains. N-grams were the most popular type of features to represent texts in this task. More specifically, character and word n-grams were used by the majority of the participants, including the top participant Custódio and Paraboni (2018). With regard to the classification algorithms used, the majority of the submissions used SVMs. However, other classification techniques were used as well, including neural networks (Gagala, 2018) and an ensemble model based on different logistic regression models (Custódio and Paraboni, 2018). As said, the best overall result was obtained by Custódio and Paraboni (2018), with a macro F1 score of 0.685. They are followed by Murauer et al. (2018) with a macro F1 score of 0.643, using an SVM trained on character n-grams.

It is interesting to note that the only profile-based submission achieved third place (Halvani and Graner, 2018). They achieved a macro F1 score of 0.629 using a compression-based cosine similarity measure to estimate the most likely author. Therefore, I think we still cannot say for sure whether instance-based or profile-based is the way to go for AA. However, it is clear that at the time of writing there


is a clear preference for an instance-based approach. Furthermore, the shared task shows that SVMs and logistic regression models are still the state-of-the-art in the field of AA.

As mentioned earlier, Gagala (2018) experimented with an instance-based neural network approach for the 2018 PAN shared task. They used a neural network trained on Part-of-Speech (POS) tags and various character n-grams. Their neural network did not perform up to par with the more common techniques like SVMs and logistic regression. They achieved a macro F1 score of 0.267, placing them ninth out of eleven. This score is also well below the PAN baseline of 0.584. However, upon investigating the paper, the reason for the low score became clear. Gagala (2018) claims that during the final run of their system there was a mistake in the settings of the model. As a result, the model was trained for just one epoch, as opposed to the tens of epochs it should have been. This resulted in their low final F1 score. During the development stage, the neural network achieved F1 scores ranging from approximately 0.520 to 0.550. Still, these are not the best results, but at least they are more promising than the official score they achieved. Nonetheless, it is clear that for cross-domain AA there is still a lot of room for improvement.

2.3 Cross-topic Authorship Attribution

In addition to cross-domain AA, there also exists cross-topic AA. Sapkota et al. (2014) state that most previous research on AA assumes that the training and test data are drawn from the same distribution. However, this assumption is not very realistic. Therefore, they tried to improve the prediction results in cross-topic AA, where the training data and test data differ in topic. For this, they used two data sets, one with both topic and genre variation, and the other with only topic variation. The first corpus consists of communication samples from twenty-one authors in six genres (Email, Essay, Blog, Chat, Phone Interview, and Discussion) on six topics (Catholic Church, Gay Marriage, War in Iraq, Legalization of Marijuana, Privacy Rights, and Sex Discrimination) (Goldstein-Stewart et al., 2009).

With this dataset, it is possible to see how the performance of cross-topic AA changes across different genres. The second data set is composed of texts published in The Guardian daily newspaper, written by thirteen authors in one genre on four topics. It contains opinion articles (comments) about World, U.K., Culture, and Politics. Their research yielded some interesting results. They demonstrated that training on diverse topics is better than training on a single topic. Therefore, it can be said that for cross-topic AA it is beneficial not only to use more data, but also to use thematically varied data. Moreover, they show that lexical features are closer to the thematic area and hence were an effective author discriminator in single-topic AA.

Similarly, character n-grams proved to be a very powerful feature, especially in a condition where training and test documents come from different thematic areas. This was underlined by the more recent study by Overdorf and Greenstadt (2016) and the research done in light of the 2018 PAN shared task, in which character n-grams also play a central role. Although single-topic AA is easier than cross-topic AA, the model proposed by Sapkota et al. (2014) for cross-topic AA achieved performance close to, or in some cases better than, that of a single-topic AA model. Another interesting conclusion of their study is that the addition of more training data from any topic, no matter how distant or close it is to the topic of the documents under investigation, improves the performance of cross-topic AA.


2.4 GLAD

In 2015, GLAD was developed in Groningen for the 2015 authorship verification (AV) shared task (Stamatatos et al., 2015). The task was to determine whether the author of one or more known documents is also the author of one unknown text. GLAD works as follows. The developed system treats the task as a binary classification task, training a model across the whole dataset. Based on the prediction that texts written by different authors are less similar than texts written by the same author, due to author-specific patterns in writing not under the conscious control of the author, they developed a set of 29 features to model similarity or dissimilarity. The PAN 2015 test results positioned GLAD in the third and fourth best place out of all participants for the Dutch and Greek datasets. GLAD obtained a somewhat lower, but still competitive, score for English, which was in contrast with promising cross-validation results.


3 Data

3.1 Collection

The data used in this research is the dataset created for the 2018 PAN shared task. It consists of a collection of fan-fictions and their associated metadata from the community platform Archive of Our Own (https://archiveofourown.org). Fan-fiction refers to fictional forms of literature which are nowadays produced by admirers ('fans') of a certain author in their style. The dataset contains English, French, Italian, Polish and Spanish texts that counted at least 500 tokens, according to the platform's own internal word count. A development set and an evaluation set were created with, importantly, no overlap in authors between the development set and the evaluation set. In this research, only the English datasets are used. An overview of the used data can be found in table 1. Below, a small example of what such texts may look like is given.

graceful ones."One more," Marvelous said, sounding royally bored from his seat."She’s tired," Joe said, though not unkindly. (the fucking jerk). (...) The next moment, he was throwing it at his feet. "Clean that up!" he barked, turning his back on her and stalking towards the door - presumably to eat somewhere else. Leaving

Table 1: PAN 2018 Data overview

                    Candidates   Known Texts   Unknown Texts
Development data
  Problem00001      20           140           105
  Problem00002      5            35            21
Evaluation data
  Problem00001      20           140           79
  Problem00002      15           105           74
  Problem00003      10           70            40
  Problem00004      5            35            16

Figure 1: Example of the ground truth file structure


3.2 Annotation

For both the development and the evaluation datasets, the gold standard is provided by the PAN organizers in a separate .json file; see figure 1 above. This .json file contains, for each unknown text, the name of the text document and its true author. During development, the system is created and evaluated on the development dataset using this provided gold standard. Therefore, the output of the developed systems has the same structure as the provided gold standard .json file. The evaluation set is only used during the final run, when the system is finished, to assess its performance.

3.3 Processing

The only pre-processing that is done on the data is tokenization. No other pre-processing, such as lowercasing or punctuation removal, is applied.
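
The text does not name the tokenizer used for this step; purely as an illustration, a minimal sketch assuming NLTK's word tokenizer could look as follows.

# Minimal tokenization sketch; the exact tokenizer is not specified in the text,
# so NLTK's word_tokenize is assumed here for illustration only.
import nltk

nltk.download("punkt", quiet=True)  # tokenizer models, downloaded once

def tokenize(text):
    """Return the raw tokens of a document; no lowercasing or punctuation removal."""
    return nltk.word_tokenize(text)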


4 Method

In 2015, the GLAD system was developed in Groningen. This system focuses on determining whether or not two documents are written by the same author. As explained earlier, this is called AV. GLAD performed very well and achieved competitive results on the in-domain AV task of the 2015 PAN shared task (Stamatatos et al., 2015). Moreover, Halvani et al. (2019) demonstrate in their paper that GLAD is the top performing system out of fourteen different systems in their experiments. With this in mind, GLAD seems like a good starting point to see whether it will also perform well in a cross-domain AA situation. To investigate this, I adapt the GLAD system developed in 2015 to ensure it is fit for the current task at hand, using the data of the 2018 PAN shared task. This task differs from the 2015 task in two ways.

First, instead of AV, this task focuses on AID. In other words, instead of determining whether two texts are written by the same author, the newly adapted system has to predict, for an unseen document, which "candidate author" is the real author from a closed set of possible candidate authors. Second, the 2018 task focuses on cross-domain data. This means that, for instance, the known texts from author A are from one domain (e.g. Twitter), and the unknown texts are from a different domain (e.g. Reddit). The 2018 dataset consists of fan-fictions posted in different fandoms. The fact that the data comes from different domains poses a challenge for AA systems, because linguistic or stylistic traits of a certain author may not transfer between domains, making it more difficult to identify the correct author of a given document.

As a consequence of the different nature of the task at hand, in comparison to the 2015 task, I decided to try two different approaches: treating the task as a binary classification (verification) problem, and treating it as a multi-class classification (classification) problem. In addition to the two different GLAD adaptations, the same two approaches are used to develop a new system as well. This is done especially because the GLAD system was not originally developed to do AID; therefore, it seems interesting to examine how it compares to a system made specifically for this task. Both approaches will be explained in the coming sections. Before that, the differences in structure between the 2015 data and the 2018 data are explained.

4.1 Problem Framing

As a result of the differences in structure, the GLAD system from 2015 cannot be used directly in combination with the 2018 data. In 2015, the data was structured as follows: the main folder contains a number of 'problem' folders. Each of these problem folders contains one or more known texts and exactly one unknown text, as shown in the example below.

Pan15training
    EN001
    EN002
        known01.txt
        unknown.txt


The 2018 data set is structured differently. Here, the main folder again contains a number of problem folders. Now, each problem folder contains multiple candidate author folders, and each of these authors has multiple known texts. The unknown texts are in a separate folder altogether, as shown below.

Pan18training
    problem00001
        candidate00001
        candidate00002
            known01.txt
            known02.txt
        unknown
            unknown01.txt
            unknown02.txt
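
A small sketch of how such a 2018 problem folder could be read into training and test sets, assuming the folder and file names shown above; the actual loading code of the systems may differ, and the encoding is an illustrative assumption.

# Read one PAN 2018 problem folder: known texts per candidate become the training
# set, the contents of the "unknown" folder become the test set.
import glob
import os

def load_problem(problem_dir):
    train_texts, train_labels, test_texts, test_names = [], [], [], []
    for cand_dir in sorted(glob.glob(os.path.join(problem_dir, "candidate*"))):
        label = os.path.basename(cand_dir)
        for path in sorted(glob.glob(os.path.join(cand_dir, "known*.txt"))):
            with open(path, encoding="utf-8") as f:
                train_texts.append(f.read())
            train_labels.append(label)
    for path in sorted(glob.glob(os.path.join(problem_dir, "unknown", "*.txt"))):
        with open(path, encoding="utf-8") as f:
            test_texts.append(f.read())
        test_names.append(os.path.basename(path))
    return train_texts, train_labels, test_texts, test_names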

4.2 GLAD

In this research, two different approaches are implemented in GLAD: the classification approach and the verification approach. Both approaches will be explained in more detail in their respective sections, but a short description is given here. With the classification approach, multi-class classification is performed. Relevant features are extracted from the texts and an SVM is trained on these features and used to make predictions on unseen texts. For the verification approach, the same features are extracted, but now the task is represented as a binary classification task. This is done by training multiple SVM models and storing their predictions in a matrix. Finally, from this matrix the most probable predictions are selected and saved as the final predictions.

4.2.1 Feature extraction

As mentioned above, for both approaches features are extracted from the raw texts, on which SVMs are trained. For a fair comparison with the original GLAD system, I kept the same features and their extraction methods in the updated system. All features that GLAD uses revolve around the similarity, or dissimilarity, between documents. This is understandable because the original system is designed to do AV. The original GLAD system makes use of a large number of features, which will be briefly explained now.

As said earlier, the features all revolve around similarity or dissimilarity between documents. Consequently, for the following feature explanation, the similarities and dissimilarities are always between one known text and one unknown text. The following features are extracted from these texts. Punctuation is extracted from the texts, frequencies are stored, and the cosine similarity between the vector of the known text and that of the unknown text is computed. The same is done for line endings, line length and n-grams up to a length of five. Furthermore, the overall vector similarity of the two tokenized documents is computed. Instead of similarities, differences (dissimilarities) are computed for letter-casing and text blocks. Letter casing refers to the ratio of uppercase to lowercase characters and the proportion of uppercase characters. Text block refers to the number of lines and characters per text block, in other words, the amount of blank space between these blocks.

Average sentence length is computed as well. This is done by counting the number of tokens per sentence and calculating the average. Sentence boundaries


for this are determined using the NLTK Punkt tokenizer (Loper and Bird, 2002). Furthermore, compression-based dissimilarity (CDM) is calculated. Hürlimann et al. (2015) describe the implemented CDM for two documents x and y as the sum of the compressed lengths of x and y divided by the compressed length of the concatenation of the two documents:

CDM(x, y) = (C(x) + C(y)) / C(xy)

where C(·) denotes the compressed length of a document.

Finally, document entropy is computed and the difference between the known document entropy and the unknown document entropy is calculated. All features are then stored in a vector and used to train the classifier. As mentioned before, all of the above features are computed per document pair, where each known document is compared with every unknown document. This quickly becomes computationally expensive.
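
To make the feature descriptions above concrete, the sketch below computes three of the described (dis)similarity signals for one known/unknown pair. zlib is assumed as the compressor and the helper names are illustrative; the actual GLAD implementation uses its own extraction code and a full set of 29 features.

# Simplified sketch of three GLAD-style (dis)similarity features for one
# known/unknown document pair.
import math
import zlib
from collections import Counter

def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two frequency dictionaries (e.g. punctuation counts)."""
    shared = set(counts_a) & set(counts_b)
    dot = sum(counts_a[k] * counts_b[k] for k in shared)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cdm(x, y):
    """Compression-based dissimilarity: (C(x) + C(y)) / C(xy), as described above."""
    compressed_len = lambda s: len(zlib.compress(s.encode("utf-8")))
    return (compressed_len(x) + compressed_len(y)) / compressed_len(x + y)

def entropy(text):
    """Character-level entropy of a document."""
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def entropy_difference(known, unknown):
    """The entropy feature is the difference between the two documents."""
    return abs(entropy(known) - entropy(unknown))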

4.2.2 Verification

The verification approach, also called the matrix approach, differs more from the original system than the classification approach does. For this method, all documents are compared with each other and the relevant similarity and dissimilarity features are extracted as explained above. Because all the documents in the dataset are compared with each other, the number of comparisons grows quadratically with the dataset size, and the approach thus becomes computationally more expensive as the dataset grows.

First, this data set is split into a training and a test set, where the training set consists of all the known texts of all the authors, and the test set consists of all the unknown texts. For training the model, the current closed-set multi-class situation has to be changed into a binary situation. This needs to be done in order to perform AV on the used datasets. It is done by training a default SVM classifier, as used in the original GLAD system, for each candidate author specifically on the training data. In order for this to work, the labels of the training data have to be changed to binary "no" and "yes" (0, 1) for the respective candidate author. For example, the SVM that will be trained for candidate author 1 needs data that is binarized accordingly. This is done by looping through the training texts and the training labels in parallel and setting the new label to "1" if the current training label and candidate author match, and to "0" if they do not. These newly created binary labels are then used for training the classifier.

This classifier then predicts for the unseen documents whether or not they were written by the current candidate author. From these predictions, the probabilities of the "yes" prediction are stored in a matrix per document. Only the "yes" probability is saved because that is the only prediction I am interested in, as I do not need to predict the negative class. The steps above are repeated for all possible candidate authors. The resulting matrix therefore shows, for each unknown document, the probability that it is written by each of the candidate authors. A small example matrix can be found in table 2 below.

The example table displays a small fictional situation that shows how the matrix would look with five unknown texts and three candidate authors. In figure 2, one example row of an actual matrix is shown, where each row represents a candidate and each column corresponds to an unknown document. In this case, there are 21 unknown documents. In order to get the final prediction of the model, there is one step left. This comprises selecting the highest probability for each document and saving the corresponding candidate as the model's final


prediction. For the example matrix given in Table 2, the model’s final prediction would be ("Candidate2", "Candidate1", "Candidate3", "Candidate3", "Candidate1").

Table 2: Small example matrix

              Unk1    Unk2    Unk3    Unk4    Unk5
Candidate1    0.276   0.333   0.244   0.253   0.292
Candidate2    0.366   0.204   0.267   0.223   0.244
Candidate3    0.265   0.234   0.356   0.294   0.286

Figure 2: Example of one row (candidate) of a possible matrix
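
A minimal sketch of this matrix approach is given below, assuming the GLAD-style (dis)similarity features have already been turned into numeric matrices X_train and X_test; scikit-learn's SVC with probability estimates stands in for the "default SVM classifier" mentioned above, and the variable names are illustrative.

# Verification ("matrix") approach: one binary SVM per candidate, trained on
# binarized labels; the "yes" probabilities per unknown document are collected
# and the highest-probability candidate is kept as the final prediction.
import numpy as np
from sklearn.svm import SVC

def verification_predict(X_train, train_labels, X_test, candidates):
    prob_matrix = []                        # rows: candidates, columns: unknown documents
    for candidate in candidates:
        y_binary = [1 if label == candidate else 0 for label in train_labels]
        clf = SVC(probability=True)         # default SVM, as in the adapted GLAD system
        clf.fit(X_train, y_binary)
        yes_index = list(clf.classes_).index(1)
        prob_matrix.append(clf.predict_proba(X_test)[:, yes_index])
    prob_matrix = np.array(prob_matrix)
    best = prob_matrix.argmax(axis=0)       # most probable candidate per document
    return [candidates[i] for i in best]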

4.2.3 Classification

The multi-class classification method is more straightforward. Just as before, all documents are compared with each other, the relevant features are extracted and saved as mentioned earlier, and the resulting data set is split into a training and a test set. A default SVM is fitted on the whole training set and used to make predictions on the test set. So, instead of binary classification, and thus verifying whether a document is written by the same author as in the previous approach, this method is more in line with the common closed-set multi-class classification approach.

4.3 GAIM

As mentioned before, GLAD was not originally developed for AID. Therefore, a completely new system is developed as well: the Gijs Authorship Identification Method (GAIM). In order to have comparable results, the classification technique and the verification technique are again implemented in this system.

4.3.1 General system

Both systems make use of the same initial processing, which will be discussed in this section; the individual systems are discussed in more detail in the following sections. First, the relevant files are opened, in this case all files of which the authors are known. Second, these files are read and appended to the training set. The unknown files are read and appended to the test set. Next, the vocabulary is extracted from the training set. As seen in the literature, almost all state-of-the-art AA systems use some form of character or word n-grams. Therefore, it seemed appropriate to start experimenting with those. Experiments with different word n-grams were done at first, but to no avail. Hence, I decided to use character n-grams for this system. After a substantial amount of experimentation, character seven-grams proved to be the best fit for this task on the development data. In addition to the above, experiments were done with keeping only n-grams


that meet a certain frequency threshold. This did not improve the performance of the model during development and, as a result, all character seven-grams are in the vocabulary used by the final model. This extracted vocabulary is then vectorized with scikit-learn's term frequency-inverse document frequency (tf-idf) vectorizer (Pedregosa et al., 2011). One of the differences with the GLAD system is that instead of using an SVM, I opted for a logistic regression classifier. One of the reasons for this is that logistic regression gave more promising results during the initial experiments, in comparison to the SVM. Furthermore, the literature shows that systems using logistic regression can compete with, and even outperform, the state-of-the-art SVM systems. This is illustrated by the fact that the top contestant of the 2018 PAN shared task, Custódio and Paraboni (2018), used models based on logistic regression.

4.3.2 Classification

In line with the more common classification techniques, this system trains a logistic regression classifier on all the available vectorized training data. Once trained, the model makes predictions on the unseen vectorized data, selecting one of the possible candidate authors. In short, this is simple closed-set multi-class classification as we have seen before. During development, some parameter tuning experiments were done, specifically with the "C" parameter and the "solver" parameter. However, these experiments did not increase the performance of the model. Consequently, the final system uses a logistic regression model with the following default parameters: solver='lbfgs', penalty='l2', C=1, multi_class='multinomial'.
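
A minimal sketch of this classification approach: character seven-gram tf-idf features feeding a logistic regression classifier with the parameters stated above. The pipeline wiring and the max_iter value are illustrative assumptions; the thesis only names the components and parameters.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

gaim_classification = make_pipeline(
    # character seven-grams, no frequency cutoff
    TfidfVectorizer(analyzer="char", ngram_range=(7, 7)),
    # multi_class='multinomial' is the default behaviour of the lbfgs solver in
    # current scikit-learn, so it is not passed explicitly here.
    LogisticRegression(solver="lbfgs", penalty="l2", C=1, max_iter=1000),
)

# Usage (illustrative variable names):
# gaim_classification.fit(train_texts, train_labels)
# predictions = gaim_classification.predict(test_texts)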

4.3.3 Verification

I was eager to implement the verification technique in GAIM because of its promising results during the adaptation of GLAD. In line with how the adapted GLAD system works, this system trains a model for each candidate author on the training data, which has been converted to have binary labels, as explained in its respective section (4.2.2). However, this results in very unbalanced training data for each trained model. For example, in a situation with twenty candidate authors, each with seven known documents, the total number of documents in the training data would be 140 (20 * 7). In this situation, the classifier for the first candidate would be trained on all 140 documents, of which only seven belong to the first candidate. This is the case for the GLAD system as well, but I did not want to change too much about the original system, and initial testing with down-sampling did not yield different results there. Therefore, I decided not to experiment further with down-sampling in GLAD.

Because the new system does not rely on document comparisons, I decided that down-sampling the majority class of the training data could be a good idea. After experimenting with different amounts of down-sampling, I found that down-sampling the majority class to fifteen documents gave the best overall results on the development data. With this down-sampled data, the remaining steps are the same as with the GLAD verification model. The only difference between GAIM and GLAD is that GAIM implements logistic regression instead of an SVM. In short, a logistic regression model per candidate is trained on the down-sampled and binarized training data.
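
A minimal sketch of the down-sampling step for one candidate's verification model: the candidate's own known texts keep label 1 and the remaining texts (label 0) are randomly reduced to fifteen. The function name and random seed are illustrative assumptions.

import random

def downsample_for_candidate(texts, labels, candidate, n_negative=15, seed=0):
    """Binarize labels for one candidate and down-sample the majority (negative) class."""
    positives = [(text, 1) for text, label in zip(texts, labels) if label == candidate]
    negatives = [(text, 0) for text, label in zip(texts, labels) if label != candidate]
    random.Random(seed).shuffle(negatives)
    kept = positives + negatives[:n_negative]
    sampled_texts = [text for text, _ in kept]
    binary_labels = [y for _, y in kept]
    return sampled_texts, binary_labels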

Now, the model is ready and is used to make predictions on the unknown documents for that given candidate author. This is done for all the possible candidates, and the probabilities per unknown document are saved in a matrix for each candidate author. This matrix then contains, for each unseen document, the


probability that a given candidate author has written it (see figure 2 and table 2 again for an example). Finally, for each document the most likely author is saved as the final prediction. Again, experiments with the "C" parameter and the "solver" parameter were done. In the case of the solver parameter, no setting other than the default resulted in better performance. For the "C" parameter, setting it to 2 resulted in slightly better performance. As a result, the final system uses a logistic regression model with the following parameters: solver='lbfgs', penalty='l2', C=2.

4.4 Combining Classification and Verification

During the evaluation phase of the GAIM systems, it was interesting to see that, while their overall results were comparable, the predictions per dataset differed considerably between the systems. As a result, I tried different methods to combine the predictions of both individual systems in order to increase overall performance. One way of combining two different classifiers is by creating a meta, or stacking, classifier.

4.4.1 Meta Classifier

State-of-the-art meta classifiers have proved to yield competitive, or even better, results than standard classification methods on text classification tasks. Consequently, combining the output of the two separate systems and training a new model on this output seems an interesting idea. However, for meta classifiers to work properly, it is essential that cross-validation is used during the training phase of the model, in order to ensure that the meta classifier actually learns from unseen data. The data used in this research prevents this. Because the data is split up into folders for known texts from candidate authors and a separate folder containing all the unknown texts, performing cross-validation on this dataset is no trivial task. Therefore, I decided to develop a meta classifier and train it on the whole development set, instead of using cross-validation, just to see whether it might learn something interesting or not.

The meta classifier that I have developed works as follows. First, the initial classification and verification models are run on the training and test data and their prediction probabilities are saved. Then, the prediction probabilities of the two models are simply concatenated with each other. This results in a new vector where, for each possible candidate author, there are two probabilities per unknown document. Next, a new classifier is trained, not on the regular features, but on this concatenated vector of training probabilities saved earlier. Finally, this newly trained classifier is used to make the final predictions on the unseen data, in this case the vector that results from concatenating the prediction probabilities of the classification and verification models on the test data, which were saved earlier. The same parameter experiments were done with the logistic regression model as before. As a result, the meta classifier is a logistic regression model with the following parameters: solver='lbfgs', penalty='l2', C=2.
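
A minimal sketch of this meta classifier: the per-candidate prediction probabilities of the classification and verification models are concatenated per document, and a logistic regression model with the parameters stated above is trained on that representation. The variable names, the max_iter value and the assumed array shape (n_documents, n_candidates) are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_meta_classifier(clf_probs, ver_probs, labels):
    """Train the stacking model on concatenated probability vectors."""
    meta_features = np.hstack([clf_probs, ver_probs])  # two probabilities per candidate per document
    meta = LogisticRegression(solver="lbfgs", penalty="l2", C=2, max_iter=1000)
    meta.fit(meta_features, labels)
    return meta

def predict_meta(meta, clf_probs, ver_probs):
    """Predict the final authors from the concatenated test probabilities."""
    return meta.predict(np.hstack([clf_probs, ver_probs]))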

When running the meta classifier on the development data, I found that the results were underwhelming in comparison to the other methods. Upon investigation, I found that a possible reason for the underwhelming performance, in addition to the lack of cross-validation, could be the difference in scale of the assigned probabilities. For example, the classification method performs multi-class classification. Therefore, the probabilities that are assigned to a given candidate can vary with the number of candidate authors per dataset. In


comparison, the verification approach always performs binary classification and thus its probabilities are always "larger". In order to get all the probabilities on the same scale, the following computation is applied to the probabilities of the verification method: each verification probability is divided by the sum of all the verification probabilities, resulting in weighted probabilities. This computation did not increase or decrease the results for the meta classifier. However, as the different probabilities are now on the same scale, it did give me a different idea on how to combine the two different prediction probabilities.

4.4.2 Summing

With the probabilities of both methods on the same scale, I decided that instead of creating a meta classifier, I could try to sum the probabilities of both systems and then make the final prediction. In comparison to the developed meta classifier, where the two prediction probabilities are concatenated after they are weighted, they are now summed. Therefore, the resulting vector has the same length as either vector individually. In other words, in comparison to the vector used by the meta classifier, the summed vector still only has one probability per candidate author per document. Consequently, the resulting matrix has the same structure as the matrix created with the normal verification technique. Therefore, the only thing left to do after the summing of the probabilities is to select the highest probability per document and its corresponding candidate author.
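
A minimal sketch of the summing method: the verification probabilities are rescaled (here per document, which is one reading of the description above), added to the classification probabilities, and the highest-scoring candidate per document becomes the final prediction. Array shapes (n_documents, n_candidates) and variable names are illustrative assumptions.

import numpy as np

def sum_predict(clf_probs, ver_probs, candidates):
    ver_scaled = ver_probs / ver_probs.sum(axis=1, keepdims=True)  # weighted verification probabilities
    combined = clf_probs + ver_scaled                              # one score per candidate per document
    return [candidates[i] for i in combined.argmax(axis=1)]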


5 Results & Discussion

5.1 GLAD

Table 3: GLAD results on the English development data

           Classification           Verification
           MACRO-F1    MICRO-F1     MACRO-F1    MICRO-F1
Prob1      0.464       0.562        0.534       0.676
Prob2      0.665       0.81         0.774       0.857
Average    0.565       0.686        0.654       0.767

Table 4: GLAD results on the English evaluation data

           Classification           Verification
           MACRO-F1    MICRO-F1     MACRO-F1    MICRO-F1
Prob1      0.553       0.696        0.590       0.734
Prob2      0.586       0.716        0.739       0.797
Prob3      0.638       0.8          0.696       0.875
Prob4      0.571       0.688        0.577       0.687
Average    0.587       0.725        0.651       0.773

During development, it became clear that the GLAD system is not designed for, and perhaps too complex for, the task at hand. This is reflected in the results on the English development dataset, which can be seen in table 3. When comparing these results with the results of the new GAIM system, it becomes clear that GAIM outperforms GLAD, especially on the larger dataset, in this case "Prob1". In addition to the mediocre results of GLAD, all the document comparisons done within GLAD result in a system with a long run-time. For example, running one of the two methods on the large dataset already results in a run-time of roughly one hour, even though the datasets used in this research are relatively small compared to those of other text classification tasks. As a result, using GLAD on more extensive datasets could quickly become problematic. Consequently, the similarity features seem less suited for this AA task than for the original task, because of the long run-time and the mediocre results on the development data in general. Nevertheless, the results on the evaluation set are shown in table 4. It is noteworthy that for both the development and the evaluation dataset, the verification approach achieves the best overall results. However, even these results are not competitive when compared to the results of the PAN 2018 contestants, shown in table 7. From this table it becomes clear that the GLAD system using the verification approach, with an average MACRO-F1 score of 0.651, would have achieved seventh place out of thirteen. Furthermore, GLAD's MACRO-F1 score is 0.111 points lower than the score of the top contestant, which is quite remarkable. It is also interesting to see that GLAD using the verification approach performs badly on the first and fourth datasets ("Prob1" and "Prob4"). Bad performance on the first dataset of the evaluation data, which is also the largest dataset, is a trend that can be seen throughout all the results; therefore, GLAD's performance on this dataset is in line with the other results. More interesting is its performance on


dataset four, the smallest one. Both GLAD approaches achieve low scores on this dataset, while the new GAIM systems achieve high scores on it.

5.2 GAIM

Table 5: GAIM results on the English development data

          Classification        Verification          Summing               Meta
          MACRO-F1  MICRO-F1    MACRO-F1  MICRO-F1    MACRO-F1  MICRO-F1    MACRO-F1  MICRO-F1
Prob1     0.515     0.619       0.554     0.629       0.517     0.629       0.473     0.514
Prob2     0.597     0.762       0.783     0.905       0.597     0.762       0.673     0.762
Average   0.556     0.691       0.669     0.767       0.557     0.696       0.573     0.638

Table 6: GAIM results on the English evaluation data

          Classification        Verification          Summing               Meta
          MACRO-F1  MICRO-F1    MACRO-F1  MICRO-F1    MACRO-F1  MICRO-F1    MACRO-F1  MICRO-F1
Prob1     0.557     0.710       0.583     0.722       0.583     0.747       0.56      0.709
Prob2     0.711     0.797       0.729     0.743       0.702     0.797       0.66      0.811
Prob3     0.756     0.875       0.649     0.875       0.766     0.90        0.60      0.85
Prob4     0.787     0.75        0.825     0.813       0.787     0.75        0.787     0.75
Average   0.703     0.783       0.697     0.790       0.71      0.799       0.652     0.78

The results of the newly developed systems on the development data can be found in table 5. The results of GAIM with the classification method are comparable to the results of the classification done with GLAD. GAIM performs better on the large dataset, whereas GLAD performs better on the small development dataset. Consequently, the average MACRO-F1 scores are very similar, with GLAD achieving a MACRO-F1 score of 0.565 and the new GAIM system reaching 0.556. Looking at the development results of the verification approaches, the results are again comparable. However, it has to be said that GAIM outperformed GLAD on both datasets, resulting in an average MACRO-F1 score of 0.669, whereas GLAD achieves a MACRO-F1 score of 0.654. In addition to this, the new system achieves these results with only a fraction of GLAD's run-time, which is especially important if the system is used on larger datasets.

It is interesting to see that for the development data, both the meta classifier and summing resulted not in an increase, but in a decrease in performance. Studying the results on the evaluation dataset in table 6, a few interesting things come forward. First, the new classification method consistently outperforms GLAD's classification, especially on the smaller datasets. The average MACRO-F1 score of 0.703 is also 0.116 points higher than the average MACRO-F1 score of GLAD. Furthermore, this score would have resulted in a theoretical third place in the 2018 shared task, as can be seen in table 7. The results of the verification approaches are more in line with each other. However, there is one big difference on the smallest dataset (Prob4) of the evaluation data. GAIM with the verification approach achieves a MACRO-F1 score of 0.825 on this data, which is the highest of all the systems on this dataset. In comparison, GLAD's verification only manages a MACRO-F1 score of 0.577. The average MACRO-F1 score of 0.697 of the verification approach is almost the same as the score of the classification approach (0.703). The summing technique achieves striking results on the evaluation dataset


when compared to the underwhelming results on the development data. Its overall average MACRO-F1 score of 0.71 is the highest out of all the systems. Again, this would have resulted in a theoretical third place in the 2018 shared task, trailing the top contestant by 0.052 points. In comparison, the meta classifier still achieves underwhelming scores. Its overall MACRO-F1 score of 0.652 is lower than that of all the other newly developed GAIM systems. However, it has to be said that even this lowest score is almost identical to the score of the best performing GLAD system (0.651).

Table 7: PAN 2018 results on the English evaluation data

Place    MACRO-F1
1        0.762
2        0.744
3        0.697
4        0.685
5        0.679
6        0.672
7        0.601
8        0.573
9        0.538
10       0.376
11       0.190
12       0.037


6 Conclusion

In the introduction of this research, two main goals were stated. I will draw conclusions for each goal individually in its respective section.

6.1 GLAD

To recall, the first goal of this paper is:

• To examine to what extent the GLAD system developed in 2015 can be adapted to perform cross-domain AID and achieve competitive results when compared to state-of-the-art AID systems.

GLAD has been successfully adapted from a system that performs AV to a system that does AID. However, the fact that GLAD makes use of many similarity and dissimilarity features results in a complex system. In addition, GLAD relies on document comparisons to extract these features, which makes it computationally expensive, especially on larger datasets, where GLAD's run-time increases rapidly. Furthermore, the results of GLAD on the PAN 2018 evaluation data suggest that the aforementioned similarity and dissimilarity features are not a good fit for this task. It is interesting to see that the matrix approach performed substantially better than the classification approach with GLAD. A possible reason for this is that the implemented matrix approach is closer to AV than to AID, so the similarity and dissimilarity features may perform better with this method. This could also explain the substantial performance gap between the two different GLAD systems. However, looking at the second part of the stated goal, even GLAD with the matrix approach does not obtain competitive results in comparison to the state-of-the-art systems. Therefore, I think it is safe to say that GLAD, in its current adapted state, is not a viable system for AID. The adaptation was successful, but competitive results were not obtained.

6.2 GAIM

The second goal of this research is:

• To develop a new system more in line with state-of-the-art AID systems, compare its performance with those systems and the adapted GLAD system, and see if its results are competitive.

In order to satisfy this goal, two new systems were developed initially. To make the results comparable, both the classification approach and the verification approach were implemented in GAIM. Both approaches in GAIM outperformed their GLAD counterparts substantially. The biggest difference in performance is seen with the classification method. This underlines the earlier observation that the similarity and dissimilarity features used by GLAD are not well suited for AID. GAIM also outperforms GLAD with the verification method, but by a much smaller margin. GAIM's competitive performance also illustrates the strength of character n-grams, as we have seen in the literature before.


In addition to the classification and verification approaches, a meta classifier and a summing classifier were implemented in GAIM. As explained earlier, the developed meta classifier is not a fully "correct" one, due to the lack of cross-validation. This can be seen in the results as well, which are good when compared to GLAD but underwhelming when compared to the other GAIM systems. The summing classifier performed well on the evaluation data and even achieved the best overall results of all the systems, albeit by a small margin.

With regard to the goal stated earlier, I conclude that GAIM is definitely competitive when compared to other state-of-the-art AID systems that are based on some form of n-grams. Furthermore, with regard to AID, GAIM structurally outperforms GLAD and does so in a fraction of GLAD's run-time. To conclude, I think both goals were achieved with different rates of success. GLAD's features are not well suited for AID and, moreover, are too complicated for this task. GAIM illustrates that a simple system based on character n-grams can compete with state-of-the-art AID systems.

For future work, the adapted GLAD system could be improved if a better method is found to represent an AID problem as an AV problem. That way, the AV features implemented in GLAD could perform better. Furthermore, a solution for all the document comparisons that are currently done has to be found so that its run-time can be reduced. For GAIM, implementing a fully functional meta classifier could be interesting, as the classification approach and the verification approach predict quite differently. This would require a different set of data to work with or, perhaps, a workaround for the lack of cross-validation with the current dataset.


7 Bibliography

Custódio, J. E. and I. Paraboni (2018). EACH-USP ensemble cross-domain authorship attribution: Notebook for PAN at CLEF 2018. In CLEF.

Gagala, L. (2018). Authorship attribution with neural networks and multiple features: Notebook for PAN at CLEF 2018. In CLEF.

Goldstein-Stewart, J., R. Winder, and R. Sabin (2009, March). Person identification from text and speech genre samples. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL 2009), Athens, Greece, pp. 336–344. Association for Computational Linguistics.

Halvani, O. and L. Graner (2018). Cross-domain authorship attribution based on compression: Notebook for PAN at CLEF 2018. In CLEF.

Halvani, O., C. Winter, and L. Graner (2019). Unary and binary classification approaches and their implications for authorship verification. ArXiv abs/1901.00399.

Hürlimann, M., B. Weck, E. Berg, S. Šuster, and M. Nissim (2015). GLAD: Groningen Lightweight Authorship Detection.

Kestemont, M., M. Tschuggnall, E. Stamatatos, W. Daelemans, G. Specht, B. Stein, and M. Potthast (2018). Overview of the author identification task at PAN-2018: Cross-domain authorship attribution and style change detection. In CLEF.

Loper, E. and S. Bird (2002). NLTK: The Natural Language Toolkit. In Proceedings of the ACL-02 Workshop on Effective Tools and Methodologies for Teaching Natural Language Processing and Computational Linguistics - Volume 1, ETMTNLP '02, USA, pp. 63–70. Association for Computational Linguistics.

Murauer, B., M. Tschuggnall, and G. Specht (2018). Dynamic parameter search for cross-domain authorship attribution: Notebook for PAN at CLEF 2018. In CLEF.

Overdorf, R. and R. Greenstadt (2016). Blogs, Twitter feeds, and Reddit comments: Cross-domain authorship attribution. Proceedings on Privacy Enhancing Technologies 2016.

Pardo, F. M. R., P. Rosso, M. Potthast, and B. Stein (2017). Overview of the 5th Author Profiling Task at PAN 2017: Gender and language variety identification in Twitter. In CLEF.

Pedregosa, F., G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830.

Rhodes, D. (2015). Author attribution with CNNs.

Sapkota, U., T. Solorio, M. Montes, S. Bethard, and P. Rosso (2014, August). Cross-topic authorship attribution: Will out-of-topic data help? In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, Dublin, Ireland, pp. 1228–1237. Dublin City University and Association for Computational Linguistics.


Schwartz, R., O. Tsur, A. Rappoport, and M. Koppel (2013, October). Authorship attribution of micro-messages. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1880–1891. Association for Computational Linguistics.

Shrestha, P., S. Sierra, F. González, M. Montes, P. Rosso, and T. Solorio (2017, April). Convolutional neural networks for authorship attribution of short texts. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 669–674. Association for Computational Linguistics.

Stamatatos, E. (2009, March). A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60(3), 538–556.

Stamatatos, E., W. Daelemans, B. Verhoeven, P. Juola, A. López-López, M. Potthast, and B. Stein (2015, September). Overview of the Author Identification Task at PAN 2015. In L. Cappellato, N. Ferro, G. Jones, and E. San Juan (Eds.), CLEF 2015 Evaluation Labs and Workshop – Working Notes Papers, 8-11 September, Toulouse, France.
