
Datasets and Models for Authorship Attribution on Italian Personal Writings

Gaetana Ruggiero

Master Thesis

Language and Communication Technologies

Supervised by

Malvina Nissim - University of Groningen
Albert Gatt - University of Malta


A B S T R A C T

Authorship Attribution (AA) is the study of identifying authors by their writing style. Over the past few years, determining the authors of online content has played a crucial role in many fields, such as online security, plagiarism detection and fake news identification. While extensive research has been done in this field for English, little investigation has focused on Italian, the only outstanding case being the study of Elena Ferrante’s true identity. Existing research on AA focuses on texts for which a lot of data is available (i.e. novels, articles), and which do not necessarily reflect an author’s personal writing style, due to editorial interventions. This study approaches the AA task in terms of Authorship Verification (AV), a binary classification task where, given two texts, the goal is to decide whether or not they are written by the same author. Following Hürlimann et al. (2015) and inspired by the work on blogger identification of Mohtasseb et al. (2009), we run the GLAD AV system on Italian forum comments and personal diaries. We introduce two novel datasets suitable for the AV task, which can be easily adapted to other AA tasks. We show the complexity of the data, and analyze the interaction between four different variables: genre, topic, authors’ gender and number of words taken into account per author. We perform intra-topic, cross-topic and cross-genre experiments and discuss the results obtained for each setting. We show that, contrary to what other studies have found (Sapkota et al., 2014; Stamatatos et al., 2015), our cross-topic and cross-genre results are comparable to intra-topic ones. Although no results are available on Italian for comparison, we believe that, overall, our approach achieves competitive scores.


A C K N O W L E D G E M E N T S

Starting the LCT program was quite a challenging opportunity for me. I was first assigned to Malta, where the sun shines 350 days a year, and then sent to Groningen, where the sun shines instead only for the remaining 15. However, I would not have had it any other way. Where would I have gone on Mondays after Albert’s classes in which he was explaining all those (at first) scary formulas if not a calming beach in Sliema or Pembroke? Can you imagine writing a thesis during a pandemic seeing the beach from the balcony but not being able to go swimming? After all, Malta and Groningen were the right places at the right time and I could not be more grateful for the people I met along the way.

Thank you Albert, for showing me from Day 1 what this Master was all about. Once you told me that I should think of models as a box, and that simple advice somehow made those formulas not so scary anymore. Thank you for making all your knowledge available for this work and for the support received along the way. Thank you Malvina, because during this pandemic, knowing that I had an Italian in the team made me always feel a little bit more at home. Before coming to Groningen, everybody kept telling me that Learning from Data was the best course I would ever follow, and so it was, in every aspect. Thank you for all the help that you gave me for this work.

Thank you Marc, because without you none of this would have been possible. If from print('Hello world!') I took a few steps forward, it is entirely thanks to what you taught me. Thank you for always supporting me, for this thesis and for the many times you made time to explain things to me.

Thank you Simone. Writing a thesis is hard enough in itself, but writing a thesis in quarantine during a pandemic is certainly something no one was prepared for. Thank you for making me understand that productivity is a relative concept, and that it is important to make time for yourself regardless of everything else.

Thank you Marion, because you had to stand my complaints, joys and pains for 2 years (and a pandemic). Thank you for helping me through my entire LCT journey.

Thank you Elena, because you never hung up.

Thank you Louis, because although you are more Python-experienced than I am, you never made me feel any less.

Grazie mamma e papà, di tutto. (Thank you mum and dad, for everything.)


C O N T E N T S

Abstract
Acknowledgements
1 introduction
2 background
  2.1 Authorship Attribution
    2.1.1 Sub-Tasks
    2.1.2 Methods and Issues
  2.2 Authorship Verification
  2.3 AA for Italian
3 data
  3.1 ForumFree dataset
  3.2 Diaries dataset
  3.3 Differences between Datasets
  3.4 Data Challenges
  3.5 Pre-Processing
  3.6 Task Framing and Data Structuring
    3.6.1 Creating the known-unknown text pairs
  3.7 Train and Test splits
4 method
  4.1 GLAD Overview
  4.2 Dimensions
  4.3 Experimental Setups
    4.3.1 Experiments Overview
    4.3.2 Intra-Topic Experiments
      Same-Topic KU pairs
      Different-Topic KU pairs
      Mixed-Gender and Single-Gender KU pairs
    4.3.3 Cross-Topic Experiments
    4.3.4 Cross-Genre Experiments
  4.4 Additional Experiments
    4.4.1 Downsizing the Number of Authors
    4.4.2 Different Classifiers and Feature Combos
    4.4.3 Implicit vs. Explicit Gender
  4.5 Evaluation Metrics
  4.6 Baseline
5 results and discussion
  5.1 Baseline Results
  5.2 Intra-topic Experiments
    5.2.1 Medicina Estetica
      Downsizing the Number of Authors
      Explicit Gender
    5.2.2 Programmi Tv
      Downsizing the Number of Authors
      Explicit Gender
    5.2.3 Mix
      Downsizing the Number of Authors
      Explicit Gender
    5.2.4 Diaries
      Bleaching
      Downsizing the Number of Authors
      Explicit Gender
  5.3 Cross-Topic Experiments
      Downsizing the Number of Authors
      Explicit Gender
  5.4 Cross-Genre Experiments
      Downsizing the Number of Authors
      Explicit Gender
  5.5 Summary
6 conclusion
a Distribution of Male and Female Medicina Estetica Texts
b Topic modelling on Medicina Estetica texts
c Downsizing the number of authors


L I S T O F F I G U R E S

Figure 1: Number of female and male authors in the datasets
Figure 2: Proportion of texts written by male and female authors
Figure 3: Comparison between the average number of words per text and the average number of texts per author across the datasets
Figure 4: Example of the creation of known and unknown documents for the same author when considering 400 words per author
Figure 5: Example of pairing same-author and different-author known-unknown document pairs. All pairs are same-author in the first two arrays (before shift), while half are different-author in the second two (after shift). (Dwyer, 2017)
Figure 6: Word-cloud showing the 85 most frequent words in female texts
Figure 7: Word-cloud showing the 85 most frequent words in male texts
Figure 8: c@1*AUC scores for Medicina Estetica texts in relation to the number of words per author, obtained when running GLAD on the same training and test set for all the word settings of a single gender
Figure 9: c@1*AUC scores for Programmi Tv texts in relation to the number of words per author, obtained when running GLAD on the same training and test set for all the word settings within a single gender
Figure 10: c@1*AUC scores in relation to the number of words per author, obtained when running GLAD on the same training and test set for all the word settings of a single gender
Figure 11: c@1*AUC scores in relation to the number of words for the same training and test set with named-entity (NE) converted and bleached texts written by both genders
Figure 12: c@1*AUC scores in relation to the number of words for the same training and test set with named-entity (NE) converted and bleached texts written by females
Figure 13: c@1*AUC scores in relation to the number of words for the same training and test set with named-entity (NE) converted and bleached texts written by males
Figure 14: c@1*AUC scores in relation to the number of words per instance for cross-topic experiments with downsized training and test sets
Figure 15: c@1*AUC scores in relation to the number of words per instance for cross-genre experiments with downsized training and test sets
Figure 16: Distribution of the Medicina Estetica texts written by female authors, as a function of their length
Figure 17: Distribution of the Medicina Estetica texts written by male authors, as a function of their length
Figure 18: Term clusters for Medicina Estetica texts written by female authors
Figure 19: Term clusters for Medicina Estetica texts written by male authors


L I S T O F T A B L E S

Table 1: An overview of the ForumFree dataset. The Avg words per author is computed by calculating the average number of words per individual author, summing the different averages and dividing this sum by the total number of authors. The Avg comments per author is calculated by computing the mean of the number of comments associated with each author, and the Avg words per comment is computed by dividing the sum of the lengths of all comments by the total number of comments.
Table 2: An overview of the Diaries dataset. The Avg words per author is computed by calculating the average number of words per individual author, summing the different averages and dividing this sum by the total number of authors. The Avg stories per author is calculated by computing the mean of the number of stories associated with each author, and the Avg words per story is computed by dividing the sum of the lengths of all stories by the total number of stories.
Table 3: Overview of the experimental setups
Table 4: Example of an intra-topic setting where GLAD is trained and tested on Programmi Tv texts written by female authors, for which we take 2 000 words
Table 5: Example of a cross-topic setting where GLAD is trained on Programmi Tv and tested on Medicina Estetica, while number of words and gender remain the same
Table 6: Example of a cross-genre setting where GLAD is trained on the ForumFree Mix topic and tested on the Diaries, considering male and female authors with 3 000 words each
Table 7: Random baseline results for all the gender settings of Medicina Estetica
Table 8: Random baseline results for all the gender settings of Programmi Tv
Table 9: Random baseline results for all the gender settings of Mix (ForumFree dataset)
Table 10: Random baseline results for all the gender settings of the Diaries dataset
Table 11: Random baseline results for the cross-topic experiments
Table 12: Random baseline results for the cross-genre experiments
Table 13: Training and test set configurations and IT evaluation scores on Medicina Estetica texts written by female and male authors
Table 14: Training and test set configurations and IT evaluation scores on Medicina Estetica texts written only by female authors
Table 15: Training and test set configurations and IT evaluation scores on Medicina Estetica texts written only by male authors
Table 16: Results of the experiments without (Implicit Gender) and with (Explicit Gender) gender as a feature for Medicina Estetica texts written by females and males
Table 17: Training and test set configurations and IT evaluation scores on Programmi Tv texts written by female and male authors
Table 18: Training and test set configurations and IT evaluation scores on Programmi Tv texts written by female authors
Table 19: Training and test set configurations and IT evaluation scores on Programmi Tv texts written by male authors
Table 20: Results of the experiments without (Implicit Gender) and with (Explicit Gender) gender as a feature for Programmi Tv texts written by females and males
Table 21: Training and test set configurations and IT evaluation scores on Mix texts written by female and male authors
Table 22: Training and test set configurations and IT evaluation scores on Mix texts written only by female authors
Table 23: Training and test set configurations and IT evaluation scores on Mix texts written only by male authors
Table 24: Results of the experiments without (Gender Implicit) and with (Gender Explicit) gender as a feature for ForumFree (Mix) texts written by females and males
Table 25: Training and test configurations and IT evaluation scores on diaries made of NE converted text written by both genders
Table 26: Training and test configuration and IT evaluation scores on diaries made of NE converted text written by female authors
Table 27: Training and test configuration and IT evaluation scores on diaries made of NE converted text written by male authors
Table 28: Training and test set sizes and evaluation scores on bleached diaries written by both genders
Table 29: Results of the experiments without (Implicit Gender) and with (Explicit Gender) gender as a feature for the diaries written by females and males
Table 30: Training and test set configurations for cross-topic experiments
Table 31: Evaluation scores for experiments without (Implicit Gender) and with (Explicit Gender) the additional gender feature when training on Medicina Estetica and testing on Programmi Tv
Table 32: Evaluation scores for experiments without (Implicit Gender) and with (Explicit Gender) the additional gender feature when training on Programmi Tv and testing on Medicina Estetica
Table 33: Training and test set configurations for cross-genre experiments
Table 34: Evaluation scores for experiments without (Implicit Gender) and with (Explicit Gender) the additional gender feature when training on ForumFree and testing on Diaries
Table 35: Evaluation scores for experiments without (Implicit Gender) and with (Explicit Gender) the additional gender feature when training on Diaries and testing on ForumFree
Table 36: Results of the experiments run on exactly the same training and test set (within a single gender) for Medicina Estetica
Table 37: Results of the experiments run on exactly the same training and test set (within a single gender) for Programmi Tv
Table 38: Results of the experiments run on exactly the same training and test set (within a single gender) for the entire ForumFree dataset (Mix)
Table 39: Scores for the mixed- and single-gender setting when experimenting on the same training and test set (within one gender) with NE-converted diaries
Table 40: Scores for the mixed- and single-gender setting when experimenting on the same training and test set (within one gender) with bleached diaries
Table 41: Evaluation scores on cross-topic experiments when using the same training and test set for all the word settings
Table 42: Evaluation scores on cross-genre experiments when using the same training and test set for all the word settings
Table 43: Results of the experiments with and without gender as a feature for Medicina Estetica - Females and Males
Table 44: Results of the experiments with and without gender as a feature for Programmi Tv - Females and Males
Table 45: Results of the experiments with and without gender as a feature for the ForumFree dataset (Mix) - Females and Males
Table 46: Results of the experiments with and without gender as a feature for the Diaries - Females and Males


1 I N T R O D U C T I O N

When in 2011 Amina Arraf, a Syrian homosexual woman, started the blog ‘A Gay Girl in Damascus’ and addressed delicate Syrian political and social issues, no one imagined that the real author behind the scenes was the American male blogger Thomas MacMaster (Afroz et al., 2012). Before the uncovering of the hoax, the blog gained enormous popularity around the Internet, and when Amina’s cousin announced that the woman had been kidnapped by the Syrian police, the event caused an uproar among the public. Many major newspapers, such as The Guardian,1 the Washington Post,2 and the Daily Mail,3 reported on the abduction. The U.S. State Department started an investigation,4 and this inspection led to the discovery of the deception (Afroz et al., 2012).

1 https://www.theguardian.com/world/2011/jun/07/syrian-blogger-amina-abdallah-kidnapped
2 https://www.washingtonpost.com/blogs/blogpost/post/gay-girl-in-damascus-syrian-blogger-allegedly-kidnapped/2011/06/07/AGIhp1KHblog.html
3 https://www.dailymail.co.uk/news/article-2000450/American-lesbian-blogger-Amina-Arraf-kidnapped-Damascus-Syria.html
4 https://www.washingtonpost.com/lifestyle/style/a-gay-girl-in-damascus-comes-clean/2011/06/12/AGkyH0RHstory.html

With the increasing amount of text around the Internet, the need to identify the real author behind a text is growing. Can we identify the author of a text by comparing it to other texts of known authorship? Can we verify whether two texts are written by the same person? Authorship Attribution (AA) is the study of identifying authors by their writing style. Besides deception detection, AA has many other real-life applications. Examples are plagiarism detection (Stamatatos and Koppel, 2011), multiple accounts detection (Tsikerdekis and Zeadally, 2014), and criminal identity tracing to ensure online security (Yang and Chow, 2014). One of the most modern applications of AA is terrorism prevention, where, by evaluating the linguistic features of web forum messages, it is possible to identify patterns of terrorist communication (Abbasi and Chen, 2005).

Perhaps the most famous example of Authorship Attribution is the case of The Federalist Papers, a collection of 85 essays on the ratification of the U.S. Constitution written by Alexander Hamilton, James Madison and John Jay, 12 of which are of disputed authorship. Mosteller and Wallace (1963) show how, by performing statistical analysis and observing the word frequency distribution, it is possible to attribute the disputed texts to Madison. Among the most recent examples, instead, there is the case of The Cuckoo’s Calling, a crime novel published under the pseudonym of Robert Galbraith. By using lexical features, n-grams and the 100 most frequent content words, among other sets of features, Juola (2015) shows that the most likely author within a closed set of authors is J. K. Rowling, the world-renowned author of Harry Potter.

Most of the research on Authorship Attribution, however, focuses on types of texts for which a lot of data is available, and mainly on English. Moreover, those types of texts do not necessarily reflect an author’s personal writing style: proof-reading and editorial interventions often introduce modifications to the original texts. Besides J. K. Rowling, for example, Juola (2015) analyzes three other cases of disputed authorship. The first one involves two poems written by Edgar Allan Poe; the second case investigates the true identity of the author of a collection of articles, and the third case explores the authorship of the technical documentation about Bitcoins. What all of these studies have in common is the small number of individual authors considered, and the large amount of text available for each one of them. In the last decade, Authorship Attribution has been one of the main focuses of the PAN shared tasks (Juola and Stamatatos, 2013; Stamatatos et al., 2014, 2015). The datasets used in the PAN competitions are made of essays, novels, newspaper articles and reviews, in 4 languages, i.e. English, Spanish, Dutch and Greek.

To the best of our knowledge, only a few studies have been performed on Italian. The most outstanding one is the case of Elena Ferrante, an Italian novelist whose real name had remained secret for over 25 years (Tuzzi and Cortelazzo, 2018). Stylometry and different machine learning algorithms indicated that the most likely author is Domenico Starnone. Those experiments, once again, were made possible by the large amount of text available for each author.

In many other real-world applications, however, such a large amount of data is rarely available. Especially in forensic applications, the field to which most of the non-literary examples mentioned above belong, the texts of disputed authorship tend to be short. This poses additional challenges to Authorship Attribution systems, since the shorter the text, the harder it is to abstract the writing style of the author who wrote it.

In this study, we approach the Authorship Attribution task in terms of Authorship Verification (AV), a binary classification sub-task whose aim is to establish whether or not two texts are written by the same person. Compared to Authorship Identification, whose goal is to predict the author of a text among a closed set of authors, AV has been claimed to be a more realistic interpretation of the Authorship Attribution task (Koppel et al., 2009, 2012).

We experiment with a less investigated type of texts, i.e. web forum comments and diary fragments. The personal nature and the shortness of these texts make them challenging from many points of view. Therefore, we design experiments that take into account the typology of the texts at hand.

We introduce two novel datasets, ForumFree and Diaries. Although already compiled, the ForumFree dataset was not meant for Authorship Attribution. Therefore, we reformat it following the PAN format and make it suitable for Authorship Verification experiments. The Diaries dataset, instead, was assembled specifically for the present study. Although both datasets are meant to work with Authorship Verification problems, they can be easily adapted to other Authorship Attribution tasks. Due to the lack of publicly available datasets for Italian AA and AV, we believe this to be one of the major contributions of our work to the field.

Following Hürlimann et al. (2015), we run the GLAD AV system on our data. We combine four different dimensions in our experiments, i.e. topic, gender, number of words per author, and genre. We analyze whether the topic of the texts and/or the gender of the authors influences the classifier’s decision. Moreover, in order to investigate how much evidence per author is needed to perform AV on Italian personal writings, we consider different amounts of words per author. To counter the unrealistic assumption that the texts of unknown authorship are drawn from the same distribution as the texts of known authorship (Sapkota et al., 2014), we experiment not only with intra-topic but also with cross-topic and cross-genre settings. The aim of these experiments is to see whether the system can distinguish authorship regardless of the topic and genre of the texts it is trained and tested on.


• In Chapter 2 we provide an overview of previous work on Authorship Attribution and Verification, which lays the foundation for the experiments performed in this study, the method and the type of data used.

• In Chapter 3 we present the data and the way the datasets were formatted. Compared to other studies on AA, this chapter includes a more in-depth analysis of the data and datasets. The novelty of the datasets, together with the complexity of the data analyzed, called for a more extensive description.

• In Chapter 4 we describe the system used to perform our experiments and the experimental setup, together with the baseline system used for comparison.

• In Chapter 5 we present and discuss the results for each experimental setting and present the main findings.

• In Chapter 6 we summarize our entire research, discuss limitations and provide some suggestions for future work.

• In Appendix A we include histograms representing the distribution of Medicina Estetica texts written by females and males (respectively) as a function of their length.

• In Appendix B we include the results for topic modelling on Medicina Estetica texts.

• In Appendix C we display the scores obtained when running the experiments with downsized subsets for all the topic-gender combinations.

• In Appendix D we include the results for gender explicit experiments on female and male subsets for all the topics.

In summary, the research questions addressed in this thesis are the following:

1. In an Authorship Verification task, do the topic of the texts and the gender of the authors influence the classification? If so, in which way?

2. How much text per author is needed in order to perform Authorship Verification on Italian web forum comments and diaries?

3. Is a system able to distinguish authorship regardless of the topic and genre it is trained and tested on?

2 B A C K G R O U N D

In this chapter, we discuss previous literature on Authorship Attribution (AA). We start by distinguishing the three main approaches to AA, highlighting the essential differences between them. We also discuss the most common methods used to tackle the AA task, and the issues related to the amount of training data available and the genre of the texts involved. Then, we take a closer look at the Authorship Verification task, defined as the fundamental problem of Authorship Attribution (Koppel et al., 2012). We present the theoretical issues related to Authorship Verification as a classification task and the most recent work related to it. Finally, we focus on AA research for Italian, giving an overview of what, to the best of our knowledge, are the few studies in the field.

2.1 authorship attribution

Authorship Attribution is the task of quantifying literary style (Holmes, 1994), or in more general terms, the study of identifying authors by their writing style. The main idea behind AA is that every person has certain patterns of language use, namely a sort of authorial fingerprint or human stylome, as defined by Juola (2008) and Van Halteren et al. (2005), respectively. The authorial fingerprint is supposedly hard to camouflage (Li, 2013). This characteristic makes the AA task computationally addressable.

2.1.1 Sub-Tasks

Within the general field of Authorship Attribution, at least three sub-tasks can be identified: Authorship Identification (AID), Authorship Verification (AV) and Author Profiling (AP). Given a text of disputed authorship, the AID goal is to predict which of a closed set of authors is the author of the unknown text (Dwyer, 2017). Stamatatos (2009) describes the task as follows:

In the typical authorship attribution problem, a text of unknown authorship is assigned to one candidate author, given a set of candidate authors from whom text samples of undisputed authorship are available. From a machine learning point of view, this can be viewed as a multiclass, single-label text-categorization task.

Although AID is the most common Authorship Attribution sub-task, its validity has often been questioned. Stolerman et al. (2013) claimed that AID in a closed-world setting, i.e. when there is a fixed set of candidate authors to choose from, suffers from one fundamental flaw, namely that a classifier will always output an author of that set. When the classifier identifies a particular person as the author of an anonymous text, it only means that the unknown document in question is written in a style that is more similar to the style of that author. Koppel et al. (2012) argued that in the real world, the list of candidate authors might be very long, and there is no guarantee that the author of the unknown text is even in that list.

A more realistic approach to Authorship Attribution is thus represented by the Authorship Verification task. Koppel et al. (2012) state:

We therefore consider what we call the “fundamental problem” of authorship attribution: given two (possibly short) documents, determine if they were written by a single author or not. Plainly, if we can solve the fundamental problem, we can solve any of the standard authorship attribution problems, whether in the idealized form often considered or in the more difficult form typically encountered in real life.

The main difference between AID and AV is that in AV there are no multiple candidate authors. Instead, there is only one suspect author, who is known to have written all the known documents (Li, 2013). The unknown text is compared to the set of known documents and is labeled as YES or NO, depending on whether or not the author of the two texts is the same. The task is thus a binary classification task. However, an alternative approach to AV, like the one of Koppel and Schler (2004), sees the AV task as a one-class classification problem, explained in Section 2.2, together with a more extensive analysis of AV.

While AID and AV can be analyzed in light of the more general text classification task, Author Profiling cannot be ascribed to the same framework (Li, 2013). Rather than trying to identify the real author of a text, AP is used to distinguish between classes of authors (Rangel et al., 2013). AP focuses on finding, for example, the age, gender, native language and personality type of an author (Rangel et al., 2013), so that an author profile can be built (Li, 2013). We include AP in the AA sub-tasks overview for the sake of completeness. However, the literature that follows is mainly focused on AID and AV problems.

2.1.2 Methods and Issues

The most useful overview of Authorship Attribution studies is perhaps the work of Stamatatos (2009), a comprehensive survey of the most common methods used to solve the AA task. Although published in 2009, this study is still influential within the AA community. Stamatatos (2009) identifies several categories of features: lexical and character features, which represent the text as a mere sequence of words or characters; syntactic features, such as POS tags; semantic features, such as semantic similarity between words; and, finally, application-specific features, like greetings and farewells in emails.

When trying to identify the author of a text (AID) or deciding whether or not two texts are written by the same person (AV), it seems intuitive to think that the most unusual words (thus lexical features) or phrases that characterize a certain author would be helpful features. However, function words have proved to be more relevant than content words. Koppel et al. (2009) highlight how the frequency of function words is not expected to differ enormously among texts of different topics written by the same author. At the same time, the use of function words does not seem to be consciously controlled by the author (Koppel et al., 2009). Kestemont (2014) supports the idea, adding that the high frequency of those words makes them interesting from a quantitative point of view, and that all authors writing in the same language and time period are likely to use the same function words, reducing again the differences between topics and genres. Nonetheless, Kestemont (2014) observes that the importance of function words drops drastically in languages which make more use of inflections and that function words are, in a way, language dependent. He suggests that the best way to represent function words in a text is to make use of character n-grams, which also makes the approach language independent. Character-level features, in fact, have proved to be the most effective for AA tasks: the best performing AV systems submitted to the PAN 2015 shared task, such as Moreau et al. (2015) and Hürlimann et al. (2015) (used as the main system in the current study), make use of variable-length character-level features. All those systems were designed to deal with different languages, namely English, Dutch, Greek and Spanish (Stamatatos et al., 2015).

The efficiency of the stylometric features mentioned above, its language-independent approach and its portability made GLAD, the system designed by Hürlimann et al. (2015), a perfect choice for this study. A detailed description of GLAD's functioning is offered in Section 4.1.
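To make the feature type concrete, the snippet below extracts variable-length character n-grams with scikit-learn. It is a minimal sketch for illustration only: the n-gram range (2 to 4), the TF-IDF weighting and the toy sentences are assumptions, not GLAD's actual configuration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Character n-grams of length 2 to 4 are taken from the raw string
# (whitespace included), so no tokenizer or POS tagger is needed and
# the same extractor works for any language.
vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(2, 4))

# Two hypothetical Italian forum comments.
texts = ["il trattamento non ha funzionato per niente",
         "la puntata di ieri sera era bellissima"]
X = vectorizer.fit_transform(texts)
print(X.shape)  # (2, number of distinct character n-grams)
```

Because the features are drawn from characters rather than words, the same extractor transfers unchanged across English, Dutch, Greek, Spanish or Italian, which is precisely what makes such systems portable.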

In terms of how the available training texts are organized, Stamatatos (2009) distinguishes between profile- and instance-based approaches to AA. In the former approach, all the texts of a single author are concatenated into one big string, which is used to extract the characteristics of that particular author. A single unseen text is then compared to each author file and the authorship is determined by a distance measure (Stamatatos, 2009). Instance-based approaches, instead, consider each training text as a separate document that contributes to the attribution method (Stamatatos, 2009); the known and unknown document of an instance are thus both made of a single text. In profile-based approaches, a single unknown document is compared to the concatenation of several documents, while in instance-based approaches, a single unknown document is compared to a single known document. Our approach resembles the profile-based approaches because we also concatenate the known texts of an author into a single string. However, unlike in profile-based approaches, our unknown text is also made of concatenated documents. The fact that each instance of our datasets is made of two texts of the same type (i.e. two texts made of concatenated documents) makes our approach instance-based, as illustrated in the sketch below.
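In code, the distinction comes down to how each side of a verification problem is assembled. The following sketch (with hypothetical input texts, not the thesis' actual pre-processing pipeline) shows the construction used here, where both sides are concatenations and each problem remains a single document pair:

```python
def concatenate(texts):
    # Profile-style step: merge several short texts by one author
    # into a single string.
    return " ".join(texts)

# Hypothetical forum comments, split into a known and an unknown portion.
known_comments = ["primo commento ...", "secondo commento ..."]
unknown_comments = ["terzo commento ...", "quarto commento ..."]

# One verification instance: a single (known, unknown) pair of
# concatenated documents, to be labeled YES (same author) or NO.
instance = (concatenate(known_comments), concatenate(unknown_comments))
```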

The choice of concatenating different texts of the same author into a single string originates directly from the characteristics of the data we chose to use in this study, namely forum comments and diaries. As explained in Chapter 1, this data is substantially different from the data most commonly used in AA studies, mostly in terms of length and availability of the training data.

Luyckx and Daelemans (2008) claim that traditional AA studies focus on small sets of authors. The AA task can therefore be solved with high reliability, but the importance of the features used in those studies might be overestimated. They argue that considering a larger number of authors in the training set makes it possible to account for the variability of different features within a single topic. They also identify a second problem in AA studies, namely the ‘unrealistic size of training data’ (Luyckx and Daelemans, 2008). The same concerns are shared by Koppel et al. (2012), according to whom, especially in the AID case, it is usually assumed that there is a lot of text for each candidate author and that the unknown text is relatively long. Burrows (2006) defines 10 000 words as the ‘reliable minimum for an authorial set’.

Most of the research on AA has taken a large amount of available data into account. Juola (2015), for example, presents four case studies, involving Edgar Allan Poe, a person under the pseudonym of Bilbo Baggins, J. K. Rowling and the alleged creator of Bitcoin, Satoshi Nakamoto. The first was a case of disputed authorship of two poems between Edgar Allan Poe and his brother. The second case investigates the real authorship of a few articles claimed to be written by Bilbo Baggins, which would have guaranteed asylum in the USA to the person behind this pseudonym. In the third case, the detective novel ‘The Cuckoo’s Calling’ is found to be written by J. K. Rowling, under the pseudonym of Robert Galbraith. The fourth case sees the authorship of the technical documentation about Bitcoins (wrongly) assigned to Dorian Satoshi Nakamoto. One of the factors that all these cases have in common is that there is a ‘substantial amount of text available for each case’ (Juola, 2015), namely poems, novels, articles and long technical documentation.

However, in most real-life situations, especially the ones in which no literary work is involved, it is unlikely to find such a large amount of data for each author. As a consequence, traditional approaches to AA, which are trained on a lot of data, are less reliable when applied to situations where limited data is available for each author (Luyckx and Daelemans, 2008).

One of the few studies on AA for small texts is the work of Feiguina and Hirst (2007). They parsed the text and generated syntactic labels for each of the words contained in a sentence. They found that bigrams of those labels are a helpful feature when working on short texts. Their approach led to a 99% accuracy on literary texts, but proved to be less effective with simulated forensic data, highlighting how AA is strictly related to topic and genre. An interesting novelty introduced in the experiments by Feiguina and Hirst (2007) was the use of different sizes for the training data, namely 100, 200 and 500 words per text. The aim was to verify how much text is needed for the system to perform best. The results showed that using 500-word texts yields the best scores. Inspired by this approach, we also considered variable lengths for each text, namely 200, 500, 1 000 and 1 500 words. More details are presented in Sections 3.6.1 and 4.2.

De Vel et al. (2001) created a corpus suitable for AA experiments, containing 156 emails from three authors, with each author contributing emails on three topics. Even though emails are generally quite short, they managed to collect 12 000 words per author for all topics. Although this study is a valid example of AA on a less studied genre, the number of authors considered remains small, and the text collected for each author is still quite long. The corpus, thus, cannot be considered representative of the genre, and the AA experiments fall into the same category mentioned above (small pool of authors and long texts).

Abbasi and Chen (2005) experiment with AA on web forum messages, which are short by nature. They perform AID both with a C4.5 classifier (which is based on a decision tree) and an SVM, obtaining an accuracy of 90.10 and 97.00, respectively, on an English dataset, and 71.93 and 94.83 on an Arabic dataset. However, their approach is specifically tailored to terrorism prevention and, therefore, not portable to other domains. They focus mostly on AID for Arabic data, investigating also how the identification performance differs between English and Arabic.

Following the approach of Koppel and Schler (2004) discussed in Section 2.2, Sanderson and Guenter (2006) experiment with newspaper articles, and compile a dataset of 50 journalists, with a minimum of 10 000 words per author. Although the pool of authors is bigger compared to the experiments mentioned before, the amount of text per author is still in line with the ‘reliable minimum’ of Burrows (2006) and thus, large. However, they found that, when using between 1 250 and 5 000 words per text in the test set, the best performance is obtained by training the system on 5 000 words.

Mohtasseb et al. (2009) present an investigation of AID on personal blogs, or online diaries, highlighting the main challenges in working with a genre whose text properties differ from the ones AID systems are usually trained on. The novelty of this study is the use of the LIWC (Pennebaker et al., 2001) and MRC (Wilson, 1988) feature sets together with a collection of more traditional syntactic features. LIWC features divide words into different psychologically and cognitively meaningful categories, such as positive and negative emotions or the use of causation words. Similarly, MRC features consider both linguistic and psychological aspects of the words. The researchers use a corpus of 93 authors and 200 posts on average per author, with post lengths between 200 and 600 words, and use an SVM classifier. They conclude that accuracy increases when there are more words in the post. For shorter posts, however, having more posts enhances the performance. To the best of our knowledge, this is one of the few works investigating AA on diaries, although in the form of online personal blogs. For this reason, Mohtasseb et al. (2009) is one of the works by which the present research is inspired.

2.2 authorship verification

In the AV problem, given one or more documents by a single author, the task is to determine whether or not a new given text is written by this author (Koppel and Schler, 2004). Since AA can be treated as a sequence of AV problems (Koppel et al., 2012; Stamatatos et al., 2014), AA should be solvable if AV is solved (Koppel and Winter, 2014).

Two different interpretations of the AV task have been developed. Koppel and Schler (2004) define AV as a one-class classification problem. They write:

If, for example, all we wished to do is to determine if a text was written by Shakespeare or Marlowe, it would be sufficient to use their respective known writings, to construct a model distinguishing them, and to test the unknown text against the model. If, on the other hand, we need to determine if a text was written by Shakespeare or not, it is very difficult – if not impossible – to assemble an exhaustive, or even representative, sample of not-Shakespeare.

The writers highlight that a situation in which one author is suspected to have written a text, but there is no set of alternative suspects, is a common one. In a one-class classification approach, given a known document written by author Ak and an unknown document written by author Au, the system is thus expected to determine whether Ak = Au, i.e. to recognize a given class, rather than discriminate among several classes (Hürlimann et al., 2015). In the real world, however, there is an abundance of negative examples, i.e. texts that are not written by the author in question: following the example presented above, the texts not written by Shakespeare far outnumber the texts he actually wrote (Koppel and Schler, 2004). Nonetheless, even though these examples are available in large amounts, selecting the most representative ones might be a difficult task (Koppel and Schler, 2004). What is not-Shakespeare representative of? What are the parameters that identify a not-Shakespeare text? The notion of representativeness thus becomes problematic. Hence, modelling AV as a one-class classification task makes it harder to solve, mostly due to the challenges that arise in distinguishing between elements of the class and outliers (Halvani et al., 2016).

A more natural way of performing the AV task is to model it as a classification task, and more specifically, as a binary classification task (Koppel et al., 2009; Hürlimann et al., 2015). The two classes are YES when Ak = Au and NO when Ak ≠ Au (Hürlimann et al., 2015). Given a dataset with multiple problems, the way in which negative examples are created distinguishes extrinsic from intrinsic approaches (Stamatatos et al., 2014): extrinsic approaches make use of external documents (Seidman, 2013; Koppel and Winter, 2014), while intrinsic approaches do not (Hürlimann et al., 2015).

The most famous example of an extrinsic approach is the Impostor Method, proposed by Koppel and Winter (2014). They produce a set of additional authors, the ‘impostors’, taken from external sources (like the Web), and ask the system whether the known document is sufficiently more similar to the unknown document than any of the impostors. To measure document similarity, they select a random subset of features. The method has proved to work well even when the documents are no longer than 500 words (Koppel and Winter, 2014).
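A minimal sketch of the idea follows; it is not Koppel and Winter's exact procedure, and the cosine similarity, the number of trials and the fraction of sampled features are all assumptions for illustration.

```python
import random
import numpy as np

def impostors_score(known, unknown, impostors, trials=100, feat_frac=0.5):
    """Fraction of random-feature trials in which the known document is
    more similar to the unknown one than every impostor document.
    All inputs are 1-D numpy feature vectors of equal length."""
    n = len(known)
    k = max(1, int(n * feat_frac))
    wins = 0
    for _ in range(trials):
        idx = random.sample(range(n), k)  # random subset of features
        def sim(a, b):
            a, b = a[idx], b[idx]
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            return float(a @ b) / denom if denom else 0.0
        if sim(known, unknown) > max(sim(imp, unknown) for imp in impostors):
            wins += 1
    return wins / trials  # a high score suggests the same author
```

A threshold on this score then yields the final YES/NO verification decision.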

Luyckx and Daelemans (2008), instead, build negative examples by using fragments from all the authors in the training data except the actual author of the known document. This method makes the approach intrinsic, and has in part inspired how negative instances are created in our approach (Section 3.6).

One of the most well-known approaches to AV was introduced by Koppel et al. (2007), and is called the Unmasking Method. The main idea behind this technique is to iteratively remove the features that discriminate most strongly between the known and the unknown document and observe how the accuracy changes. If the accuracy drops significantly, it means that the two documents are written by the same author. On the contrary, if the degradation is slow and smooth, the two documents are written by two different people. The main hypothesis of Koppel et al. (2007) is that if two texts are written by the same author, then their differences are reflected only in a small number of features. Although quite powerful for long documents, this method loses its effectiveness with short texts. Sanderson and Guenter (2006) observed the accuracy degradation in same-author and different-author curves for short and long texts. They found that with short texts, same- and different-author curves are very similar, and that there is no significant drop in accuracy in same-author curves. The unmasking effect is thus less pronounced. Hence, the unmasking method is less useful when dealing with short texts.
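The sketch below illustrates the degradation curve at the heart of unmasking. It is an assumption-laden simplification, not the original paper's settings: the chunking granularity, the bag-of-words feature set capped at 250 features, and the number of features removed per round are all placeholders.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

def unmasking_curve(chunks_a, chunks_b, rounds=10, drop=3):
    """Cross-validation accuracy for separating text chunks of document A
    from chunks of document B, while iteratively removing the most
    discriminative features. A steep accuracy drop suggests one author."""
    vec = CountVectorizer(max_features=250)
    X = vec.fit_transform(chunks_a + chunks_b).toarray()
    y = np.array([0] * len(chunks_a) + [1] * len(chunks_b))
    curve = []
    for _ in range(rounds):
        curve.append(cross_val_score(LinearSVC(), X, y, cv=3).mean())
        weights = LinearSVC().fit(X, y).coef_[0]
        strongest = np.argsort(np.abs(weights))[-drop:]
        X = np.delete(X, strongest, axis=1)  # unmask: drop the top features
    return curve
```

With short texts there are few chunks and few reliable features, which is exactly why the curves for same- and different-author pairs become indistinguishable.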

Authorship Verification has been the main focus of the PAN shared tasks in 2013 (Juola and Stamatatos, 2013), 2014 (Stamatatos et al., 2014), 2015 (Stamatatos et al., 2015) and 2020 (Bevendorff et al., 2020). The PAN 2013 and 2014 shared tasks focus on Authorship Verification for Dutch, English, Greek and Spanish on four genres, namely essays, reviews, novels and opinion articles (Stamatatos et al., 2014). While focusing on the same languages, the 2015 shared task targets cross-topic and cross-genre AV, which means that the known and unknown document do not necessarily match in topic and/or genre (Stamatatos et al., 2015). Both at PAN 2013 and 2014, the best performing approaches employed extrinsic methods, proposing a variant of the impostors method (Stamatatos et al., 2015). The top performing approaches at PAN 2015 were the ones of Bagnall (2016) and Moreau et al. (2015). The former uses a character-level Recurrent Neural Network, which managed to perform better than SVM-based systems (Dwyer, 2017). The latter, instead, combines different AV approaches into an ensemble method. Both of these methods, however, are computationally expensive, mostly due to the need to fine-tune the hyperparameters. On the contrary, Bartoli et al. (2015) and Hürlimann et al. (2015) achieve the best results among the intrinsic methods, with significantly reduced running times. These approaches use traditional machine learning methods, namely a random forest classifier and an SVM, respectively. Among the 10 AV systems compared by Halvani et al. (2018) on new data, GLAD performed best, showing high adaptability to new datasets. Given its portability and language-independent approach, we adapted Hürlimann et al. (2015) to the present study.

Since PAN 2015 currently represents the state of the art on AV, we adopted the same metrics used in the shared task, i.e. c@1, AUC and their product. However, since these studies were carried out on languages other than Italian, our results are not directly comparable to theirs. Italian is one of the languages used in the AV task at PAN 2020, together with English, Dutch and Spanish (Bevendorff et al., 2020). The data used consist of fanfiction texts, but since the competition is still ongoing, no overview of the submitted systems is available at the moment. At the time of writing, results are available, but it is not clear to which language they refer. Therefore, a proper comparison with previous methods and results cannot be made.
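For reference, c@1 rewards a system for leaving hard problems unanswered rather than guessing: c@1 = (nc + nu * nc / n) / n, where n is the total number of problems, nc the number of correct answers and nu the number of unanswered ones. A minimal sketch, following the PAN convention that an output of exactly 0.5 counts as unanswered (the input format here is an assumption for illustration):

```python
def c_at_1(scores, gold):
    # scores: per-problem probabilities that a pair is same-author;
    # gold: booleans (True = same author). Exactly 0.5 = unanswered.
    n = len(gold)
    correct = sum(1 for s, g in zip(scores, gold)
                  if s != 0.5 and (s > 0.5) == g)
    unanswered = sum(1 for s in scores if s == 0.5)
    return (correct + unanswered * correct / n) / n

print(c_at_1([0.9, 0.5, 0.2, 0.5], [True, True, False, False]))  # 0.75
```

Systems are then ranked by the product c@1 * AUC, the measure reported in the figures and tables of Chapter 5.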


2.3 aa for italian

Italian has rarely been taken into consideration in Authorship Attribution research, and most of the research available to the public targets AID rather than AV. To the best of our knowledge, there are only two studies dedicated to Italian AID, Tuzzi and Cortelazzo (2018) and Kestemont et al. (2019). As mentioned in the Introduction, the first one is a major work focused on uncovering the real identity of Elena Ferrante, a world-renowned Italian novelist whose real name has been kept secret for over 25 years. Even though the literary mystery has been much discussed over the years, this case was not the focus of scientific research until 2017, when an Italian research team decided to approach the problem and submit the results to the international community for debate. The various studies on Elena Ferrante's identity were then collected by Tuzzi and Cortelazzo (2018) in an extensive book. Given the interdisciplinary background of the researchers who contributed to this study (linguists, social scientists, computer science experts, mathematicians, statisticians), the problem of Elena Ferrante's identity was analyzed both from a qualitative and a quantitative point of view, including purely linguistic analysis as well as computational methods. Among the computational approaches, the unmasking method was the most used (Mikros, 2018; Savoy, 2017), but data compression was also attempted (Lalli et al., 2006). Besides being an AID problem with a closed set of candidate authors, and not an AV problem, the systems were once again trained on a large amount of data, i.e. novels. Tuzzi and Cortelazzo (2018) collected a corpus of 150 novels written by 40 different authors over a period of 30 years, with each novel including a minimum of 10 000 tokens. Almost all the studies on the corpus revealed that the most likely author behind Elena Ferrante is Domenico Starnone.

The second study involving Italian AID is the PAN shared task of 2018 (Kestemont et al., 2019). The task is to analyze cross-domain AID on fan-fiction texts in four languages, namely English, Italian, Polish and Spanish. The best performing system for Italian (and third overall) was the one proposed by Halvani and Graner (2018). Their approach was based solely on a compression algorithm and a simple similarity measure, without involving a training procedure, and achieved a macro F1 of 0.752 for Italian.

As mentioned in Section 2.2, the PAN 2020 shared task focuses on Italian as well, but the results of the submitted systems are not official yet. When published, those methods will represent the state of the art on Authorship Verification for Italian data. Future work on the present study could consider those results as a reference for comparison.

3 D A T A

A major issue for Italian, as for many languages, is the lack of publicly available datasets suitable for the Authorship Attribution task. In this study, we use the ForumFree and Diaries datasets. The ForumFree dataset is a courtesy of the Italian Institute of Computational Linguistics “Antonio Zampolli” (ILC) of Pisa.1 The initial dataset was not originally meant for Authorship Attribution; however, we adapt it to the task at hand. This dataset is not currently publicly available. Diaries is a new dataset compiled from online sources. The two novel datasets represent one of the major contributions of this work to the field of AA for Italian. Although in the present study they are specifically structured for Authorship Verification tasks, the datasets can be easily adapted to work for other AA tasks (see Section 2.1.1).

Since the way the task is framed affects the structure of the data, and the kind of data used motivates the chosen methodology, we include in this chapter a more in-depth analysis of the data and datasets. In the following sections, an overview of the two datasets is presented (Sections 3.1 and 3.2). Section 3.3 covers the essential differences between the two datasets, mostly in terms of the number of authors and texts associated with them. In Section 3.4, the challenges of working with this kind of data are described. Section 3.5 illustrates the pre-processing steps that the data underwent, while Section 3.6 describes the dataset formatting process. Finally, Section 3.7 explains how the training and test sets are created.

3.1 forumfree dataset

ForumFree2 is a platform which allows users to create their own forum. The ForumFree dataset is a large collection of 167 forums, containing 6 039 801 posts and comments in Italian, spanning 31 different topics. The initial dataset was not originally meant for Authorship Attribution. However, since the authors of the comments and their gender are annotated, a subset of the dataset could be reformatted in such a way that it can be used for Authorship Verification experiments. The subset was restructured according to the PAN 2015 format (Stamatatos et al., 2015), commonly used in AA tasks. More details about the reformatting process are presented in Sections 3.5 and 3.6. Hereafter, ‘ForumFree dataset’ refers to the subset chosen for the present study.

The ForumFree dataset covers two topics, Medicina Estetica (Aesthetic Medicine) and Programmi Tv (Tv programmes), the latter corresponding to the topic ‘Celebrities’ in the original dataset. A third topic, Mix, is simply the union of Medicina Estetica and Programmi Tv. It was created to run experiments where the known-unknown text pairs can be either same- or different-topic, and to run cross-topic experiments. More details are presented in Section 4.3. Table 1 contains an overview of the ForumFree dataset after the pre-processing steps described in Section 3.5. The table displays, for each topic, the number of female (F) and male (M) authors considered, and their total (Tot); the total number of forum comments that belong to that topic (# Comments); the average number of words per author (Avg words per author) and per comment (Avg words per comment); and the average number of comments written by each single author (Avg comments per author).

1 http://www.ilc.cnr.it/
2 https://www.forumfree.it/


Topic                F     M   Tot   # Comments   Avg words/author   Avg comments/author   Avg words/comment
Medicina Estetica   33    44    77       56 198                 63                   661                  48
Programmi Tv        78    71   149      153 019                 32                   812                  22
Mix                111   115   226      209 217                 41                   791                  29

Table 1: An overview of the ForumFree dataset. The Avg words per author is computed by calculating the average number of words per individual author, summing the different averages and dividing this sum by the total number of authors. The Avg comments per author is calculated by computing the mean of the number of comments associated with each author, and the Avg words per comment is computed by dividing the sum of the lengths of all comments by the total number of comments.
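The three averages in Table 1 (and their analogues in Table 2) can be reproduced in a few lines of code. The sketch below assumes the dataset is available as a list of (author, text) pairs, a hypothetical format rather than the actual file layout, and reads the caption's "average number of words per individual author" as each author's mean comment length:

```python
from collections import defaultdict

def table_stats(comments):
    """comments: list of (author, text) pairs."""
    lengths = defaultdict(list)  # words per comment, grouped by author
    for author, text in comments:
        lengths[author].append(len(text.split()))
    n_authors, n_comments = len(lengths), len(comments)
    # Macro-average: average within each author first, then across authors.
    avg_words_per_author = sum(sum(l) / len(l) for l in lengths.values()) / n_authors
    avg_comments_per_author = n_comments / n_authors
    # Micro-average: all comment lengths pooled together.
    avg_words_per_comment = sum(sum(l) for l in lengths.values()) / n_comments
    return avg_words_per_author, avg_comments_per_author, avg_words_per_comment
```

The macro/micro distinction explains why Avg words per author and Avg words per comment can differ even though both are derived from the same comment lengths.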

Two main reasons motivated the topic choice: the more or less balanced distribution (compared to other topics in the original dataset) of female and male authors within each topic, and the total number of authors who wrote in those forums, which was considered large enough to run AA experiments.

3.2 diaries dataset

The Diaries dataset is a collection of diary fragments included in the project Italiani all’estero: i diari raccontano3 (Italians abroad: the diaries narrate), which is in turn a selection from a bigger diary collection of the Italian National Diary Archive Foundation. These fragments are the diaries, letters and memoirs of Italian people who lived abroad between the beginning of the 19th century and the present day. The Diaries dataset is much smaller than the previous one, with 1 428 stories written by 266 authors. The website the diaries are taken from does not contain any gender information about the authors. Hence, gender was manually annotated. One of the authors was anonymous and it was not possible to determine his/her gender from the text. This author was discarded, and his/her texts were not included in the experiments. The authors considered are thus 265, for a total of 1 422 stories. An overview of the dataset can be seen in Table 2. The structure of the table resembles that of Table 1. However, since more than one topic was indicated for each diary fragment (story), the topic division is blended and topic categories were not taken into consideration.

Topic # Authors # Stories Avg words per author Avg stories per author Avg words per story F M Tot / 77 188 265 1422 462 5 477

Table 2: An overview of the Diaries dataset. The Avg words per author is computed by calculating the average number of words per individual author, summing these averages and dividing the sum by the total number of authors. The Avg stories per author is the mean of the number of stories associated with each author, and the Avg words per story is the sum of the lengths of all stories divided by the total number of stories.

As with the ForumFree dataset, the Diaries dataset was reformatted according to the PAN format for AA tasks.




Figure 1: Number of female and male authors in the datasets

Figure 2: Proportion of texts written by male and female authors

It is important to note that the texts contained in the dataset span a period of more than a century (1901-2014). Moreover, the age and level of education of the authors are not controlled for, and are thus not homogeneous. More details about the challenges that diaries pose as a genre are presented in Section 3.4.

3.3 differences between datasets

As opposed to the datasets usually used for AA tasks (discussed in Section 2.1), the ForumFree and Diaries datasets contain both a large number of authors and short texts for each author. However, besides the genre of the texts they contain (web forum comments and diaries), they present some differences.

Figure 1 displays an overview of the number of female and male authors across the ForumFree topics and the Diaries dataset. It is important to note that the topic Mix represents the entire ForumFree dataset, since it is the union of Medicina Estetica and Programmi Tv. As can be seen, the number of female authors exceeds the number of male authors only in the Programmi Tv topic; in all the other cases, male authors prevail. However, while for Medicina Estetica, Programmi Tv and Mix the numbers of female and male authors are approximately comparable, in the Diaries the number of male authors is more than twice the number of female authors. Figure 2 shows the percentage of stories written by female and male authors in each of the topics and datasets. The most balanced subset in this respect is Medicina Estetica, with approximately 45% of the texts written by female authors and 55% written by male authors. In all the other cases, female authors account for the larger share of the texts. In the Diaries dataset there are fewer female than male authors, but the females are more prolific: male authors, who represent 70% of the authors in the dataset, write only roughly 35% of the texts.

Figure 3 compares the average number of words per text to the average number of texts per author in all the topics and datasets; the exact numbers are reported in Tables 1 and 2. Programmi Tv has the fewest words per text, but the highest number of texts per author. In the diaries, instead, single authors write fewer stories, but each story reaches roughly 500 words on average. These characteristics are in line with the genre of the texts: forum comments are usually short, with an average of 29 words per text (see Table 1), while diary fragments are longer.



Figure 3: Comparison between the average number of words per text and the average number of texts per author across the datasets

Comparing the two datasets (Mix and Diaries), it can thus be said that the ForumFree dataset contains a larger number of authors and a higher number of texts than the Diaries one, but its texts are significantly shorter than the diary fragments. Conversely, the Diaries dataset contains fewer authors and fewer texts overall, but its texts contain more words; hence, the average number of words collected per author is also higher (462, as opposed to 41 words per author in ForumFree).

3.4 data challenges

In Sections 2.1 and 2.2 we have seen how research on Authorship Attribution and Verification has experimented with datasets whose common characteristic is that they contain text in large quantities, such as poems, novels or big collections of articles and technical documentation. Besides their availability, what all these texts have in common is that they are expected to be well written. It is very likely, if not mandatory, that such documents are proofread before being published. As a consequence, grammatical and syntactic mistakes are significantly reduced, especially when compared to texts of a more personal nature such as the ones used in this study.

For the ForumFree data, topic classification is available (Medicina Estetica and Programmi Tv). As explained in Section 4.1, the model used in this study is n-gram based, which means that it tends to capture a lot of lexical information. We assume that documents belonging to the same topic talk about the same thing, so that the lexicon captured by the system is more similar for same-topic than for different-topic documents. This diminishes the chances of overfitting, since the classification is then less likely to reduce to a topic categorization task. However, forum comments present some challenges. First, comments are short (see Table 1 for more details). If an author does not comment multiple times, the text does not contain enough information to extract the author's stylistic characteristics. To account for the shortness of the texts, we concatenate texts written by the same author into a single string, so that a larger amount of text is associated with that author. When it is not possible to reach the desired number of words for a single author (minimum 400, maximum 3 000), the author and the texts are discarded. More information about this process is contained in Section 3.6, and a sketch of this step is given below. Second, web forum comments often contain noise; consider Example (1) below.
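As an illustration only, the concatenation and filtering step could look as follows. The input format and function name are hypothetical, and truncation at the 3 000-word maximum is an assumption made for this sketch rather than a documented detail of the pipeline.

    def build_author_documents(comments_by_author, min_words=400, max_words=3000):
        """Concatenate each author's comments and enforce the word-count bounds.

        comments_by_author: assumed dict mapping an author to a list of
        comment strings.
        """
        documents = {}
        for author, comments in comments_by_author.items():
            words = " ".join(comments).split()
            if len(words) < min_words:
                continue  # too little material to model this author's style
            documents[author] = " ".join(words[:max_words])
        return documents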



(1) Noiiiiaaaaaa E cmq anche lori non mi piace molto giocatrice e furba dice e ritratta. Parla male del gruppone e poi con loro fa i complimenti. Per Ivan mi dispiace.

Noiiiiaaaaaa might be interpreted either as a contraction of 'No, ja', or as a misspelling of the word 'Noia'. The former is an expression typical of the Italian spoken in the Centre and South of Italy, and would be translated to English as 'No, come on'. The latter would be translated to English as 'What a bore'. In both cases, the word deviates from standard Italian. Moreover, Example (1) contains the abbreviation cmq, which stands for the word 'comunque' ('however' in English). These characteristics make the text challenging, because a system needs to cope with this language and find a way to generalize from it. Such noise is very typical of online user-generated content, and came to the fore in Natural Language Processing and Computational Linguistics when researchers began to properly experiment with social media texts (Baldwin et al., 2013; Eisenstein, 2013b). As previously discussed in Section 2.1, by using character-level n-grams we are capturing 'a bit of everything' (Kestemont, 2014), since they are sensitive both to the content and to the form of a word.
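To make this concrete, character-level n-grams can be extracted, for instance, with scikit-learn; this is an illustrative sketch, not GLAD's actual feature configuration.

    from sklearn.feature_extraction.text import CountVectorizer

    # Character 3-grams taken from within word boundaries pick up non-standard
    # forms such as "cmq" and "Noiiiiaaaaaa" alongside ordinary morphology.
    vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))
    X = vectorizer.fit_transform(["Noiiiiaaaaaa E cmq anche lori non mi piace molto"])
    print(sorted(vectorizer.get_feature_names_out())[:10])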

The Diaries dataset, instead, presents different challenges. In their study on blogger identification, Mohtasseb et al. (2009) analyze how the language of diaries differs from other genres. Although their analysis is carried out on online diaries, the conclusions they reach are also valid for the more traditional diaries in our corpus. Firstly, diary fragments are not written in response to a post or to a person, as happens with web forums. The authors spontaneously write about their lives and experiences, including details, names and places. To prevent the system from learning to associate a certain name or location with a certain author, generating a topic bias, we consider Named Entity Recognition and label conversion a necessary step in our study. The process is described in Section 3.5. As argued by Mohtasseb et al. (2009), the text in news columns might look similar to personal blogs, as it 'comments about an event, opinion or experiment'. However, the text of a diary is not specifically directed at anyone, and thus there is no pre-determined subject the author talks about. Secondly, the authors of diaries are more likely to use words that express how they feel, their mood and their opinions. For these reasons, diaries are perfect candidates for the analysis of an author's psychological traits and changes (Cohn et al., 2004). Cohn et al. (2004) also observe that diaries can be used to track changes in the writing style of a single author over time. Future work could take time into account as an additional dimension of analysis in our dataset.

In our case, however, the language of the diaries is even more complex, since Italian presents many challenges in itself. The diary fragments collected in the dataset were written between 1901 and 2014. The language has changed enormously over a century (Bolasco and Canzonetti, 2003), and it is still in continuous evolution. Moreover, Italian also differs greatly across regions (Cerruti, 2011). These factors make the AV task easier, since differentiating between two authors is less difficult if they present very different writing styles due to such aspects (i.e. time and place) (Bobicev et al., 2013). However, the system might be prone to memorizing this information instead of learning how to generalize from it.

The following is an example of a passage taken from a diary fragment written in 1905:

(2) L’indomani dopo questa batuta con mezza pagnota sotto il braccio, mi aviò lungo il canale in cerca di lavoro e quando mi presentavo al caposquadra mi guardavano da capo a piedi e per la verità potevano vedere solo un ragazzo pocco svilupato, ma con un po di brio e non mi dicevano altro che per il momento non vi era posto, ripasarci, e così pasava la prima giornata senza esito, ma l’idomani senza scoraggiarmi ripresi la mia ricerca ma questa volta con esito positivo avevo trovato già lavoro. In chè cosa questo lavora consisteva? [...] Ma la buona gente che la notte mi accompagnavano erano paesani e quindi l’indomani si premuravano a informare in paese che Giovanni Arru era diventato ciecco, e senza tanti comprimenti i familiari la notizia la comunicavano alla mia Povera Mamma, ma per fortu[na] il giorno prima di questa notizia mia Mamma aveva ricevuto un piccolo vaglia che erano che pocchi soldi che si risparmiavano e quindi rispose che non credeva alla notizia in quanto io non avevo comunicato niente e perché aveva anche ricevuto il vaglia, intanto quella povera donna [restava] nel dubio e quindi non tanto tranquilla sino a ricevere mie notizie.

The grammatical errors (pagnota instead of pagnotta, aviò instead of avviò, pocco instead of poco, etc.) can be due to an author's mistake, to his/her level of education, or to the Italian spoken at that time. The possibility that the orthography is influenced by the spoken form is not new in research: Eisenstein (2013a) observes how certain types of contractions and apparent orthographic errors in tweets are strictly correlated with the phonology of certain American English dialects. Moreover, it is evident to a native speaker that in Example (2) the syntactic structure seems archaic and obsolete, with very long sentences and poorly connected clauses.

Moreover, the lack of topic classification in the Diaries dataset makes the AV task more complex. In this case, the author-specific characteristics would not be the most discriminative features: lexical features would prevail, and there would be a high risk of turning the AV task into a semantic or topic categorization task. In other words, the system would be more prone to classifying the instances according to whether or not the two texts talk about the same topic.

The language used in the diaries, together with the lack of topic classification, inspired a set of experiments in this thesis performed by bleaching the text (van der Goot et al., 2018). The aim is to flatten the time, space and age differences in language mentioned above and to avoid topic biases. Creating an abstract representation of the words makes AV a largely language-independent task. The bleaching method is explained in more detail in Section 3.5.

3.5 pre-processing

Prior to its use as input for the GLAD system (Hürlimann et al., 2015), the text underwent slight modifications.

forumfree Little preprocessing was performed on the ForumFree dataset. Many comments contained only the word up, which is commonly used on the internet to bring an older post back into visibility. Those comments were removed from the dataset, together with their authors when this was the only text associated with them.
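As a minimal sketch of this filtering step, assuming the comments are available as (author, text) pairs:

    def drop_up_comments(comments):
        """Remove 'up' bump comments; authors whose only contribution was such
        a comment never enter the resulting dictionary and are thus dropped."""
        kept = {}
        for author, text in comments:
            if text.strip().lower() != "up":
                kept.setdefault(author, []).append(text)
        return kept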

diaries Named Entity Recognition (NER) was performed, instead, on the diaries prior to their use as input for GLAD. As explained in Section 3.4, the stories narrated in the diaries are of a very personal nature, which means that many proper nouns and names of locations are used. To avoid relying on these explicit clues, which are strong but not indicative of personal writing style, proper names, locations and organizations were replaced by their corresponding labels, namely PER, LOC and ORG. Extracts (3) and (4) display an example of original and label-converted text, respectively. Named Entity Recognition was performed using the spaCy library (Explosion, 2017), which offers an Italian model trained on the Italian UD ISDT Corpus and on WikiNER (Nothman et al., 2013).

(3) Il 25 novembre del 1961 salii a bordo dell’aereo che ci avrebbe portato verso nord; portavo con me 10,5 kg di bagagli come consentito, e di me ero pronto ad abbandonare il clima estivo dell’emisfero sud con quello invernale di New York. [...] Atterrammo a New York alle ore 2 del mattino del 26 novembre. [...] Non appena scesi dall’aereo, vidi il viso sorridente di Teresa che mi aspettava fuori dalla folla per accoglierci.

(4) Il 25 novembre del 1961 salii a bordo dell’aereo che ci avrebbe portato verso nord; portavo con me 10,5 kg di bagagli come consentito, e di me ero pronto ad abbandonare il clima estivo dell’emisfero sud con quello invernale di LOC. [...] Atterrammo a LOC alle ore 2 del mattino del 26 novembre. [...] Non appena scesi dall’aereo, vidi il viso sorridente di PER che mi aspettava fuori dalla folla per accoglierci.
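A minimal sketch of this replacement with spaCy is given below, assuming the Italian model it_core_news_sm (the exact model version used is not specified here). Only PER, LOC and ORG spans are replaced, mirroring the choice, explained next, to leave MISC untouched.

    import spacy

    nlp = spacy.load("it_core_news_sm")  # assumed Italian model

    def mask_entities(text):
        """Replace PER, LOC and ORG entity spans with their labels."""
        doc = nlp(text)
        pieces, last = [], 0
        for ent in doc.ents:
            if ent.label_ in {"PER", "LOC", "ORG"}:
                pieces.append(text[last:ent.start_char])  # text before the entity
                pieces.append(ent.label_)                 # the label itself
                last = ent.end_char
        pieces.append(text[last:])
        return "".join(pieces)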

The spaCy NER model uses four labels, namely PER, LOC, ORG and MISC (miscellaneous). However, we observed that many words were incorrectly tagged as MISC. Hence, we did not replace those words with the MISC label. Moreover, dates were not accounted for.

As described in Section 3.4, to account for the lack of topic classification and for the language used in the diaries, a set of experiments was performed by bleaching the text prior to its input to the GLAD system. The bleaching method was proposed by van der Goot et al. (2018) for the cross-lingual Gender Prediction task. They transform words into an abstract representation that deviates from lexical forms, and show that those features are useful to identify the gender of an author across texts in different languages. We partially modify the original feature extraction process by leaving out two of the features (i.e. PunctC and Vowel-Consonant). We use four features:

Shape: Uppercase letters were transformed into ‘U’, lowercase letters into ‘L’, digits into ‘D’ and everything else into ‘X’.

PunctA: Emojis were transformed into ‘J’, emoticons into ‘E’, punctuation into ‘P’ and alphanumeric characters into ‘W’. Consecutive alphanumeric characters were collapsed into a single ‘W’.

Length: The number of characters of each word, preceded by 0 when the word contains fewer than 10 characters.

Frequency: Each word is represented by the log of its frequency in the dataset.

We concatenate the four features to get the abstract representation of the text. It is important to note that we transformed the text that had already undergone named-entity conversion. In the following, an example of the transformation is presented: Example (5) contains the named-entity-converted text, while Example (6) contains its abstract representation.

(5) Ho fatto un lungo viaggio con il ORG e sua figlia , fino al sud dell ' LOC .

(6) ULW025 LLLLLW056 LLW029 LLLLLW055 LLLLLLLW076 LLLW038 LLW029 UUUW030 LW019 LLLW036 LLLLLLW064 XP0110 LLLLW046 LLW027 LLLW033 LLLLW046 X019 UUUW030
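The feature extraction can be sketched as follows. This is an illustrative reimplementation: it assumes whitespace-tokenised input and a rounded natural logarithm for the Frequency feature (the exact log base and rounding are not specified above), and it omits the emoji and emoticon classes of PunctA for brevity.

    import math
    from collections import Counter

    def bleach(tokens, freqs):
        """Concatenate the Shape, PunctA, Length and Frequency features per token."""
        bleached = []
        for tok in tokens:
            # Shape: U/L/D/X per character.
            shape = "".join(
                "U" if c.isupper() else "L" if c.islower() else "D" if c.isdigit() else "X"
                for c in tok
            )
            # PunctA: runs of alphanumeric characters collapse to a single 'W';
            # everything else is treated as punctuation ('P') in this sketch.
            punct = []
            for c in tok:
                if c.isalnum():
                    if not punct or punct[-1] != "W":
                        punct.append("W")
                else:
                    punct.append("P")
            length = f"{len(tok):02d}"               # zero-padded below 10 characters
            freq = str(round(math.log(freqs[tok])))  # assumed: rounded natural log
            bleached.append(shape + "".join(punct) + length + freq)
        return " ".join(bleached)

    tokens = "Ho fatto un lungo viaggio con il ORG".split()
    freqs = Counter(tokens)  # in practice, frequencies over the whole dataset
    print(bleach(tokens, freqs))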

