
Cross-genre and within-genre author profiling:
Predicting the gender & native language of users on Medium and Twitter

M.M. Kuijper (s2380250)
Master thesis Information Science
August 30, 2018


ABSTRACT

Author profiling is the task of predicting author characteristics based on a given text. It is used increasingly often in (digital) marketing and in forensics. The aim of this thesis is to investigate the author profiling tasks of native language and gender prediction using Twitter and Medium data, in both within-genre and cross-genre settings. Both statistical and neural models, as well as multi-task learning, are used and their performance is compared. The two author profiling tasks are each subdivided into four subtasks (2 within-genre and 2 cross-genre tasks). The best model for each task always outperformed the random baseline. Cross-genre tasks still achieved lower accuracies than within-genre tasks; however, multi-task learning helped in cross-genre gender prediction and outperformed the other neural models in some of the tasks as well, which makes it a promising avenue for further research. Moreover, SVMs worked best in three of the eight subtasks, all of which were within-genre tasks. Other models that performed well on at least one of the eight tasks were Decision Trees, Gradient Boosting and KNN. This suggests that, when using relatively small datasets, we should not overlook these alternatives to the SVM as powerful models for NLP tasks.


CONTENTS

Abstract

1 Introduction
  1.1 Motivation
  1.2 Research questions
  1.3 Thesis overview

2 Related work
  2.1 Author profiling
  2.2 Native language identification
  2.3 Cross-genre author profiling
  2.4 Author profiling and multi-task learning

3 Approach
  3.1 Models
    3.1.1 Support Vector Machines
    3.1.2 Decision Trees
    3.1.3 K-Nearest Neighbors
    3.1.4 Gradient Boosting
    3.1.5 Neural networks (Perceptron, LSTM, BiLSTM)
  3.2 Techniques
  3.3 Tools

4 Data and material
  4.1 Collection
  4.2 Annotation
  4.3 Processing
  4.4 Initial linguistic pre-study

5 Experiments
  5.1 Statistical Machine Learning Models
    5.1.1 SVM (Parameters, Features)
    5.1.2 Decision Trees (Parameters, Features)
    5.1.3 Gradient Boosting (Parameters, Features)
    5.1.4 K-Nearest Neighbors (Parameters, Features)
  5.2 Neural Machine Learning Models
    5.2.1 Word embeddings features experiment
    5.2.2 LSTM
    5.2.3 BiLSTM
  5.3 Model combination
    5.3.1 Stacking SVM probabilities
    5.3.2 Ensembles
  5.4 Task dependency experiment
  5.5 Multi-task learning
  5.6 Feature analysis

6 Results and discussion
  6.1 Statistical and Neural models
    6.1.1 Results SVM
    6.1.2 Final results best SVM models (dev and test)
    6.1.3 Results other statistical models
    6.1.4 Results word embedding experiment
    6.1.5 Results LSTM
    6.1.6 Results BiLSTM
    6.1.7 Best results neural models
    6.1.8 Results ensemble
    6.1.9 Cross validation and mixed models
  6.2 Multi-task Learning Experiments
    6.2.1 Results task dependency experiment
    6.2.2 Multi-task learning
  6.3 Final conclusions models
  6.4 Feature analysis

7 Conclusion

Appendices


1 INTRODUCTION

Author profiling is the task of (automatically) inferring certain author characteristics from a given text. The field has been around for more than a hundred years, but only recently have attempts been made to automatically assign these characteristics with the help of machine learning approaches. In the early days, the field was mostly focused on manually classifying the works of famous authors, such as Shakespeare and Marlowe, using features such as word and sentence length (Stamatatos, 2009). Nowadays, it is used, amongst other things, to classify social media texts for marketing purposes and criminal evidence for forensic investigations.

Native language identification is a subtask of author profiling in which the L1 (native language) of a subject is predicted based on their use of their L2 (second language). In most research, including this thesis, the L2 is English, but studies have also used Chinese, Finnish and Norwegian as L2, amongst others (Malmasi and Dras, 2014a,b; Malmasi et al., 2015). The main question in this task is whether language transfer, the (subtle or not so subtle) linguistic characteristics carried over from the L1 to the L2, is evident and can be used to solve the problem, or whether other, more coarse computational features are more effective. The latter has often been shown to be the case.

Gender identification is also a subtask of author profiling; it aims to detect the gender of the author of a text through their use of stylistic features. Gender is one of the most commonly researched author profiling subtasks and has often been included in the PAN shared tasks. In this task, it is interesting to investigate whether the saying 'Men are from Mars, women are from Venus' is true, in the sense that men and women speak in such different ways that they can be told apart from their writings alone; in other words, that there is such a thing as a 'genderlect' (Tannen, 2014). Commonly cited features of 'female speak', used more by women than by men, are for instance hedges, minimal responses and questions, whereas a competitive, even adversarial style is said to be more common in male writers or speakers (Coates, 2004).

This thesis will look at both within-genre and cross-genre classification. Within-genre classification uses data from one genre only, whereas in cross-genre classification the machine learning model is trained on one domain and tested on another. This is used to discover whether models port to other domains easily, which would facilitate their use in production environments with variable sources of data. Cross-genre studies in NLP often find that within-genre and cross-genre results differ extensively: cross-genre models usually see a steep decline in accuracy. This may partly be due to the relative novelty of this particular research setting.

1.1 Motivation

Being able to predict the gender and native language of an author would be of help to digital forensics, where this knowledge can be used to track down criminals. Moreover, NLI might provide important insights for second language acquisition research and teaching. Understanding whether there are differences between distinctive features marking gender in an L2 for speakers of different L1s might provide insights into the cultural background and traditions of the L1. Finally, working with data from multiple genres is also beneficial because in reality we often do not just have data from one genre at our disposal. Knowing whether systems work cross-genre or are only valuable with in-genre data is therefore important. Building solid cross-genre systems would also overcome problems pertaining to limited data.

1.2 Research questions

The main research question that will be answered in the present study is:

To what extent can we build a system that detects the gender and native language of a speaker? This question can be subdivided into the following four subquestions:

1. To what extent can we build a system that works well across genres?

2. What is the best way to model multiple author characteristics? Is a multi-task setup beneficial or is it better to model gender and native language separately?

3. What are the distinctive features that distinguish people by their gender or native language, and are these distinctive features different per genre?

4. Can we distinguish different distinctive characteristics that characterise gender per native language? Similarly, are native language characteristics expressed differently per gender?

1.3 Thesis overview

In Chapter 2, related work will be described and its relevance will be made clear. Subsequently, in Chapter 3, the general approach will be described. Furthermore, in Chapter 4, the data collection, annotation and processing steps will be demonstrated, as well as a preliminary linguistic study. In Chapter 5, the experiments of this study will be described. Next, Chapter 6 will discuss the results of this study, followed by Chapter 7, in which conclusions will be drawn and future work will be discussed.


2 RELATED WORK

2.1 Author profiling

The Author Profiling Task at PAN 2017 investigated the prediction of language variety (in English, Spanish, Portuguese and Arabic) and gender (Rangel et al., 2017). Twenty-two teams participated and the best overall results were obtained using an SVM trained with combinations of character n-grams and tf-idf word n-grams. The runner-up used Logistic Regression. The results obtained during this shared task suggest that even in 2017, SVMs and logistic regression are still very much state-of-the-art, with the best teams using these models successfully and obtaining high accuracies (Rangel et al., 2017). In addition, the best team for Portuguese, which achieved the best gender accuracy score of the competition, used neural networks. This suggests that neural networks should not be underestimated and might have a positive effect on performance in gender detection tasks.

2.2 Native language identification

Native-language Identification (NLI) deals with the automatic identification of the native language (L1) of an individual based on their writing or speech in another language (L2) (Malmasi et al., 2017). NLI research attempts to uncover language use patterns common to groups of speakers that share the same native language (Malmasi et al., 2017). In 2017, the NLI Shared Task was held, which comprised several different native-language identification tasks. One of the tasks was based on written data (essays), and another task combined both written and spoken data. The TOEFL iBT corpus was used for both the written and spoken text. The data covered 11 native languages: Arabic, Chinese, French, German, Hindi, Italian, Japanese, Korean, Spanish, Telugu, and Turkish. 19 teams participated in the task. For the essay-only task, ItaliaNLP Lab had the highest f1-score with 0.8818; they used a stacked classifier with lexical and syntactic features. They were followed by CIC-FBK with an f1-score of 0.8808, who used an SVM with log-entropy weighted n-grams and syntactic features. The third place was obtained by Groningen with an f1-score of 0.8756, using a Linear SVM with character n-grams (1-9). For the task combining written and spoken data, the best team had an f1-score of 0.9319; they used character-level string kernels and i-vector features. The overall results suggest that techniques such as stacking are very effective, and lexical n-grams were considered the best single feature type of the competition (Malmasi et al., 2017). Moreover, it was found that traditional classifier models continue to perform well, as they consistently outperformed deep learning approaches (Malmasi et al., 2017). Finally, the task which combined both writing and speech yielded the best results, suggesting that combining written and spoken data gives the highest performance in native language identification (Malmasi et al., 2017). This also supports the hypothesis that having more data from an author is better, regardless of which modality that data is in.


2.3 Cross-genre author profiling

Medvedeva et al. (2017) investigated whether the Twitter-trained model that won the PAN 2016 cross-genre task on author profiling was truly cross-genre and could therefore also yield good results on other test sets (non-blogs). The best system was created by GronUP and was an SVM. They also investigated whether the model, trained on different non-Twitter data, would perform well on the test set. They used several datasets to test the cross-genre stability of their system. They found that their model also performed well on other test sets, as long as the genre gap was not too large (Medvedeva et al., 2017). This was supported by good performance on blogs and worse performance on reviews. They stated that if genres are similar enough, it might be better to create a larger cross-genre training set than smaller in-genre ones, so as to boost performance with more data (Medvedeva et al., 2017). They also found that training on other social media data yielded worse results than training on Twitter data and testing on blogs. The social media data was, however, of low quality, so they concluded that these results were too unreliable to draw any conclusions from (Medvedeva et al., 2017).

Malmasi and Dras (2015) conducted a study in which they performed native language identification on different corpora to investigate whether there are corpus- and/or genre-independent language transfer features. They conducted within-corpus NLI and evaluated the results against results of their model on other existing corpora. Then they conducted cross-corpus NLI to investigate how effective corpus-independent features are, if they exist at all. They compared against the TOEFL11 corpus, which has 9 languages in common with their corpus. Their approach of both testing the within-corpus efficiency of certain linguistic features and then testing it cross-corpus is interesting and will also be used in the present study. Furthermore, in their case study they describe how they effectively distinguished Japanese learners by means of sentence length, word order and POS order. These features will also be used in the present study. As for their results, their within-genre cross-validated system had an accuracy of 0.650 on 11-class classification, whilst the two cross-genre experiments with 9 classes scored much lower, at 0.335 and 0.284. The cross-genre experiments use only 9 classes because the two corpora they used only had 9 of the 11 classes in common.

Ucelay et al. (2015) conducted a cross-domain author profiling study in Spanish on formal and informal texts. The PAN-CLEF 2013 and 2014 corpora as well as the SpanText corpus were used. SpanText is formal, while the other two corpora also contain informal text (or are entirely informal). The texts were classified for gender and age. To do this, they used bag-of-words features and character trigrams as features. They also compared second order attributes, boolean and tf-idf features, and social profile features (sistema de perfile). Second Order Attributes worked best as features, but tf-idf features also worked well. The present study will also use (amongst other things) tf-idf features (including character trigrams).

2.4 Author profiling and multi-task learning

Wang et al. (2016) investigated the benefits of multi-task learning on gender and age prediction on the Chinese social media platform Weibo. They used a Multi-task Convolutional Neural Network (MTCNN) to predict gender and age simultaneously. Their model significantly outperformed state-of-the-art baselines in their experimental setup. The lowest layer of the neural network learns the general representation shared between gender and age, whilst the final layers generate features that are unique and task-specific, resulting in task-specific output layers. Their parameters were optimized using grid search, after which they set the size of word embeddings to 50 and the number of filters to 250. They used Adam as optimizer and mini-batch gradient descent with size 32. The hyperparameter C was set to 120 to balance the two loss functions. They experimented both with character-level and word-level representations, which is interesting considering the segmentation difficulties present in the Chinese language. They used three models for comparison: a majority or average class (depending on the task) as baseline, an SVM trained on unigram features, and two CNNs, one for each task. Both at character and word level, their system outperforms all three comparison models.

Benton, Mitchell and Hovy (2017) investigated the prediction of mental health and gender in a multi-task learning setting. The main tasks that the paper investigates are the prediction of neurotypicality (the absence of any mental health conditions), anxiety, depression, suicide attempt, eating disorder, panic attacks, schizophrenia, bipolar disorder, and post-traumatic stress disorder (PTSD). In addition, the above-mentioned tasks as well as gender also serve as auxiliary tasks. The idea is that there is an interaction between variables which is lost when the tasks are modeled separately but which may boost accuracy when they are modeled together. It has been found that certain conditions often occur together, and there is also a correlation with demographic factors (such as gender). Using the multi-task learning setting, data and interactions from different variables can be shared across predictions. The multi-task learning models (with and without gender as auxiliary task; 'without' only uses the other mental conditions as auxiliaries) often outperform the single-task learning and Logistic Regression models in terms of AUC. This is especially evident in the tasks for which there is little data (PTSD, bipolar disorder, and panic attacks). Here, the AUC is improved by also predicting other related or regularly co-occurring conditions such as depression and anxiety, for which there is more data. They also found that leveraging gender as auxiliary task generated more predictive models, although the improvements were not statistically significant. What is interesting, and arguably not in favor of using this method in my study, is the fact that the multi-task learning model did not perform as well on gender as the other two (control) models.


3 APPROACH

The aim of this thesis is to predict the native language and gender of an author in two different settings: a cross-genre setting, in which the training data is from one genre, and the test data from the other, or within-genre, where all data is from one genre. This allows for the subdivision of the problem into 8 subtasks, namely:

1. Cross-genre native language prediction, training on Twitter and testing on Medium,
2. Cross-genre native language prediction, training on Medium and testing on Twitter,
3. Within-genre native language prediction of Twitter data only,
4. Within-genre native language prediction of Medium data only,
5. Cross-genre gender prediction, training on Twitter and testing on Medium,
6. Cross-genre gender prediction, training on Medium and testing on Twitter,
7. Within-genre gender prediction on Twitter data only, and finally
8. Within-genre gender prediction on Medium data only.

Before making any predictions, first a couple of other steps needed to be taken care of, namely:

1. Collect and annotate data
2. Filter out any bad instances (incorrect class labels, non-person accounts, authors with no text)
3. Clean data
4. Preliminary linguistic study

These steps will be described in the following chapter. Once these initial steps are done, statistical and neural machine learning models can be built for each of the 8 tasks. The statistical models will take tf-idf feature vectors as input and will predict either gender or native language as output. The neural models will take word embeddings as input; the dimensions of these embeddings and the data on which they are trained are decided based on experimentation with various types of embeddings: pre-trained embeddings, learner data embeddings, learner embeddings initialized on the pre-trained feature space, a concatenation of learner and regular embeddings, a PCA-reduced version of that concatenation, and retrofitted embeddings. The output is again either the gender or the native language of the author. Furthermore, multi-task learning is used to determine whether one task can learn from the other task, using a shared loss function. In addition, averaging and ensembling techniques will be used to combine neural models into combination models, which are hypothesized to yield better accuracies. The performance of the different models will be compared on a per-task basis. In addition, the within-genre and cross-genre performance will be compared. Finally, feature analysis will be done to determine which features distinguish the different classes, both for the full model and by performing task 1 within each class i of task 2 and vice versa.


3.1 Models

In order to perform the eight subtasks and to answer the research questions, a selection of models was used. Both statistical and neural machine learning models were built. The specific models that were used are described below.

3.1.1 Support Vector Machines

Support Vector Machines (SVMs) are a statistical machine learning algorithm that tries to find the optimal hyperplane with which to separate the data. The so-called 'support vectors' are the data points closest to the hyperplane; the algorithm tries to create the largest possible margin between the support vectors and the hyperplane. The C value dictates how the hyperplane is established. With a low value for C, the decision boundary is a smooth line, which may not classify all points in the training data correctly but is more general. For high values of C, the model tries to classify everything correctly, resulting in a more irregular, squiggly boundary. High values of C might therefore overfit, because the boundary is tuned very specifically to the training data. Gamma determines how much influence a single training example has on deciding where to draw the hyperplane: high values of gamma only take nearby points into consideration, while low values of gamma also take data points further away into consideration. A high value of gamma creates a squiggly boundary; a low value generalizes more and creates a more linear boundary. When the data is not linearly separable, SVMs can map the data into a higher-dimensional space (the kernel trick) so the data becomes linearly separable.
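As a minimal illustration of these hyperparameters (a generic Scikit-Learn sketch on toy data, not code from this thesis):

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Toy data purely to illustrate the hyperparameters; the thesis uses text features instead.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Low C: smoother, more general boundary; high C: fits the training data more tightly.
# gamma (rbf kernel only): high values let only nearby points influence the boundary,
# low values also take distant points into account.
smooth_clf = SVC(kernel="rbf", C=1, gamma=0.01).fit(X, y)
tight_clf = SVC(kernel="rbf", C=100, gamma=10).fit(X, y)
linear_clf = SVC(kernel="linear", C=1).fit(X, y)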

3.1.2 Decision Trees

Decision trees are a statistical machine learning algorithm which separates data in a tree-like manner. At the decision nodes, variable values are used to split the data into branches; leaf nodes are the end nodes at which a prediction is found. Entropy is used to calculate the 'purity' of a sample at a particular node: if the data at a particular node all belongs to the same class, the entropy is zero; if the data contains an equal number of instances from all classes, the entropy is maximal (one, in the two-class case). Decision trees are built by splitting the data on various variables and variable values and calculating the information gain (reduction of entropy) of each option in comparison to the parent node. The higher the information gain (and the lower the resulting entropy), the better. The number of samples per leaf and the depth of the tree can be used (amongst other things) to optimize the tree. Usually a simple tree is preferable to a more complex one (if accuracy is similar), because complex trees may overfit the training data.
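A small sketch of the entropy and information-gain computation described above (a generic illustration, not code from the thesis):

from collections import Counter
from math import log2

def entropy(labels):
    # Entropy of a list of class labels: 0 for a pure node, 1 for a 50/50 binary split.
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(parent, children):
    # Reduction in entropy obtained by splitting `parent` into the `children` subsets.
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Example: a perfectly separating split on a mixed node gains a full bit of information.
parent = ["male", "male", "male", "female", "female", "female"]
print(information_gain(parent, [["male"] * 3, ["female"] * 3]))  # 1.0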

3.1.3 K-Nearest Neighbors

In K-Nearest Neighbors, the class of a data point is predicted by taking the most common class of the K-nearest (training) data points. So if K = 3, the three data points closest to your test instance will determine which class it will get.


3.1.4 Gradient Boosting

Gradient Boosting is a machine learning algorithm which iteratively builds a set of ’weak’ decision tree models which together form a stronger ensemble model. Its underlying ensemble mechanism is boosting. Boosting is an ensemble technique which sequentially builds better models by using the predictions of the previous model to make more informed decisions in the next. It adds more weight to incorrectly predicted samples in the next model for instance. Finally, it combines all of these models and gives a certain weight to each one of them to combine it into one single predictor.

3.1.5 Neural networks

Perceptron

A perceptron is a feed-forward neural network which takes inputs and feeds them through a number of hidden layers. The input is multiplied by its weights and a bias is added; once it has passed through all of the hidden layers, it is fed to an activation function which produces the output layer, which in turn returns a prediction.

LSTM

Long Short-Term Memory (LSTM) networks are a type of Recurrent Neural Network (RNN). RNNs take time and sequence information into account and are therefore suitable for text data, in which previous context is important. Humans, after all, take the previous context into account when responding to or understanding an utterance. LSTMs are essentially an extension of RNNs in that they can selectively learn and remember previous context for an extended period of time, making them more suitable when data that is further in the past needs to be incorporated.

BiLSTM

A BiLSTM is an LSTM which goes both ways: when dealing with text data, it processes a sequence of inputs (e.g. a sentence) both from beginning to end and from end to beginning. This way, all of the sequential information is available in both directions. At first this might seem odd, because as we write a sentence, future data is not yet present. Yet when understanding text, we can often skim certain words merely because their context words (either before or after the target word) are clear enough. Similarly, in some cases we only understand a certain concept after reading the entire paragraph.

3.2 Techniques

In the following paragraphs, the various techniques that were used in this thesis will be explained.

stacking Stacking is the process of taking the probability outputs of two models and concatenating these probabilities into a new feature vector. The number of features is equal to the number of labels times the number of stacked submodels. These new feature vectors are then fed into a new machine learning model, which is hypothesized to give better results because it captures the information from both submodels (in the case of two submodels) whilst creating a less sparse feature space.
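A minimal sketch of this idea (illustrative only; the data, feature representations and kernels are placeholders, not the exact setup used later in the thesis):

import numpy as np
from sklearn.svm import SVC

# Toy stand-ins for two feature representations of the same instances
# (e.g. tf-idf n-grams and POS-tag features) and their gold labels.
rng = np.random.RandomState(0)
X_ngrams, X_postags = rng.rand(20, 50), rng.rand(20, 30)
y = np.array(["male", "female"] * 10)

model_a = SVC(kernel="linear", probability=True).fit(X_ngrams, y)
model_b = SVC(kernel="linear", probability=True).fit(X_postags, y)

# Concatenate the per-class probabilities of both submodels into a new, dense
# feature vector: number of labels * number of submodels features per instance.
stacked = np.hstack([model_a.predict_proba(X_ngrams),
                     model_b.predict_proba(X_postags)])

# The stacked vectors are then fed into a fresh meta-classifier.
meta_clf = SVC(kernel="linear").fit(stacked, y)

In practice the submodel probabilities would be produced on held-out data, so that the meta-classifier does not learn from overly optimistic training probabilities.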

ensemble averaging Ensemble averaging (henceforth used interchangeably with 'ensembling') is the process of taking the predictions (probabilities) of two or more models and averaging them, after which the new output vectors can be compared to the actual labels. The resulting predictions are often less complex than those of any of the single models, which may reduce overfitting. Moreover, ensembles often perform even better than any of the individual models.

principal component analysis Principal Component Analysis (PCA) is a dimensionality reduction technique. Reducing dimensions is useful in a highly dimensional space, because unnecessary noise and redundant information are discarded, which is hypothesized to improve performance. If a feature space has 3 dimensions, there will be three eigenvectors; suppose two of these have large eigenvalues and one has an eigenvalue close to zero. The axes are then rearranged to run along the eigenvectors, with the exception of the third eigenvector: because its eigenvalue is close to zero, it does not store much information and does not need to be represented. Hence, the three-dimensional feature space can be reduced to two dimensions. In this thesis, PCA will be used to reduce the dimensionality of word embedding vectors with high (50 or 200) dimensions.

retrofitting Retrofitting is an unsupervised technique used in building word embeddings in which an already trained word embedding model is refitted using a list of semantically related words. This is done to push these words closer together in the feature space, thereby refining the word embeddings, which is useful when the original embeddings are not yet able to capture their relatedness. The inspiration to use this technique came from a paper by Faruqui et al. (2015), in which word embeddings are retrofitted using semantic lexicons. Their approach outperformed other techniques of adding semantic lexicons to word vector spaces (namely using constraints among words as a regularization term). In my retrofitting experiment, Faruqui et al.'s (2015) retrofitting package was used (albeit converted to Python 3) to perform the retrofitting.

concatenation of embeddings Concatenation of embeddings is the process of attaching the vectors of multiple sets of word embeddings to each other, in such a way that in the case of 2 sets of 100-dimensional word embeddings, the new set of word embeddings is 200-dimensional. For missing terms (i.e. terms not present in both sets of word embeddings), the original 100-dimensional vector can be supplemented with an average embedding for the embedding set that is missing the term.
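A sketch of this concatenation, together with the PCA reduction mentioned above (assuming the two embedding sets are plain word-to-vector dictionaries; names and dimensions are illustrative):

import numpy as np
from sklearn.decomposition import PCA

def concatenate_embeddings(vocab, emb_a, emb_b):
    # Attach the vectors of two embedding sets; back off to the mean vector of a set
    # when it does not contain the word, so the dimensions always match.
    mean_a = np.mean(list(emb_a.values()), axis=0)
    mean_b = np.mean(list(emb_b.values()), axis=0)
    return {w: np.concatenate([emb_a.get(w, mean_a), emb_b.get(w, mean_b)]) for w in vocab}

def pca_reduce(embeddings, n_components=100):
    # Optionally reduce the concatenated vectors (e.g. 200 dims) back to the original size.
    words = list(embeddings)
    reduced = PCA(n_components=n_components).fit_transform(np.vstack([embeddings[w] for w in words]))
    return dict(zip(words, reduced))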

multi-task learning Multi-task learning is a technique to solve multiple tasks simultaneously by making use of the related characteristics of the tasks. It uses shared and task-specific layers with a common loss function, so as to benefit from information from both tasks whilst also improving generalization and preventing overfitting.
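A minimal Keras sketch of such a hard-parameter-sharing setup (the layer sizes, sequence length and vocabulary size are illustrative assumptions, not the exact architecture used in this thesis):

from keras.layers import Input, Embedding, LSTM, Dense
from keras.models import Model

# Assumed sizes: 10,000-word vocabulary, 100-dim embeddings, 300-token input texts.
inputs = Input(shape=(300,))
shared = Embedding(input_dim=10000, output_dim=100)(inputs)
shared = LSTM(128)(shared)  # layers up to here are shared by both tasks

gender_out = Dense(2, activation="softmax", name="gender")(shared)
language_out = Dense(9, activation="softmax", name="native_language")(shared)

model = Model(inputs=inputs, outputs=[gender_out, language_out])
# The losses of both outputs are summed into one common loss, so gradients from
# either task update the shared layers.
model.compile(optimizer="adam",
              loss={"gender": "categorical_crossentropy",
                    "native_language": "categorical_crossentropy"})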


3.3 Tools

All models were created using Scikit-Learn (Pedregosa et al., 2011) or Keras (Chollet et al., 2015), using Spacy (Honnibal and Montani, 2017), NLTK (Loper and Bird, 2002) and Gensim (Řehůřek and Sojka, 2010) as complementary packages for tasks such as preprocessing, POS-tagging, dependency parsing and creating word embeddings. Moreover, pandas (2011) and numpy (Oliphant, 2007) were frequently used for feature vector manipulation, storing and retrieving datasets, and making calculations.


4 DATA AND MATERIAL

4.1 Collection

Twitter and Medium were scraped for messages by users whose native language was Dutch, German, Italian, Spanish, Portuguese, Russian, Polish, Persian or Hindi and whose gender could be easily determined, either using the genderguesser package or by hand. Since some languages are spoken in various countries, a target country (and sometimes also target cities) was established to limit the amount of inter-L1 variance. For Dutch, the target country was the Netherlands, for German it was Germany, for Italian it was Italy, for Spanish it was Spain, with the exception of the Basque and Catalan regions (so I explicitly refrained from targeting ’from Barcelona’ for instance), for Portuguese it was Portugal, for Russian it was Russia, for Polish it was Poland, for Persian it was Iran and for Hindi it was the New Delhi region of India (this city was always used as target word).

The BeautifulSoup (https://www.crummy.com/software/BeautifulSoup/) and Requests (http://docs.python-requests.org/en/master/) libraries were used to scrape Medium, and Twitter was queried using the Twitter Python API. The API allowed for the retrieval of up to 200 messages per user. Retweets were not retrieved. A search query such as 'from X' or 'living in X' was used, targeting the user description, where X is the target country or a city in a target country. This method is reminiscent of the technique used in Emmery et al.'s paper (2017), in which gender was automatically inferred using simple queries such as "I'm a boy, girl, man, woman", with additional methods which account for reversing the gender in cases like 'According to Google, I'm a guy', which suggests the user is female (Emmery et al., 2017).

To determine the gender of those users whose description was determined to fall into one of the language categories, the genderguesser Python package was used. Only for Persian did the gender have to be established manually, because the package was not compatible with names in that language. In addition, some names were ambiguous or deemed unknown by genderguesser and also had to be manually examined. The method used by Emmery et al. (2017) was not used to determine the gender of the users, because the subset of users retrieved would be too small. The user set is already narrowed down to only those users who have a description which contains some reference to their native language; the chance that they also refer to their gender is relatively small and would likely result in only very few matches.
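Assuming the package referred to is the gender-guesser package available on PyPI, its use looks roughly as follows (a sketch, not code from the thesis):

import gender_guesser.detector as gender

detector = gender.Detector()
for first_name in ["Maria", "Jan", "Sasha"]:
    # get_gender returns 'male', 'female', 'mostly_male', 'mostly_female',
    # 'andy' (ambiguous) or 'unknown'; ambiguous and unknown names were
    # examined manually in this study.
    print(first_name, detector.get_gender(first_name))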

The automatically selected usernames, genders, native languages and user descriptions were saved in an Excel file and the descriptions were then manually examined for errors. The search functions on both websites used fuzzy search, which also allowed similar descriptions to pass through. So some users with different native languages passed through, or foreigners living in the target country, or user accounts for companies or other establishments. Users for whom the gender or native language could not be established were also filtered out, as well as users who were from different countries or who were from X living in Y, where X is not a target country and Y is, because these people are not native speakers. Similarly, native speakers living in a non-target location (especially Anglophone countries) were filtered out.


This was done because it is usually unknown how long they have been living there and their proficiency might have been strongly influenced by the L2, at least more so than a non-native speaker of English living in their home country. After all of the false positives were filtered out, the user data (their texts) could be fetched.

For Medium, the approved users' timeline of posts could be fetched by going to 'https://medium.com/feed/@USER', where USER is the username of the user. This yields an RSS feed of the user's posts in XML format, which could easily be parsed in Python. The posts were separated from the metadata, HTML tags were removed, and the language of each post was determined; if the language detector (a Python package) concluded that it was written in English, the resulting post was written to file.

For Twitter, messages were only saved if they were not a retweet, were longer than 10 words, and the language detector detected the language to be English.

In figures 1 and 2, the counts per native language and per gender for each of the two genres are displayed. As you can see, there is quite a bit of variation. On Twitter, most data is available for Dutch and German, whilst for Persian and Spanish, the lowest quantities can be found. On Medium, most data is available for Dutch and Russian, whilst for Italian and Portuguese the lowest quantities can be found.


Figure 2: Medium counts by native language and gender

4.2 Annotation

The labels that were used were the native languages of the authors and their gender. These labels were retrieved, as mentioned above, by querying their Twitter and Medium descriptions, by looking up their gender with the gender guesser or by looking at their profile picture.

4.3 Processing

The following preprocessing steps were used: 1) HTML markup was removed, 2) URLs were replaced by 'URL', 3) numbers were replaced by 'NUM', 4) the cleaned text was tokenized using NLTK's TweetTokenizer with reduce_len = True.
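A sketch of these four steps (the exact regular expressions are my own illustrative choices, not necessarily those used in the thesis):

import re
from bs4 import BeautifulSoup
from nltk.tokenize import TweetTokenizer

def preprocess(text):
    text = BeautifulSoup(text, "html.parser").get_text(" ")   # 1) remove HTML markup
    text = re.sub(r"https?://\S+|www\.\S+", "URL", text)      # 2) replace URLs by 'URL'
    text = re.sub(r"\d+", "NUM", text)                        # 3) replace numbers by 'NUM'
    return TweetTokenizer(reduce_len=True).tokenize(text)     # 4) tokenize

print(preprocess("Check https://example.com !!! I have 2 cats, soooooo cute"))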

I chose to retain capitalization, to potentially use this as a feature for languages rich in capitalized words (German). I also chose not to remove stopwords, because these might also provide insight into choice of words. I chose not to remove punctuation because this might also serve as a feature. Note that for the models using word embeddings, all data was converted to lowercase because otherwise the data would get all the more sparse (especially for the learner embeddings).

Moreover, the data was split into train, development and test sets. For the cross-genre tasks, the train set contains all instances from one genre, and the test and development set each contain half of the instances from the other genre. So for instance in the cross-genre native language Twitter-Medium task, the train set consists only of Twitter data, whilst the development and test set each make up half of the Medium data. For the within-genre tasks, the train set made up 66.67% of all instances from that genre, whilst the development and test set each contained 16.67% of the instances.

4.4 Initial linguistic pre-study

In order to see how the linguistic features that would be used were distributed among the languages and genders, an investigative pre-study was done. The average use per 300-word text of all linguistic features was calculated and graphed. Note that these are averaged over all 300-word sequences; therefore, although they do provide a general outlook on the distribution and differences between classes, they do not say anything about an individual message which has to be classified later on.

The native language features that were used are sentence length, articles, capitalization, sentiment, multiple negatives, subject-verb-object word order (SVO) and subject-object-verb word order (SOV). I chose these two word order features because the actual word orders of the nine native languages fall into these two categories. The process of obtaining the word order is explained in Section 5.1.1 (under linguistic features). The gender features that were used are swearwords, regular color words, special color words, minimal responses, exclamation marks, question marks, sentiment, adjectives, fillers and hedges.

The native language features and their counts are graphed in Figure 3. We can see that sentence length varies quite a bit across the languages on both Twitter and Medium. Interestingly, the order of average sentence lengths on Twitter and Medium is not the same, with the longest sentences on Twitter belonging to Hindi, Italian and Spanish, whilst the longest sentences on Medium belong to Persian and Russian. This may have something to do with the format of the two websites: Twitter is more informal than Medium. The sentences are consequently longer (and more irregular) on Twitter than on Medium.

The variance in number of articles is relatively small. Spanish has the lowest number of articles on average in both genres. This is not consistent with the hypothesis that L2 speakers of English whose L1 does not use articles will use fewer articles in their L2 as well. It does have to be said that the three languages for which this is the case consistently occur as valleys in the graphs of both Twitter and Medium.

As for capitalization, German performs higher than average on Medium, but has one of the lowest average counts on Twitter. Perhaps this is due to hypercorrection, in which the speaker consciously (or even unconsciously) avoids the use of a transfer feature highly characteristic of their native language, sometimes even to the extent that they stop using that feature when it actually is deemed correct or necessary.

The sentiment across native languages is quite varied, with relatively negative messages by Persian natives on both Twitter and Medium and more positive messages by Hindi and Polish native speakers. The sentiment ranges from -1 to 1, so the sentiment is generally neutral to slightly positive in all classes on both Twitter and Medium. The relative negativity by the Persian natives can potentially be explained by a high number of activists in the dataset, sharing crimes against humanity and other atrocities, contributing to a more negative sentiment.

The multiple negatives feature will be used to see if Russian natives use double negatives even in their L2. From the Twitter and Medium graphs, we can actually see peaks for Russian in relation to this feature, demonstrating that the Russian natives do use this feature more often than the other native speakers in the dataset. It should be noted, however, that the number of occurrences, even for the Russian group, is very low: around 0.10 instances per 300-word sequence for the Russian group on Twitter (so 1 in 10 texts contains this feature) and 0.13 instances per 300-word sequence for the Russian group on Medium (so around 1 in 8 texts contains this feature).

For the word order features, there is quite some variance in the SVO use, but the SOV use is relatively stable (and much lower). Because the SOV scores are so low and close to each other, it is hard to describe any peaks and whether these peaks are present in typical SOV languages like Hindi and Persian, or in languages which use SOV in subordinate clauses (Dutch and German). We do see a valley in SVO use by Persian authors on Medium, suggesting they prefer another word order. However, on Twitter, Persian authors actually show a peak in SVO use.

In Figures 4 and 5, the gender features on Twitter and Medium are illustrated. On Twitter, the features demonstrate little variance, with the exception of regular color use, which is demonstrably lower in the men group. On Medium, there is a little more variance. Women swear slightly less and ask more questions, in line with stereotypical behaviour. Yet women also use slightly fewer minimal responses, which is not consistent with the stereotypes, although the difference is very small. Women use slightly more fillers and hedges, but the differences are truly minimal and therefore do not say much about stereotypical use of language. Again, women use regular color words quite a bit more than men, so this feature may be helpful in distinguishing gender on both platforms, although it is a rare feature which occurs less than once on average per 300-word sequence.


Figure 4: Linguistic gender features per 300-word text on Twitter


5 EXPERIMENTS

The following main research question will be investigated and answered: To what extent can we build a system that detects the gender and native language of a speaker? This research question will be answered by building a machine learning pipeline. Different algorithms and architectures will be built and different features will be used to determine which model works best.

The main research question can be subdivided into four subquestions:

1. To what extent can we build a system that works well across genres? This subquestion will be answered by building both cross-genre and within-genre models and comparing their performance.

2. What is the best way to model multiple author characteristics? This subquestion will be answered by looking at various ways to model multiple author characteristics: i) by modeling the characteristics separately, and ii) by modeling them together in a multi-task learning setup.

3. What are the distinctive features that distinguish people by their gender or native language, and are these distinctive features different per genre? This subquestion will be answered by performing feature analysis to discover which features perform best and trying to determine why.

4. Can we distinguish different distinctive characteristics that characterize gender per native language? Similarly, are native language characteristics expressed differently per gender? This subquestion will be answered by performing predictions of gender per language and comparing the best features per language, as well as by doing predictions of native language per gender and comparing the best features.

5.1 Statistical machine learning models

First of all, various statistical machine learning models were used. Usually, when using relatively small datasets, statistical algorithms work better than neural models. Especially SVMs are quite robust to smaller datasets and work very well with text data. That is why this was the first model that was tried and optimized. Other statistical models may be less obvious choices and may heavily depend on the data and task at hand; therefore, a selection of statistical models was made and they were run for each of the 8 tasks using their default parameter settings. Based on these results, three models were chosen to optimize further: Decision Trees, Gradient Boosting and K-Nearest Neighbors. All parameter optimizations (grid searches and other parameter tunings) were performed on the development set.


5.1.1 SVM

Parameters

The parameters that were optimized for the SVM models are shown in Table 1 below. Note that the rbf kernel was quickly dismissed as an option for the models trained on n-gram and POS-tag features, due to its low initial performance. However, it was used as an option in the POS-tag distribution and linguistic features models.

N-gram-range    Analyzer    C     Kernel
(1,1)-(1,8)     word        1     linear
(2,3)           char        10    rbf

Table 1: SVM parameters that were optimized

Features

Tf-idf n-grams

In order to give those words which are less frequent overall but relatively frequent in a certain document more weight, tf-idf n-gram features were used instead of bag-of-words n-grams, which use ordinary counts as features. The words that are given more weight are hypothesized to include gender-specific terms or native-language-specific words (native-language-specific errors, for instance) which may set the classes apart.
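For illustration, such features can be produced with Scikit-Learn's TfidfVectorizer and fed into an SVM; the concrete settings below are placeholders, since the actual n-gram range and analyzer are the ones searched in Table 1:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),  # character 2- and 3-grams
    ("svm", SVC(kernel="linear", C=1)),
])
# pipeline.fit(train_texts, train_labels); predictions = pipeline.predict(dev_texts)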

POS-tags

To determine whether parts of speech are used distinctively in the various languages and among genders, POS-tag data was used as features. Two different methods were used (not including the POS distribution method demonstrated below): 1) using all POS-tags as features, 2) using POS-tags for nouns and verbs, as they are mostly content words and not really indicative of the underlying linguistic structure, and using the raw text for the other parts of speech, e.g. function words, which may be indicative of the underlying linguistic structure. For 1) a sentence like 'Mary ate an apple during her lunchbreak' would be transformed into 'NN VBN DT NN IN PRP$ NN', for 2) it would be transformed into 'NN VBN an NN during her NN'.
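A sketch of the two transformations; NLTK's default tagger is used here purely for illustration (the exact tags depend on the tagger, so the output may differ slightly from the example above):

from nltk import pos_tag, word_tokenize

CONTENT_PREFIXES = ("NN", "VB")  # noun and verb tags are kept as tags, other words stay as raw text

def pos_only(sentence):
    return " ".join(tag for _, tag in pos_tag(word_tokenize(sentence)))

def pos_and_function_words(sentence):
    return " ".join(tag if tag.startswith(CONTENT_PREFIXES) else word
                    for word, tag in pos_tag(word_tokenize(sentence)))

sentence = "Mary ate an apple during her lunchbreak"
print(pos_only(sentence))                # method 1: only POS tags
print(pos_and_function_words(sentence))  # method 2: POS tags for nouns/verbs, raw function words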

POS-tag distribution features

To see whether the distribution of POS tags in the dataset is correlated with the distribution of POS tags in the larger population of those 9 languages, an experiment was conducted. The Universal Dependencies corpus was used to fetch 5,000 sentences for each of the 9 languages and only their POS tags were saved. Subsequently, a Linear SVM model was trained on the POS-tags of those sentences to predict the class label of each of the instances of my dataset (train, development and test sets) converted to POS-tag features. Then I used both the probabilities for each of the 9 classes and the decision function outputs as features (separately, as two models). These features were then again fed into an SVM and optimized using grid search to get the best performance.


Linguistic features

In order to test whether linguistic features are helpful in determining the native language or gender of a person, the following features were used, both individually and together.

1. Sentence Length

Certain languages consistently use shorter or longer sentences. Hence, this characteristic may be transferred to the L2. This feature was used by, for instance, the best team on the essay track of the 2017 NLI Shared Task, ItaliaNLP Lab (Cimino and Dell'Orletta, 2017).

2. Number of Articles

Russian, Polish and Hindi do not use any articles. Persian only uses the indefinite article. It is hypothesized that by counting the number of articles in each L2 text, some underlying L1 knowledge may be found. That is, authors from one of these native languages may struggle to use articles in their L2 and consequently use them less frequently or not at all.

3. Double negatives

Some languages, like Russian, use double negatives. Therefore, this feature captures whether multiple negations are used within a sentence. If a text (300 words) contains a sentence with more than one negation, the feature is positive (1) otherwise, it is negative (0).

4. Capitalization

The rationale of this feature is that in certain languages (e.g. German), more words are capitalized than in others. It is hypothesized that this may transfer to the L2. Therefore, the words that are capitalized but are not in all caps are counted.

5. Word order

Syntax is one of the typical features which (unconsciously) transfers to an L2. It is one of the hardest things to learn and one of the hardest things to lose (e.g. in language attrition). Therefore, word order may hint at the linguistic background of the author. In the present study, the majority of the native languages use SVO word order in most situations (Italian, Spanish, Portuguese, Russian and Polish). German and Dutch use SVO in main clauses and SOV in subordinate clauses; Hindi and Persian use SOV word order. The word order was determined with the Spacy library, which was used for dependency parsing; the parse was subsequently used to determine the order of subject, verb and object, if all were present (a sketch is given directly after this list). Since subjects, objects and verbs are not always all three present, the order could sometimes not be established.
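The following sketch shows how such an order can be read off a spaCy dependency parse (the model name and the exact dependency labels are assumptions; the thesis does not specify them):

import spacy

nlp = spacy.load("en_core_web_sm")  # assumed English model

def word_order(sentence):
    # Return the relative order of subject (S), verb (V) and object (O), or None if not all present.
    doc = nlp(sentence)
    positions = {}
    for token in doc:
        if token.dep_ in ("nsubj", "nsubjpass"):
            positions.setdefault("S", token.i)
        elif token.dep_ in ("dobj", "obj"):
            positions.setdefault("O", token.i)
        elif token.pos_ == "VERB":
            positions.setdefault("V", token.i)
    if len(positions) < 3:
        return None
    return "".join(sorted(positions, key=positions.get))

print(word_order("Mary ate an apple"))  # expected: 'SVO'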

Gender features

1. Color words

Gender researchers have conducted studies on the use of color words by men and women. Nowaczyk found that women can name more color words than men and also name less conventional color words more often (Nowaczyk, 1982). This feature therefore counts the number of 'regular' color words in a text as well as the number of 'special' color words. Special color words are color words beyond the traditional colors (blue, red, yellow, black, gray, white, green, purple, pink, brown), such as navy blue, burgundy red, ocher etc.


2. Fillers

Fillers are words which are used to fill up conversation while thinking about what to say next, without signaling to the conversational partner that the speaker is done talking. Examples are 'ok', 'basically', 'totally', 'er', 'anyway'. These filler words are said to be used more often by women than by men (Holmes, 2001); therefore it is interesting to see if this is also true in the second language. A list of fillers was retrieved from https://github.com/words/fillers.

3. Hedges

Hedges, a form of mitigating language, are found to be used more often by women than by men (Holmes, 1995). Examples are 'sort of', 'suggest', 'in my opinion', 'apparently', and 'appear to be'. This linguistic feature checks whether that is true by counting them in each piece of text. A list of hedges was retrieved from https://github.com/words/hedges.

4. Swearwords

Lakoff addressed the differences between men and women in terms of using swearwords (Lakoff, 1973). Hence, a feature that counts the number of swearwords in a text was used. This was done by using a list of swearwords, manually filtered for ambiguous words.

5. Sentiment

This feature is a fairly experimental one. It tests whether one gender has a more positive or more negative sentiment than the other and whether a certain native language has a more or less positive sentiment than the others.

6. Adjectives

Lakoff also found that women use more adjectives than men (Lakoff, 1973). For this study, a feature will be used which counts the number of adjectives used.

7. Minimal Responses

Minimal responses like 'hm', 'okay', and 'right' are used by both men and women, but women use them to demonstrate participation in a conversation and to keep it going, doing what Fishman calls the 'conversational shitwork', whereas men use minimal responses to demonstrate a lack of interest (Fishman, 1997); so the female responses should be more frequent. Then again, Twitter and Medium data is not the same in terms of structure as conversations, so this feature may not occur often enough to investigate it properly.

8. Punctuation

The reasoning behind this linguistic feature is that research has found that men use more aggressive language than women, and women are said to ask more questions than men (Lakoff,1973). This might be evident from their use of exclamation and question marks.

5.1.2 Decision Trees

For the decision tree model, first an initial couple of test runs were done using different settings. From this, it emerged that the parameters and values in Table 2 would benefit most from optimization.



Parameters

N-gram-range    Analyzer    Max depth    Min samples leaf
(1,1)-(1,3)     word        5            5
(1,6)-(1,8)     char        10           10
                            20

Table 2: Decision Tree parameters optimized using grid search

Features

As features, word and character n-grams were used. The n value was determined using grid search.

5.1.3 Gradient Boosting

First the n-gram-range and analyzer parameters were optimized using the ranges shown in Table 2; then the best options (n-gram range (2,3) and character-level analyzer) were used in the grid search for the best gradient-boosting-related parameters. The parameter settings that were queried are found in Table 3.

Parameters

N-gram-range    Analyzer    Max depth    Min samples leaf    N_estimators
(2,3)           char        3            1                   100
                            6            5                   300

Table 3: Gradient Boosting grid search parameters
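As an illustration of such a grid search over a tf-idf + Gradient Boosting pipeline (Scikit-Learn's GridSearchCV uses cross-validation, whereas the thesis tunes on a fixed development set, which could be mimicked with a PredefinedSplit; the data variables are placeholders):

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
    ("gb", GradientBoostingClassifier()),
])
param_grid = {
    "gb__max_depth": [3, 6],
    "gb__min_samples_leaf": [1, 5],
    "gb__n_estimators": [100, 300],
}
search = GridSearchCV(pipeline, param_grid)
# search.fit(train_texts, train_labels); search.best_params_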

Features

As with the Decision Tree model, word and character n-gram features were used in which the value of n was determined using grid search.

5.1.4 K-Nearest Neighbors

Parameters

N-gram-range    Analyzer    Neighbors    Weight
(1,1)-(1,8)     word        20           uniform
(2,3)                                    distance

Table 4: K-Nearest Neighbors parameters optimized using grid search


Features

As with the Decision Tree model and the Gradient Boosting model, word and character n-gram features were used in which the value of n was determined using grid search.

5.2 Neural machine learning models

5.2.1 Word embeddings features experiment

In order to determine what kind of word embeddings to use, an experiment was conducted. It was not obvious which embeddings to use, since the data that is used in this thesis is learner English. Do we use learner English embeddings, 'regular' pre-trained social media (Twitter) embeddings (which are often larger), retrofitted embeddings, initialized learner embeddings, or a combination of learner and regular embeddings (through concatenation)? To determine which embeddings to use, 25D and 100D embeddings in all categories were tested on an SLP and an LSTM.

Structure

The single layer perceptron had one input and one output layer, a batch size of 16, used the adam optimizer, ran for 50 epochs, and used softmax activation for the output layer. The LSTM had 128 neurons, followed by 0.5 dropout and an output layer. It used a batch size of 16 and the adam optimizer, ran for 50 epochs, and used softmax activation for the output layer.
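A minimal Keras sketch of the LSTM variant just described (the input shape, i.e. sequence length and embedding dimensionality, and the number of output classes are illustrative assumptions):

from keras.models import Sequential
from keras.layers import LSTM, Dense, Dropout

# Assumed input: sequences of 300 tokens, each mapped to a 100-dimensional embedding.
model = Sequential([
    LSTM(128, input_shape=(300, 100)),  # 128 neurons, as described above
    Dropout(0.5),
    Dense(9, activation="softmax"),     # e.g. 9 native-language classes
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, batch_size=16, epochs=50, validation_data=(X_dev, y_dev))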

Subtasks

The following subtasks can be distinguished:

1. The first subtask uses regular pre-trained Glove Twitter embeddings. The embeddings were created using 2 billion tweets and 27 billion tokens.

2. The second subtask uses the same regular pre-trained Glove Twitter embeddings but now controlled for size, so its size is the same as the learner embeddings.

3. The third subtask uses learner embeddings. The learner English embeddings were composed of 5000 sentences from the Cambridge Learner Corpus and my own dataset (Cambridge University Press, 2017).

4. The fourth subtask uses retrofitted embeddings. Each of the regular pre-trained, pre-trained controlled for size, and learner embeddings (in both 25 and 100 dimensions) was retrofitted on the most informative features and on out-of-vocabulary words.

5. The fifth subtask uses initialized learner embeddings. As the name suggests, learner embeddings were initialized using the pre-trained Glove Twitter embedding space.


6. The sixth subtask uses concatenated embeddings. These are a combination of learner and regular pre-trained Glove Twitter embeddings, each in either 25 dimensions (50 total) or 100 dimensions (200 total).

Retrofitting

In the first retrofitting experiment, the most informative features of the SVM model were retrieved (5 features) for within-genre native language prediction for both Twitter and Medium. The within-genre model was used because it performed better than the cross-genre model in the SVM experiment (see Table 6: 0.235 and 0.300 for cross-genre Twitter-Medium and Medium-Twitter, 0.822 and 0.867 for within-genre Twitter and Medium respectively). So 5 features per language per genre leaves us with 90 features. These features were then combined into a retrofitting list and this list was subsequently applied to an existing embedding model (.txt) using the retrofitting package (Faruqui et al., 2015). In total, 6 new embedding files were created: 2 retrofitted on regular embedding data (Glove) (100 and 25 dimensions), 2 retrofitted on learner data (100 and 25 dimensions) and 2 retrofitted on regular embedding data controlled for size (25 and 100 dimensions).

In the second retrofitting experiment, each of the regular pre-trained, pre-trained controlled for size and learner embeddings (in both 25 and 100 dimensions) was retrofitted on out-of-vocabulary words. The out-of-vocabulary words were determined by selecting all data for each native language as a set, then filtering out words that are included in either the Words corpus (NLTK) or WordNet. What is left are native-language-specific out-of-vocabulary words, which might be indicative of their English and could improve the feature space. These words were then split into groups of 5 words (consistent with the most informative features approach), after which the permutations of each of these groups of 5 were calculated and added to a retrofitting list. The number 5 was chosen because anything above 5 creates enormous numbers of permutations which are hard to handle in terms of computing power. As with the other retrofitting approach, 6 new embedding files were created.

The same procedures (most informative words and out-of-vocabulary words) were also attempted for gender, but neither process yielded good results. Most informative features yielded similar words for both classes, whilst out-of-vocabulary words gave a sample that was too large and not reminiscent of the respective gender in any way. Therefore, only the native-language-oriented retrofitted embeddings were created. Nevertheless, these word embeddings were tested on all eight tasks, including the four gender tasks, to see if they also have a positive effect on gender prediction.
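The out-of-vocabulary filtering step can be sketched as follows with NLTK (a simplified illustration; the exact filtering in the thesis may differ in detail):

from nltk.corpus import words, wordnet

english_vocab = set(w.lower() for w in words.words())

def out_of_vocabulary(tokens):
    # Keep tokens found neither in the NLTK Words corpus nor in WordNet.
    return {t for t in set(tokens)
            if t.lower() not in english_vocab and not wordnet.synsets(t)}

print(out_of_vocabulary(["the", "gezellig", "apple", "saudade"]))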

Concatenation and PCA

The final subtask uses concatenated embeddings. The feature vectors of two models (a combination of learner and regular pre-trained Glove Twitter embeddings, in both 25 and 100 dimensions) are concatenated to create a new set of feature vectors. Whenever a word was not found in both embedding sets, the feature vector of the one that was present is supplemented with an average embedding of the embeddings that do not contain said word. This way, no mismatch in dimensions is created. In addition to this first concatenation method, another set of embeddings was created by concatenating embeddings in the same way as before, only this time the concatenated embeddings are subsequently reduced back to the original size of the parts using Principal Component Analysis. Whenever a word was not found in both embedding sets, the original word's feature vector was used, since this vector was also of the target size and was deemed more informative than, for instance, averaging the word embeddings.
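The sketch below shows the concatenation and the PCA-based reduction, assuming the two embedding spaces are held in Python dicts mapping words to numpy vectors. For brevity, missing words are handled with the averaging fallback of the first method only; the PCA variant in the thesis keeps the original vector for such words.

```python
# Minimal sketch of embedding concatenation and PCA reduction. Assumption:
# `glove` and `learner` are dicts mapping words to numpy vectors of equal size.
import numpy as np
from sklearn.decomposition import PCA

def concatenate(glove, learner):
    avg_glove = np.mean(list(glove.values()), axis=0)
    avg_learner = np.mean(list(learner.values()), axis=0)
    combined = {}
    for word in set(glove) | set(learner):
        g = glove.get(word, avg_glove)      # fall back on the average embedding
        l = learner.get(word, avg_learner)  # when a word is missing in one space
        combined[word] = np.concatenate([g, l])
    return combined

def concatenate_with_pca(glove, learner, target_dim):
    """Concatenate as above, then reduce back to the original dimensionality."""
    combined = concatenate(glove, learner)
    vocab = list(combined)
    reduced = PCA(n_components=target_dim).fit_transform(
        np.vstack([combined[w] for w in vocab]))
    return {w: reduced[i] for i, w in enumerate(vocab)}
```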

All subtask results will be compared and the best embeddings will be used in the neural models mentioned below.

5.2.2 LSTM

Using the best word embeddings for each task, obtained in the word embedding experiment, a grid search was performed on the development set to obtain the best LSTM results. The parameters that were searched, and their values, are shown in Table 5. Note that in the word embedding experiment a higher number of epochs (50) and a higher dropout (0.5) were used for the LSTM. For the grid search, the number of epochs was reduced due to time constraints. This is not expected to make a significant difference; if it does, however, the original model can still be used as the 'best' model. Moreover, three LSTM layers are used instead of just one.

Batch size   Epochs   Dropout   Optimizer   Neurons
16           15       0.1       adam        64
32           30       0.3       adagrad     128

Table 5: Parameters used in grid search LSTM
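The sketch below shows one grid-search configuration with the three stacked LSTM layers described above. The Keras 2 style API, the fixed embedding layer and the categorical cross-entropy loss are assumptions made for illustration.

```python
# Minimal sketch of one grid-search configuration (Keras 2 style API). The
# layer setup beyond "three LSTM layers" and the loss function are assumptions.
from itertools import product
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

def build_lstm(embedding_matrix, max_len, n_classes, neurons, dropout, optimizer):
    vocab_size, dim = embedding_matrix.shape
    model = Sequential([
        Embedding(vocab_size, dim, weights=[embedding_matrix],
                  input_length=max_len, trainable=False),
        LSTM(neurons, dropout=dropout, return_sequences=True),
        LSTM(neurons, dropout=dropout, return_sequences=True),
        LSTM(neurons, dropout=dropout),
        Dense(n_classes, activation="softmax"),
    ])
    model.compile(loss="categorical_crossentropy", optimizer=optimizer,
                  metrics=["accuracy"])
    return model

# Grid over the values in Table 5 (batch size and epochs are passed to fit):
# for batch, epochs, drop, opt, units in product([16, 32], [15, 30], [0.1, 0.3],
#                                                ["adam", "adagrad"], [64, 128]):
#     model = build_lstm(emb_matrix, MAX_LEN, N_CLASSES, units, drop, opt)
#     model.fit(X_train, y_train, batch_size=batch, epochs=epochs,
#               validation_data=(X_dev, y_dev))
```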

5.2.3 BiLSTM

Using the best parameters found for the LSTM and the best embeddings per task, a BiLSTM model was constructed for each task; only the number of layers was varied (one or two BiLSTM layers).
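A minimal sketch of this variation, reusing the tuned LSTM settings and wrapping them in Keras' Bidirectional layer:

```python
# Minimal sketch: the tuned LSTM configuration wrapped in Bidirectional, with
# one or two BiLSTM layers as the only variable.
from keras.layers import Bidirectional, LSTM

def bilstm_layers(neurons, dropout, n_layers=1):
    return [Bidirectional(LSTM(neurons, dropout=dropout,
                               return_sequences=(i < n_layers - 1)))
            for i in range(n_layers)]
```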

5.3 model combination

5.3.1 Stacking SVM probabilities

In order to examine whether combining different features is actually beneficial, the probabilities of two different classifier outputs were stacked and fed into a new SVM classifier. Stacking was done for i) the best n-gram/POS-tag/POS-tag and function word features (depending on which performed best on the respective task) + linguistic features, ii) n-grams + POS-tag features, and iii) n-grams + POS-tag and function word features (sometimes abbreviated to 'pos func' or 'pos-func' for brevity).
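A minimal sketch of this stacking step is given below: the class probabilities of two fitted SVM pipelines over the same documents are concatenated and used to train a new SVM. The variable names (ngram_svm, pos_svm, X_*_ngrams, X_*_pos) are hypothetical.

```python
# Minimal sketch of stacking SVM probabilities (scikit-learn).
import numpy as np
from sklearn.svm import SVC

def stack_probabilities(clf_a, clf_b, X_a, X_b):
    """clf_a/clf_b must be fitted with probability=True; X_a/X_b are the two
    feature representations of the same documents."""
    return np.hstack([clf_a.predict_proba(X_a), clf_b.predict_proba(X_b)])

# meta_svm = SVC(kernel="linear", probability=True)
# meta_svm.fit(stack_probabilities(ngram_svm, pos_svm, X_train_ngrams, X_train_pos),
#              y_train)
# dev_preds = meta_svm.predict(stack_probabilities(ngram_svm, pos_svm,
#                                                  X_dev_ngrams, X_dev_pos))
```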

5.3.2 Ensembles

Using the best SVM and the best neural model per task, an ensemble was constructed by averaging the predictions made by these two models.
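A minimal sketch of the ensemble, under the assumption that "averaging the predictions" means averaging class probabilities and taking the argmax:

```python
# Minimal sketch of averaging the outputs of the best SVM and the best neural
# model (probability averaging is an assumption).
import numpy as np

def ensemble_predict(svm_probs, neural_probs, labels):
    """Both inputs are (n_docs, n_classes) probability arrays over `labels`."""
    avg = (svm_probs + neural_probs) / 2.0
    return [labels[i] for i in np.argmax(avg, axis=1)]
```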


5.4 task dependency experiment

In order to determine whether multi-task learning is a good fit for the tasks at hand, i.e. whether gold information (labels) from one task may help in modeling the other, an experiment was conducted. The experiment consisted of making predictions based on subsets of the data. To predict native language, the following predictions were made:

1. Native language was predicted over only the male data, thereby eliminating the gender variable. This sample is of equal size to the second group, which uses female data (see below).

2. Native language was predicted over only the female data; again, the size of this sample was controlled for.

3. Mixed predictions for native language, with equal amounts of female and male data. This is the control experiment. If either of the one-gender samples performs better than this mixed sample, it may be possible that there is an interaction between a certain gender and native language, so gender may have a distinguishing role.

4. Predictions on the entire dataset with gender supplied as an additional feature. In this case, gender is used as an extra feature, to see if that knowledge is reflected in the performance of the model (a sketch of this feature extension is given at the end of this section). If the accuracy on the entire dataset with gender supplied is higher than the original accuracy without gender supplied, this may also indicate a relationship between gender and native language.

5. Predictions on the mixed dataset (see 3) with gender supplied as an additional feature. If this model performs better than the one trained on mixed data without gender supplied, this may be another sign that gender might help to predict native language.

The same was done the other way around: native language-specific gender predictions in the above-mentioned subgroups.

In addition, the best stacked feature combinations found (see Section 5.3.1) per task were used in the task dependency experiment too, to see if less sparse feature spaces (of only 18 or 4 features) are more suitable for this experiment and give better results in combination with an additional gender or native language feature.
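The sketch below illustrates the feature extension used in items 4 and 5 of the list above: the gold gender label is appended as one extra column to the existing document-feature matrix. The binary encoding of gender is an assumption made for illustration.

```python
# Minimal sketch: adding the gold gender label as one extra (binary) feature
# column next to the existing sparse feature matrix before training.
import numpy as np
from scipy.sparse import csr_matrix, hstack

def add_gender_feature(X, genders):
    """X: sparse document-feature matrix; genders: list of 'M'/'F' labels."""
    gender_col = csr_matrix(np.array([[1.0 if g == "F" else 0.0]
                                      for g in genders]))
    return hstack([X, gender_col])
```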

5.5 multi-task learning

If the results of the task dependency experiment are promising, a multi-task learning model will be set up in which the two tasks are learned simultaneously using task-specific and shared layers and a shared loss function.

The task-specific layers each consisted of an embedding layer, which takes word embeddings as input, and a Bidirectional LSTM layer with recurrent and regular dropout set to 0.3. The shared part is a Dense layer with 60 units, followed by 50% dropout and an output layer for each of the two tasks. The loss is shared, but the weight of the main task is set to 1.0 whereas the weight of the auxiliary task is set to 0.4. These values were arrived at by means of stepwise tuning of the parameters.
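A minimal sketch of this architecture follows. The number of BiLSTM units and the way the two task-specific branches feed the shared Dense layer (layer reuse rather than concatenation) are assumptions; the dropout values, the 60 shared units and the loss weights come from the description above.

```python
# Minimal sketch of the multi-task model (Keras 2 functional API).
from keras.models import Model
from keras.layers import Input, Embedding, Bidirectional, LSTM, Dense, Dropout

def build_mtl(emb_matrix, max_len, n_languages, n_genders, lstm_units=64):
    vocab_size, dim = emb_matrix.shape

    def task_branch(name):
        # Task-specific embedding + BiLSTM branch
        inp = Input(shape=(max_len,), name=name + "_input")
        emb = Embedding(vocab_size, dim, weights=[emb_matrix], trainable=False)(inp)
        enc = Bidirectional(LSTM(lstm_units, dropout=0.3,
                                 recurrent_dropout=0.3))(emb)
        return inp, enc

    lang_in, lang_enc = task_branch("lang")
    gen_in, gen_enc = task_branch("gender")

    shared = Dense(60, activation="relu")          # shared layer with 60 units
    lang_out = Dense(n_languages, activation="softmax",
                     name="lang")(Dropout(0.5)(shared(lang_enc)))
    gen_out = Dense(n_genders, activation="softmax",
                    name="gender")(Dropout(0.5)(shared(gen_enc)))

    model = Model(inputs=[lang_in, gen_in], outputs=[lang_out, gen_out])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  loss_weights={"lang": 1.0, "gender": 0.4})  # main vs. auxiliary
    return model
```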


5.6 feature analysis

In order to determine if there are certain typical features which distinguish groups by their gender or native language, the most informative features per class per task will be determined for both the Twitter data and the Medium data.

In order to determine whether gender features are expressed differently per native language, the most informative gender features per native language will be determined for both Twitter and Medium.

In order to determine whether native language features are expressed differently per gender, the most informative native language features per gender will be determined for both Twitter and Medium.
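The sketch below shows one way to read off the most informative features per class from a linear one-vs-rest classifier (e.g. LinearSVC) and its fitted vectorizer; `get_feature_names_out` assumes scikit-learn ≥ 1.0.

```python
# Minimal sketch of retrieving the n most informative features for one class.
import numpy as np

def most_informative(vectorizer, linear_clf, class_index, n=5):
    """Return the n features with the highest coefficients for one class.
    For binary tasks coef_ has a single row, so class_index is 0."""
    names = np.array(vectorizer.get_feature_names_out())
    coefs = linear_clf.coef_[class_index]
    return names[np.argsort(coefs)[-n:][::-1]].tolist()
```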


6

RESULTS AND DISCUSSION

In this chapter, the results of the experiments will be presented and discussed. Section 6.1 deals with the results of the statistical and neural models used in this study and the best parameter settings. Section 6.2 describes the task dependency and multi-task learning experiments. Section 6.3 reports and discusses the best models, which are shown in Tables 30 and 31. Finally, Section 6.4 looks at the most informative features per native language and per gender, and analyzes the most informative features for each language per gender and for each gender per language.

6.1 statistical and neural models

6.1.1 Results SVM

In Table 6, you will find the results of the best SVM models after optimization using grid search. The best models per task are in bold. The parameters used in these best models can be found in Table 7. As you can see, the n-grams model is often the best model, with the exception of the cross-genre gender task trained on Twitter data, in which case the model using POS-tag and function word features is the best. Moreover, all within-genre models outperform their cross-genre counterpart by quite a large margin. In addition, character n-gram models work best in 6 out of 8 cases. Table 8 shows the results of using each linguistic feature separately to train an SVM, as well as the results of training it on gender-specific linguistic features (All G) or native language-specific (All NL) features and all 13 (native-language and gender specific together), which is also shown in Table 6. The linguistic features never outperform the models trained on n-grams, POS-tags or POS-tags and function words.

Task                                           N-grams   POS tags   POS + func   POS distr. (NL)   Linguistic (all 13)
Cross-genre Native Language Twitter - Medium   0.235     0.196      0.195        0.172             0.153
Cross-genre Native Language Medium - Twitter   0.300     0.254      0.279        –                 0.252
Cross-genre Gender Twitter - Medium            0.641     0.644      0.666        N/A               0.619
Cross-genre Gender Medium - Twitter            0.635     0.611      0.614        N/A               0.599
Within-genre Native Language Twitter           0.822     0.478      0.638        0.266             0.341
Within-genre Native Language Medium            0.867     0.487      0.636        0.237             0.252
Within-genre Gender Twitter                    0.932     0.719      0.856        N/A               0.662
Within-genre Gender Medium                     0.908     0.682      0.778        N/A               0.616

Table 6: Accuracy scores for 5 different types of features on the development set, best scores in bold


Task                                           Best model   Word/char   Kernel   C    N-gram range
Cross-genre Native Language Twitter - Medium   N-gram       char        Linear   1    (1,5)
Cross-genre Native Language Medium - Twitter   N-gram       char        Linear   10   (1,5)
Cross-genre Gender Twitter - Medium            POS + func   word        Linear   1    (1,2)
Cross-genre Gender Medium - Twitter            N-gram       word        Linear   20   (1,1)
Within-genre Native Language Twitter           N-gram       char        Linear   20   (1,4)
Within-genre Native Language Medium            N-gram       char        Linear   10   (1,5)
Within-genre Gender Twitter                    N-gram       char        Linear   10   (1,4)
Within-genre Gender Medium                     N-gram       char        Linear   10   (1,7)

Table 7: Parameters of best models per task (not including stacking models as best models here)

Task                 Adj.    Det.    Capit.   Color   Fillers   Hedges   Min.resp.   Mult.negs   Punct   Sent length   Sentiment   Swearwords   WO      All NL   All G   All 13
Cross-genre NL T-M   0.206   0.206   0.206    0.206   0.196     0.191    0.204       0.196       0.164   0.206         0.206       0.206        0.198   0.180    0.165   0.153
Cross-genre NL M-T   0.250   0.246   0.194    0.250   0.250     0.250    0.250       0.249       0.250   0.232         0.250       0.250        0.250   0.211    0.256   0.252
Cross-genre G T-M    0.619   0.620   0.620    0.626   0.620     0.620    0.620       0.620       0.620   0.620         0.620       0.620        0.620   0.619    0.619   0.619
Cross-genre G M-T    0.599   0.599   0.601    0.606   0.603     0.599    0.599       0.599       0.599   0.599         0.599       0.599        0.599   0.600    0.599   0.599
Within-genre NL T    0.266   0.280   0.273    0.270   0.290     0.273    0.283       0.276       0.276   0.273         0.280       0.270        0.287   0.297    0.328   0.341
Within-genre NL M    0.207   0.223   0.223    0.170   0.210     0.203    0.190       0.190       0.190   0.207         0.190       0.190        0.197   0.226    0.223   0.252
Within-genre G T     0.597   0.597   0.590    0.570   0.577     0.563    0.563       0.563       0.567   0.594         0.577       0.577        0.570   0.659    0.631   0.662
Within-genre G M     0.580   0.580   0.580    0.580   0.580     0.580    0.580       0.580       0.593   0.584         0.580       0.580        0.580   0.580    0.593   0.616

Table 8: Accuracy of linguistic features on the development set, best models in bold. Column names in full: Adjectives, Determiners, Capitalization, Color, Fillers, Hedges, Minimal Responses, Multiple negation, Punctuation, Sentence Length, Sentiment, Swearwords, Word Order, All Native Language Features, All Gender Features, All 13 Features. G = Gender, NL = Native Language, T = Twitter, M = Medium.

Task                     Best + best linguistic   Best n-gram + POS   Best n-gram + POS func   Acc. best reg. SVM
Cross-genre NL T-M       0.231                    0.234               0.227                    0.235
Cross-genre NL M-T       0.260                    0.264               0.268                    0.300
Cross-genre Gender T-M   0.590                    0.641               0.638                    0.666
Cross-genre Gender M-T   0.634                    0.633               0.631                    0.635
Within-genre NL T        0.857                    0.857               0.853                    0.822
Within-genre NL M        0.856                    0.849               0.846                    0.867
Within-genre Gender T    0.939                    0.935               0.949                    0.932
Within-genre Gender M    0.889                    0.885               0.892                    0.908

Table 9: Stacking experiment results, comparing the accuracy of the best stacked models to the best 'regular' SVM accuracy without stacking (on the development set). NL = Native Language, T = Twitter, M = Medium. POS func = POS and function word features.

Table 9 shows the best results for the stacking experiment, in which the probability predictions of different SVM models were stacked to attempt to generate better results. The best settings were also optimized using grid search here. Unfortunately, the only improvements were seen for the within-genre Twitter tasks, for both native language and gender prediction. Within-genre NL on Twitter data improved from 0.822 to 0.857 with stacked n-grams + POS features (or linguistic features, same score). Within-genre gender on Twitter data improved from 0.932 to 0.949 with n-grams + POS and function word features. The best stacking features per task were also used in the task dependency experiment, which will be described later on.
