
Author Profiling: The Effect of Gender on Social Media Language Use

Malon van Essen

Master thesis Information Science
Malon van Essen, S3228991


A B S T R A C T

Author profiling, the automatic identification of characteristics of an author, has become an important element of Natural Language Processing (NLP). An author does not need to be the writer of a book; it can also be someone who posted a piece of text on social media.

This thesis focuses on one of the characteristics of author profiling: gender. Specifically, it examines the influence of gender on the language that is used on social media. The social media platforms from which data is used are Twitter and Reddit.

Multiple classifiers are created, one for each gender and social media platform combination, which results in six different classifiers. While in the majority of previous work gender consisted of only 'male' and 'female', this research also includes 'non-binary'. Non-binary includes any gender identification that does not belong to either male or female.

The data is collected from both Twitter and Reddit and is annotated with the gender of the user who wrote the social media posts. The Twitter set was split into a 60% training, 20% validation, and 20% test set. For the Reddit data a 70/15/15 split was used.

In the texts, many words are replaced with tags, followed by the creation of n-grams and additional features which are added to the classifiers. The final Twitter classifiers have an accuracy of 79% (non-binary), 72% (male), and 76% (female). The final Reddit classifiers have an accuracy of 71% (non-binary), 71% (male), and 74% (female).

The top contributing features for the Twitter non-binary classifier are typically non-binary related, but the features of the male and female classifiers are not necessarily related to the gender, even though these still contain elements such as the male (♂) and female (♀) symbols, as well as profanity use, which can be linked to gender.

The top contributing features for Reddit can mostly be classified as pronouns, and, in particular for the non-binary classifier, there are features clearly related to the gender.

The influence of gender is limited, and there are other factors that play a role in the language and writing style of users on social media. One of these factors is the social media platform itself: whether a user creates a post on Twitter or on Reddit affects their language use.


C O N T E N T S

Abstract
Preface
1 Introduction
2 Background
3 Data
  3.1 Twitter
    3.1.1 Collection
    3.1.2 Processing
  3.2 Reddit
    3.2.1 Collection
    3.2.2 Processing
  3.3 Final data
4 Method
  4.1 Pre-processing
  4.2 Baseline models
  4.3 Feature selection
  4.4 Final models
  4.5 Cross genre testing
  4.6 Additional research
5 Results
  5.1 Non-binary
  5.2 Male
  5.3 Female
  5.4 Additional results
6 Conclusion
  6.1 Discussion
  6.2 Future work
Appendices
  A Features
  B Final models
  C Top features
    C.1 Twitter
    C.2 Reddit


P R E F A C E

You are currently reading my thesis 'Author Profiling: The Effect of Gender on Social Media Language Use'. This thesis has been written as part of my master's in Information Science. Even though the process has frustrated me more than once, and the original concept for this thesis did not entirely work, I am pleased with the final system and its results. It was quite interesting to learn more about author profiling and specifically about identifying gender on social media.

I would like to thank my supervisor, Gosse Bouma, for guiding me and helping me get started at the beginning of the process. I would also like to thank him for providing ideas, such as combining multiple posts from the same user to create longer texts, which have helped me while working on this thesis.

1 I N T R O D U C T I O N

Automatic identification of characteristics of an author, otherwise known as author profiling, can be a very useful tool. For example, take the two characteristics 'gender' and 'age'. Knowing these characteristics can be very useful for targeted advertising, since the advertisements shown to an elderly woman should preferably be very different from those shown to a teenage boy.

To be able to use author profiling as a tool for things such as targeted advertising, it needs to be possible to determine what the characteristics of a writer are.

This thesis specifically focuses on the characteristic 'gender' and on which features contribute to the prediction of the correct gender from social media data. The social media platforms used for this research are Twitter (https://twitter.com/) and Reddit (https://www.reddit.com/).

In recent years social media has become a valuable source of data for research. However, on most social media platforms users can be completely anonymous, which can make doing research significantly more difficult. This is where author profiling comes into play, for example by making it possible to determine the gender of an anonymous user. This is useful for research, but also, as mentioned before, for marketing, since it makes the use of targeted advertisements easier.

This thesis aims to answer the following research question:

What influence does gender have on the language and writing style that is used on social media?

In addition to the research question, there are a number of sub-questions that will be answered in this research:

(1) How universal is the effect of gender on language?

(2) Is there a noticeable pattern among wrongly predicted users?

The questions are answered using three types of gender: male, female, and non-binary. In the context of this research, non-binary includes any gender identification that does not belong to either the male or female category.

Most often, non-binary is either not included or barely included in authorship attribution studies regarding gender and social media (see chapter 2). This means that a large group of users is left out of research. The current research includes non-binary as a gender in an effort to discover how gender influences language on social media, and to discover whether non-binary users are distinguishable from users of other genders.

To answer the research question and the related sub-questions, first the background and related works are explored (chapter 2). What follows is an explanation of the data that is collected and pre-processed (chapter 3) and subsequently used for the creation, training, and testing of several classifiers on both Twitter and Reddit data (chapter 4). In chapter 5 the results from the classifiers are discussed, and in chapter 6 the research question is answered using the information collected in the previous chapters.


2 B A C K G R O U N D

Krüger and Hermann (2019) investigated the recent state of the art in gender identification in texts, based on scientific publications published between January 2017 and January 2019. After filtering papers, they arrived at a total of 59 approaches to gender identification. All approaches made use of the binary gender model, which according to Krüger and Hermann (2019) came as no surprise given previous research on gender detection systems, but puts into question how useful those systems are in 'real-world' situations.

A common pre-processing approach in the different systems is tokenization and replacing markers such as usernames and hashtags with placeholders. All approaches made use of machine learning algorithms in some way, ranging from neural networks to support vector machines (SVMs), mostly applied to social media text. The accuracy of the systems mostly ranges from 75% to 85%, with some systems scoring lower and two systems reaching an accuracy above 90%.

More recently, the PAN Author Profiling Task 2019 introduced the Celebrity Profiling Task (Wiegmann et al., 2019). This task included, in addition to male and female subsets, a small subset of 50 non-binary people. While this slightly expanded the classification problem, non-binary made up only 0.1% of the total data set. A total of 8 participants submitted a system, with the most popular algorithms being SVM and logistic regression. The best result was obtained using a logistic regression system with tf-idf vectors of the 10,000 most frequent bigrams, with an accuracy score of 92.6%. Similar to the approaches in the Krüger and Hermann (2019) research, replacing usernames, hashtags, and links with placeholders is common practice.

Because the non-binary subset introduced by Wiegmann et al. (2019) is so small, treating the multi-class classification problem as just another binary classification between men and women would not have a big effect on the performance of the classifier.

This thesis aims to extend previous work by creating a data set which is equally divided between a male, a female, and a non-binary subset. Since the results of Krüger and Hermann (2019) and Wiegmann et al. (2019) suggest that both SVMs and logistic regression are still viable choices for gender prediction when trying to achieve a good accuracy, those algorithms will be used for the classifiers.

Some of the works mentioned above, while using replacement tags for usernames, hashtags, and the like, still depend largely on the topic and context of the posts. This can introduce bias, with, for example, posts about cars being classified as male and posts about make-up being classified as female.

To eliminate this bias in the current research, context words are replaced by their part-of-speech tags. This means that the classifiers do not rely on the topic of conversation, but on the writing style of a person: sentence structure, parts of speech such as nouns and verbs, and word order. This gives a more general overview of how the different genders use language on Twitter and Reddit.

3 D A T A

For this thesis two new social media data sets are created, one from Twitter data and the other from Reddit data. There are several reasons for choosing to create two new social media data sets.

While there are multiple data sets available that include social media posts and gender information, most commonly Twitter data sets, these often contain only 'male' and 'female' as genders. This means that those data sets do not include non-binary users. In the existing data sets that do include posts from non-binary users, such as the data set used in the Celebrity Profiling Task (Wiegmann et al., 2019), those users make up only a fraction of the data set. As stated in chapter 2, in the case of the Celebrity Profiling Task non-binary users make up only 0.1% of the entire data set. For the current research it is preferable that the data is equally divided over the three gender groups: 'male', 'female', and 'non-binary'.

Other than Twitter, a Reddit data set is needed. While there are many other popular social media platforms, such as Tumblr, Facebook, and Instagram, these platforms do not necessarily provide many self-written text posts for every user. Reddit, on the other hand, is predominantly used as a platform for discussion, and therefore has a vast amount of available text posts. Although there are a number of Reddit data sets available, no recent data sets include the users' gender, which is why a new Reddit data set is created.

Even though the creation of the data sets is not exactly the same for both platforms, a general method is followed in both cases. This method consists of finding users, scraping posts from those users, cleaning the acquired posts, and finally dividing the data into training, validation, and test sets.

3.1 twitter

3.1.1 Collection

All Twitter data is collected using the Tweepy (https://www.tweepy.org/) Python library for accessing the Twitter API.

The first step of Twitter data collection is compiling a list of users for all genders. On Twitter, users often have their preferred pronouns, or their self-proclaimed gender, in either their display name or in their description (otherwise called a user's 'bio' on Twitter). To access this information a search function is used.

Since knowing the correct gender of a user is important, the search function searches for users that have keywords related to their gender in their display name or bio. These keywords can include pronouns, such as ’he/him’ which refers to a male, or statements, such as ’I am a woman’ which refers to a female.

The acquired users are then used to collect tweets. There are two main constraints, on the users as well as on the tweets that are collected:

Tweet threshold: A user has to have posted at least n tweets on their profile. In the case of this research, the threshold n is 1,000. A threshold is used because a user has to have a sufficient number of tweets on their profile: during the collection of tweets and the subsequent pre-processing, retweets are excluded and more tweets are removed, which means that the actual number of collected tweets will be smaller than the threshold. The threshold ensures that a user has enough tweets that can be used in the final data set for the classification.

No retweets: Since this thesis is based on text written by someone of a certain gender, using tweets that are retweets is not very useful. Retweets are almost always written by someone other than the user, and since the only interest is in tweets written by the user for whom the gender is known, retweets are of no value for this research.
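A minimal sketch of the collection step under these constraints, assuming Tweepy's user_timeline endpoint; the function name is illustrative.

```python
# Sketch: collect up to ~1,000 recent original tweets per user, skipping retweets.
# Assumes an authenticated tweepy.API object `api` as in the previous sketch.
import tweepy

def collect_tweets(api, screen_name, label, max_tweets=1000):
    tweets = []
    for status in tweepy.Cursor(
        api.user_timeline,
        screen_name=screen_name,
        count=200,              # maximum page size for this endpoint
        include_rts=False,      # the "No retweets" constraint
        tweet_mode="extended",  # full, untruncated tweet text
    ).items(max_tweets):
        tweets.append({"user": screen_name, "text": status.full_text, "gender": label})
    return tweets
```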

In the data set, the label used for the tweets is a shortened version of the gender of the user who has written them. For example, if the user is male, all his tweets are labeled with 'M'. The same happens for a female ('F') and a non-binary ('NB') user.

3.1.2 Processing

To be able to test some features, some pre-processing is required, which consists of cleaning the data, and dividing the data into training, validation, and test sets.

Even though the majority of the collected tweets are written in English, there are a number of non-English tweets that are removed. The non-English tweets are removed because this research only takes English tweets into account, and a code-mixed data set is therefore undesirable.
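A minimal sketch of such a language filter, assuming the third-party langdetect package; the tweet list is the hypothetical structure from the collection sketches.

```python
# Sketch: keep only tweets detected as English.
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def is_english(text):
    try:
        return detect(text) == "en"
    except LangDetectException:  # raised for empty or undecidable text
        return False

tweets = [{"text": "this tweet is written in English"}, {"text": "deze tweet is Nederlands"}]
english_tweets = [t for t in tweets if is_english(t["text"])]
```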

In addition, there is a different set of tweets that needs to be removed: the automated tweets. These automated tweets often occur when a user has linked their Twitter account to an app or a website. For example, when a user has linked their Twitter account to their YouTube account and shares a video, a tweet that contains the video title, video link, and YouTube’s username will be tweeted:

"TOP 10 BEST CAT VIDEOS OF ALL TIME!https://youtu.be/cbP2N1BQdYc

via @YouTube"

The issue with such tweets is that they contain little to no self-written text. As a result, the tweet does not represent the writing style of the user, and thus the tweet is of no use to this research.

There is a considerable number of tweets which do not need to be removed, but which need to be altered. These tweets contain social media entities such as usernames, links, and hashtags. While these social media entities are important features for other classification tasks, for the current research the entities provide no useful information in the format they are in (for example #RandomHashtag or @username). While the exact entities give little information, their presence and position in a tweet does provide some useful information. For this reason social media entities are replaced by a general tag (for example, a @username is replaced by the tag '<USER>').
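A minimal sketch of this entity generalization using regular expressions; the tag names match those used in the thesis.

```python
# Sketch: replace links, @usernames, and #hashtags with general tags.
import re

def replace_entities(text):
    text = re.sub(r"https?://\S+", "<LINK>", text)  # links
    text = re.sub(r"@\w+", "<USER>", text)          # @username mentions
    text = re.sub(r"#\w+", "<HASHTAG>", text)       # hashtags
    return text

print(replace_entities("Great talk by @username! #RandomHashtag https://t.co/x1y2z3"))
# -> "Great talk by <USER>! <HASHTAG> <LINK>"
```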

In addition to cleaning up tweets, the data set is cleaned as well. Among other things, duplicates are removed. A tweet is considered a duplicate when both the text and the user are the same. Including duplicates in the data set might bias the classifiers on the writing style in these duplicates, which is to be avoided.

For the division of data two things are needed: enough data points per user, and enough text per data point. First, some additional users and tweets are removed based on two constraints: minimum tweet length and minimum number of tweets per user. If these thresholds are not reached, the tweets or users are removed from the data set. Since a single tweet often contains too little information for a classifier, the tweets of the remaining users are concatenated per five. The result is, per user, a number of data points with texts of at least 50 tokens. This amount of text per data point should provide sufficient information for training and testing the classifiers.

Finally, the data is split into training, validation, and test sets. Since approximately equal portions of the genders are preferable, the data is split into separate sets per gender, instead of splitting the entire data set at once. The split that is used is a 60/20/20 division: 60% training set, 20% validation set, and 20% test set. While a bigger training set might benefit the classifiers, this division leaves enough data for the validation and test sets. The exact composition of the data sets can be found in section 3.3, Final data.
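A minimal sketch of such a per-gender split, assuming pandas DataFrames with a 'gender' column; splitting per gender keeps the class proportions approximately equal across the sets.

```python
# Sketch: 60/20/20 split performed per gender group, then recombined.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_per_gender(df, train_frac=0.6, val_frac=0.2, seed=42):
    trains, vals, tests = [], [], []
    for _, group in df.groupby("gender"):
        train, rest = train_test_split(group, train_size=train_frac, random_state=seed)
        # Split the remainder so that val_frac of the whole group becomes validation.
        val, test = train_test_split(rest, train_size=val_frac / (1 - train_frac),
                                     random_state=seed)
        trains.append(train)
        vals.append(val)
        tests.append(test)
    return pd.concat(trains), pd.concat(vals), pd.concat(tests)
```

For the Reddit data described below, the same helper could be called with train_frac=0.7 and val_frac=0.15.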

3.2 reddit

As stated previously, the method for collecting Reddit data is similar to the method for collecting Twitter data. Nonetheless, there are some differences, which are explained in this section.

3.2.1 Collection

For Reddit, the PRAW (https://praw.readthedocs.io/en/latest/) Python library is used for accessing the Reddit API.

The difficulty with Reddit is that users, contrary to Twitter users, rarely state their preferred pronouns or self-proclaimed gender in their username or description. This makes the process of determining the gender of a user more complicated. To bypass this problem, a search function is used to search for terms in text posts, instead of in usernames or descriptions. Through these posts, access is obtained to the user who posted the submission.
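A minimal sketch of this user-discovery step, assuming PRAW with script credentials; the search phrase is illustrative.

```python
# Sketch: find authors of submissions whose text matches a gender-related phrase.
import praw

reddit = praw.Reddit(client_id="CLIENT_ID", client_secret="CLIENT_SECRET",
                     user_agent="author-profiling-thesis")

def find_users(phrase, label, limit=100):
    users = []
    for submission in reddit.subreddit("all").search(f'"{phrase}"', limit=limit):
        if submission.author is not None:  # skip deleted accounts
            users.append((submission.author.name, label))
    return users

nb_users = find_users("I am non-binary", "NB")
```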

There are two methods of creating text posts, one is creating a new submission on for example a subreddit (which is a Reddit community dedicated to a particular topic), and the other is commenting on someone else’s post or comment. To create the Reddit data set, both of these submission methods are scraped for every user in the data set.

Reddit submissions consist of two elements: a title and a text element. When a user posts a submission only the title is required. For this reason, a submission can consist of only a title, or of both a title and a text element. As a result, for this thesis the title and the text element are combined to create a single text.

The data is labeled in the same way as the Twitter data: 'M' if the user is male, 'F' if the user is female, and 'NB' if the user is non-binary.

3.2.2 Processing

The Reddit data set requires additional pre-processing on top of the Twitter pre-processing method. Other than the tags that are used for the Twitter data, the Reddit data has some additional tags that are added because of the Reddit Markdown. These tags include Markdown syntax such as '_italic_' as '<ITALIC>' and '**bold**' as '<BOLD>'. Different from the social media entities, which are replaced by their respective tags, the text itself is not replaced. It is the symbols that surround the text that are replaced by a tag in front of the text; for example, '_italic text_' changes into '<ITALIC> italic text'.
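A minimal sketch of this Markdown generalization with regular expressions; only the surrounding syntax is replaced, while the text itself is kept.

```python
# Sketch: replace Markdown syntax with a tag in front of the affected text.
import re

def tag_markdown(text):
    text = re.sub(r"\*\*(.+?)\*\*", r"<BOLD> \1", text)  # **bold** -> <BOLD> bold
    text = re.sub(r"_(.+?)_", r"<ITALIC> \1", text)      # _italic_ -> <ITALIC> italic
    return text

print(tag_markdown("_italic text_ and **bold text**"))
# -> "<ITALIC> italic text and <BOLD> bold text"
```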

Before division of the data set, users need to be checked for possibly having had multiple genders assigned during collection. For example, one Reddit user was assigned two gender labels: 'M' and 'F'. This user posted two posts which got captured during queries for different genders: one post is titled 'I am a woman', which is why the user got labeled as female, while the user also has a post titled 'I am a man', which is why the user got labeled as male. For such users the gender is verified manually; if that is not possible, the user is removed from the data set.

For the division of the data set, the posts for every user are concatenated per four, compared to the five posts that are concatenated for the Twitter data. This is because the Reddit data set is smaller than the Twitter data set. Furthermore, Reddit posts are on average longer than Twitter posts, which means that fewer posts are necessary to get a text of sufficient length for the classifiers.

The final training, validation, and test set split is 70/15/15: 70% training set, 15% validation set, and 15% test set. This split is different from the Twitter data split because the Reddit data set is smaller, which means that a larger portion of the data set is needed to have enough data for training the classifier. This division also leaves enough data for both the validation and test sets. The exact composition of the data sets can be found in section 3.3, Final data.

3.3 final data

The data in all the sets is formatted in the following way: username, posts, gender label. Table 1 is an example from the Reddit data set.

Username: justvibinarthistory
Posts:    "['(Scout | They/them) Looking for advice on hair (more in comments)',
           'Same except the grandparents would be the first image again',
           'Hello I'm Scout and here is a self portrait! (they/them)',
           'I might make it a commission option when I turn 18 in two months!']"
Gender:   NB

Table 1: Example Reddit data

Table 2 and Table 3 show the composition of the Twitter and Reddit data sets, per data set and per gender or in total. Rows indicates the size of each data set: the Twitter training set has 11,175 rows, the validation set 3,725, and the test set 3,727; the Reddit training set has 10,059 rows, the validation set 2,155, and the test set 2,157. Users indicates the number of unique users per data set. Average post length is the average length of a post in characters.

The number of unique users in the different data sets is not equivalent to the actual total number of unique users in the entire data set. This is caused by the presence of the same user in multiple data sets, which is the result of splitting the data by fractions rather than by users.

                        Train   Validation    Test
Rows        Non-binary   3226         1075    1076
            Male         3924         1308    1309
            Female       4025         1342    1342
            Total       11175         3725    3727
Users       Non-binary    170          164     163
            Male          200          195     195
            Female        188          187     187
            Total         558          546     545
Avg post    Non-binary    774          764     769
length      Male          769          766     762
            Female        802          797     803
            Total         783          777     779

Table 2: Composition of the Twitter data sets


                        Train   Validation    Test
Rows        Non-binary   3238          694     694
            Male         3300          707     708
            Female       3521          754     755
            Total       10059         2155    2157
Users       Non-binary    214          184     179
            Male          197          176     177
            Female        208          183     181
            Total         619          543     537
Avg post    Non-binary   1032         1013    1018
length      Male          987          987     927
            Female       1013         1012     951
            Total        1011         1004     964

Table 3: Composition of the Reddit data sets

4 M E T H O D

The concept behind this thesis is to get a better understanding of the influence of gender on writing styles on social media, more specifically Twitter and Reddit. To get a good impression of the influence of each of the genders on both platforms, separate binary classification models are created for each gender (male, female, and non-binary). In other words, a total of six classifiers are created: three binary classifiers each for Reddit and Twitter.

Despite the fact that different platforms and classifiers are used, the method of creating a classifier is the same for every classification model, whether it be for Twitter data, or for Reddit data.

The following sections explain the process of creating the classifiers, including Pre-processing, Baseline models, and Feature selection. Furthermore, the Additional research that is done using the results from the classifiers is explained.

4.1 pre-processing

In addition to the pre-processing done in chapter 3, further pre-processing is required to prepare the data for the classifiers. These additional pre-processing steps consist of, among other things, generalization and normalization of the data.

Social media data often contains repetition, both in the shape of tokens and of characters. Repetition in social media texts is possibly used for adding emotion or sentiment (Darics, 2013), and while this is useful information for sentiment analysis, with regards to the current research it primarily adds noise to the data. It is for this reason that repeated characters are compressed to a single occurrence of that character. For example, 'yaaaaaas' is compressed to 'yas'. The compression is only done for characters that are repeated in sequence three or more times, since there are many English words where a single letter is repeated once (think of 'apple').
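A minimal sketch of this compression step with a regular expression:

```python
# Sketch: compress runs of three or more identical characters to one.
import re

def compress_repetition(text):
    return re.sub(r"(.)\1{2,}", r"\1", text)

print(compress_repetition("yaaaaaas"))  # -> "yas"
print(compress_repetition("apple"))     # -> "apple" (double letters are kept)
```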

Emojis are also often repeated, but contrary to other characters, emojis are not compressed to single occurrences; they are separated. When they are not separated, a sequence of emojis is counted as a single token. These sequences provide less information than the individual emojis, since the exact sequences are often very sparse. As a solution the emojis are separated, which gives a better overall picture of which emojis are used and in what order they are used.
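A minimal sketch of the separation step, assuming the third-party emoji package (version 2.x), whose emoji_list helper locates emojis in a string:

```python
# Sketch: insert spaces so that consecutive emojis become separate tokens.
import emoji

def separate_emojis(text):
    out, last = [], 0
    for match in emoji.emoji_list(text):  # dicts with 'match_start', 'match_end', 'emoji'
        out.append(text[last:match["match_start"]])
        out.append(" " + match["emoji"] + " ")
        last = match["match_end"]
    out.append(text[last:])
    return " ".join("".join(out).split())  # normalize whitespace

print(separate_emojis("so happy 😂😂😂"))  # -> "so happy 😂 😂 😂"
```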

The biggest pre-processing step is generalizing the data. Broadly speaking, generalization is done by replacing certain words by tags. The purpose of generalization is to prepare the data in such a way that the focus is on writing style, sentence structure, and word order instead of, for example, context words. The three major kinds of tags that words are replaced with are part-of-speech tags, abbreviation tags, and profanity tags. Part-of-speech tagging does not happen until both the abbreviation and profanity tags are applied. Every token is replaced by at most one tag: if a tag replacement has already been applied before part-of-speech tagging, the token is not replaced by a part-of-speech tag.

The tag replacement starts with replacing abbreviations with an abbreviation tag. Unless an abbreviation is an acronym of a proper noun, it is not recognized by the part-of-speech tagger. Moreover, the use of abbreviations might differ between genders, which is why an abbreviation tag is introduced. To be able to recognize abbreviations that are commonly used on social media, a dictionary consisting of abbreviations with their meanings is created. The abbreviations in this dictionary are based on the most popular social media abbreviations provided by Pinson (2018), Sehl (2019), and Lee (2020).

The other tag that tokens are replaced with is the profanity tag. According to studies, gender affects cursing frequency, with men cursing more frequently than women (Wang et al., 2014). Therefore, the general use of profanity is used as a feature for the classifiers. The 'general use of profanity' means that the classifier does not use the exact profane words, but only the fact that profanity is used: when profanity occurs and how often. To recognize such words, a list of common profane words is created. If a word from that list occurs in the data, it is replaced by the profanity tag.

The final tags that replace words are the part-of-speech tags. However, not every remaining word is replaced by a tag. For example, interjections are kept as words instead of tags, because according to Lukács (2018) there is a difference between the use of interjections by men and women. Other tokens, such as punctuation and pronouns, are kept as tokens as well and are not replaced by a tag. The tokens that are replaced consist of, inter alia, nouns, proper nouns, and verbs. These tokens provide information about the context of the text, but provide no useful information for this research as tokens. As tags, however, they can be important features, for example how many times they occur in a text, or where in a text they occur.

The last pre-processing step consists of modifying the gender labels. For instance, for the binary male classifier, both the female and non-binary labels need to be changed to one common label. The labels are changed to gender X, where X is the gender label corresponding to the classifier (in other words, X is 'M' when it pertains to the male binary classifier), and NOT X, which is the label used for the two other genders.

4.2 baseline models

To have a starting point to compare all results to, a baseline model is created for every classifier. The baseline model for both the Twitter and Reddit data uses a 'most frequent' approach. This means that the class (or in this case the gender label) that appears most often in the training data is assigned to all the data of the validation set. For the binary classification models this means that the 'NOT' class is always assigned.
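A minimal sketch of such a baseline, using scikit-learn's DummyClassifier; the toy labels are illustrative.

```python
# Sketch: a 'most frequent' baseline always predicts the majority training label.
from sklearn.dummy import DummyClassifier

y_train = ["NOT", "NOT", "NOT", "M", "M"]  # 'NOT' is the majority class
X_train = [[0]] * len(y_train)             # features are ignored by this strategy

baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print(baseline.predict([[0], [0]]))        # -> ['NOT' 'NOT']
```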

The 'most frequent' baseline models produce the results shown in Table 4.

Gender        Twitter   Reddit
Non-binary     0.7114   0.6780
Male           0.6489   0.6719
Female         0.6397   0.6501

Table 4: Baseline accuracy score for every classifier

The baseline results can also be found in the top rows of Table 29 and Table 30 in Appendix A.

4.3 feature selection

All feature testing is performed on two different classifiers for each platform and gender combination: a LinearSVC and a logistic regression classifier. In previous work which included non-binary data, SVMs and logistic regression were popular algorithms; the best result was obtained using a logistic regression system with tf-idf vectors (Wiegmann et al., 2019), which in the current research are used to create a new baseline. The choice for the two classifiers is explained more in depth in chapter 2.

For all possible features, both the total and the average number of occurrences are used. This is because multiple posts are concatenated during the processing stages, which means that the total number of occurrences of a feature in a text might give the classifier a skewed impression.

For all Twitter and Reddit classifiers the following features have been tested, with all of the features scaled to a similar range:

• Word n-grams: Several combinations of word n-grams are created by the tf-idf vectorizer. For the current research only word n-grams are used, since character n-grams do not provide much information due to the word replacement by tags. The best performing n-gram combination is henceforth used as the baseline with which other features are combined and to which the results are compared;

• Post length: The length of a post is measured on both the character and the token level. A token is counted as any sequence of characters between whitespaces;

• Number of emojis, newlines, and punctuation: Every occurrence of an emoji, newline character, or punctuation character is counted;

• Number of uppercase and lowercase letters: The uppercase characters and lowercase characters are counted individually. For example, both 'AAAaaa' and 'A a A a A a' are counted as 3 uppercase and 3 lowercase characters;

• The amount of repetition: A sequence of characters that includes a repetition of 3 or more of one of the 'aeiouy!?.' characters is counted as one sequence.

A feature selection process is performed to get the best-performing combination of features for every classifier. This is done both to reduce overfitting on the data by the classifiers and to increase the accuracy of the classifiers. As mentioned previously, the best-performing combination of word n-grams is used as the new baseline for feature selection. Each feature is individually combined with the n-grams. The individual results of the features can be found in Appendix A. All features that have a positive impact on the baseline accuracy are then tested again in different combinations. The best performing combinations of features are used in the Final models.
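A minimal sketch of how the n-grams and an extra feature can be combined into one feature matrix, assuming scikit-learn and SciPy; the example texts and the extra feature column are illustrative.

```python
# Sketch: TF-IDF word n-grams combined with a scaled hand-crafted count feature.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MaxAbsScaler
from scipy.sparse import hstack, csr_matrix
import numpy as np

texts = ["<ABBREV> I VERB DET NOUN !", "she / her VERB ADV ."]
newline_totals = np.array([[0], [2]])  # example extra feature: total newlines

vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
ngrams = vectorizer.fit_transform(texts)

extra = MaxAbsScaler().fit_transform(newline_totals)  # scale to a similar range
X = hstack([ngrams, csr_matrix(extra)])               # combined feature matrix
```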

4.4 final models

To obtain the best-performing classifiers, a grid search is used to tune the hyperparameters for both the LinearSVC and the logistic regression classifiers. The optimal hyperparameters, which are found by the grid search and result in the most accurate predictions, are used in the models. Between the LinearSVC and logistic regression classifiers, the model that results in the highest accuracy is chosen as the final model. The final models with their accuracy scores for the validation set can be found in Table 31 and Table 32 in Appendix B.
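A minimal sketch of such a search, assuming scikit-learn's GridSearchCV; the toy data stands in for the real feature matrices, and the C grid is illustrative (the final models in Appendix B use C values of 0.5, 1.0, and 5.0).

```python
# Sketch: grid-search C for both candidate classifiers and compare accuracies.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=50, random_state=0)
param_grid = {"C": [0.5, 1.0, 5.0]}

for name, clf in [("LinearSVC", LinearSVC()),
                  ("LogisticRegression", LogisticRegression(max_iter=1000))]:
    search = GridSearchCV(clf, param_grid, scoring="accuracy")
    search.fit(X, y)
    print(name, search.best_params_, search.best_score_)
```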

4.5 cross genre testing

After in-genre testing, such as using a classifier trained on Twitter data on Twitter data, cross-genre testing is performed: for instance, the classifier trained on Twitter data is used on Reddit data. Note that in this research the term 'genre' is used to indicate the different social media platforms, Twitter and Reddit.

There are two reasons to perform cross-genre gender prediction: one from the side of the classifiers and one from the side of the data. For the classifiers, cross-genre prediction is used to see how universal the classifiers are, that is, to test the portability of the models. The biggest reason, however, is from the side of the data: to test whether there are commonalities between the data sets from the two different platforms regarding the different genders and their writing styles, and thus whether the language that is used is typical for the genders on any platform, or only on the platform of the training data.

4.6 additional research

Following the gender prediction, a sample of the users that have been wrongly predicted is taken from each classifier. If a user is predicted wrongly, this most likely means that their writing style is similar to that of a different gender than the one the user identifies as. For this reason, each user in the sample is analyzed to determine whether there is a noticeable pattern among the wrongly predicted users.

With regards to Twitter users, the 'following' list of a user is analyzed, similarly to what Bamman et al. (2014) did in their research. The following list is used because the tweets written by the followed accounts appear on the timeline of the wrongly predicted user. Since those tweets are what the user will see and read, they might influence the writing style of the user.

For Reddit users, the kinds of subreddits a user participates in are analyzed. The subreddits a user participates in are often the subreddits they have joined. These subreddits are added to a user's front page 'feed', and, similarly to tweets on a Twitter timeline, the posts from those subreddits are what a user will see and read, and thus might be influenced by in their writing style.

5 R E S U L T S

For every classifier the same results are produced: a classification report which includes accuracy, precision, recall, and the F1-score, a confusion matrix, and the top contributing features. The top contributing features are the features which have the most effect on the gender that is predicted. The order of these features is based on their coefficients. The top contributing features for all classifiers can be found in Appendix C.

The combination of all the result types is used to get a general idea of how well the classifiers work and what factors contribute to the outcome of the classifiers.

5.1 non-binary

           precision   recall   f1-score
NB              0.67     0.51       0.58
NOT             0.82     0.90       0.86
accuracy                            0.79

Table 5: Twitter non-binary in-genre

             Predicted NB   Predicted NOT
Actual NB             554             522
Actual NOT            268            2383

Table 6: Twitter confusion matrix, non-binary in-genre

           precision   recall   f1-score
NB              0.54     0.02       0.04
NOT             0.71     0.99       0.83
accuracy                            0.71

Table 7: Twitter non-binary cross-genre

             Predicted NB   Predicted NOT
Actual NB              25            1051
Actual NOT             21            2630

Table 8: Twitter confusion matrix, non-binary cross-genre

           precision   recall   f1-score
NB              0.72     0.18       0.29
NOT             0.71     0.97       0.82
accuracy                            0.71

Table 9: Reddit non-binary in-genre

             Predicted NB   Predicted NOT
Actual NB             126             568
Actual NOT             49            1414

Table 10: Reddit confusion matrix, non-binary in-genre

           precision   recall   f1-score
NB              0.37     0.67       0.47
NOT             0.74     0.46       0.57
accuracy                            0.52

Table 11: Reddit non-binary cross-genre

             Predicted NB   Predicted NOT
Actual NB             463             231
Actual NOT            795             668

Table 12: Reddit confusion matrix, non-binary cross-genre

The accuracy of the Twitter in-genre non-binary model is 79% and the accuracy of the Twitter cross-genre model, which is trained on Reddit data, is 71%. Table 5 and Table 7 show the corresponding precision, recall, and F1-scores. The Reddit models have an accuracy of 71% in-genre and 52% cross-genre. Table 9 and Table 11 show the corresponding precision, recall, and F1-scores.

While both in-genre models perform better than the baseline models, the cross-genre models perform worse. The Twitter cross-genre model performs similarly to the Reddit in-genre model, but this is mostly due to the predictions of those who are not non-binary, which have a recall of 99%. With those predictions alone the baseline is already reached.


The Reddit cross-genre model performs quite a bit worse than the baseline, with a difference of 16 percentage points. This can be partly explained by looking at the most contributing features and comparing those from Twitter with those from Reddit. The only top feature from Twitter that is also in the top features of Reddit is 'trans', but where it is a positively related feature for the Twitter data, it is the opposite for the Reddit data. Consequently, users who use 'trans' in their posts have a bigger chance of being labeled as not non-binary by the Reddit model, but as non-binary by the Twitter model.

The top features, found in Appendix C, Table 33 and Table 36, are mostly expected. Features such as 'nonbinary', 'binary', 'non-', '-binary', 'they /', and '/ them' are directly related to the name of the gender and therefore expected.

There is a surprising contradiction between the top features of Twitter and those of Reddit. Twitter has both 'trans' and 'trans noun' as positively related to the non-binary gender, which means that if a text contains those features it is more likely to be labeled non-binary than not, but Reddit has 'trans' as negatively related, which means the complete opposite.

5.2 male

           precision   recall   f1-score
M               0.64     0.42       0.51
NOT             0.74     0.87       0.80
accuracy                            0.72

Table 13: Twitter male in-genre

             Predicted M   Predicted NOT
Actual M             556             753
Actual NOT           307            2111

Table 14: Twitter confusion matrix, male in-genre

           precision   recall   f1-score
M               0.46     0.25       0.32
NOT             0.67     0.84       0.75
accuracy                            0.63

Table 15: Twitter male cross-genre

             Predicted M   Predicted NOT
Actual M             328             981
Actual NOT           389            2029

Table 16: Twitter confusion matrix, male cross-genre

           precision   recall   f1-score
M               0.62     0.31       0.42
NOT             0.73     0.91       0.81
accuracy                            0.71

Table 17: Reddit male in-genre

             Predicted M   Predicted NOT
Actual M             222             486
Actual NOT           136            1313

Table 18: Reddit confusion matrix, male in-genre

           precision   recall   f1-score
M               0.48     0.24       0.32
NOT             0.70     0.88       0.78
accuracy                            0.67

Table 19: Reddit male cross-genre

             Predicted M   Predicted NOT
Actual M             167             541
Actual NOT           180            1269

Table 20: Reddit confusion matrix, male cross-genre

The accuracy of the Twitter in-genre male model is 72% and the accuracy of the Twitter cross-genre model, which is trained on Reddit data, is 63%. Table 13 and Table 15 show the corresponding precision, recall, and F1-scores. The Reddit models have an accuracy of 71% in-genre and 67% cross-genre. Table 17 and Table 19 show the corresponding precision, recall, and F1-scores.

Similarly to the non-binary classifiers, while both in-genre models perform better than the baseline models, the cross-genre models do not. The Twitter cross-genre model performs worse than the baseline by approximately 2 percentage points. This is because the prediction of users as male is quite poor, and the prediction of non-male is worse than the baseline. The Reddit cross-genre model performs similarly to the baseline, both having an accuracy of approximately 67%.

The top features, found in Appendix C, Table 34 and Table 37, are not all obviously connected to the gender. The top features for Twitter contain only two obviously related features: ♀, which has a negative relation to the gender, and ♂, which is positively related.

The Reddit top features, on the other hand, are more in line with expectations. Positively related features contain, for example, '/ him' and 'he /', which are directly related to 'male'. Negatively related features contain '/ her', 'she /', and 'they /', which can be seen as direct opposites of 'male'.

One surprising top feature for Twitter is '&', which is negatively related. Since it is at the top of the list, it is safe to assume that an ampersand is not something that is often used by males on Twitter.

5.3 female

           precision   recall   f1-score
F               0.69     0.60       0.64
NOT             0.79     0.85       0.82
accuracy                            0.76

Table 21: Twitter female in-genre

             Predicted F   Predicted NOT
Actual F             800             542
Actual NOT           359            2026

Table 22: Twitter confusion matrix, female in-genre

           precision   recall   f1-score
F               0.39     0.30       0.34
NOT             0.65     0.74       0.69
accuracy                            0.58

Table 23: Twitter female cross-genre

             Predicted F   Predicted NOT
Actual F             404             938
Actual NOT           632            1753

Table 24: Twitter confusion matrix, female cross-genre

           precision   recall   f1-score
F               0.67     0.48       0.56
NOT             0.76     0.87       0.81
accuracy                            0.74

Table 25: Reddit female in-genre

             Predicted F   Predicted NOT
Actual F             363             392
Actual NOT           179            1223

Table 26: Reddit confusion matrix, female in-genre

           precision   recall   f1-score
F               0.78     0.07       0.13
NOT             0.66     0.99       0.79
accuracy                            0.67

Table 27: Reddit female cross-genre

             Predicted F   Predicted NOT
Actual F              52             703
Actual NOT            15            1387

Table 28: Reddit confusion matrix, female cross-genre

The accuracy of the Twitter in-genre female model is 76% and the accuracy of the Twitter cross-genre model, which is trained on Reddit data, is 58%. Table 21 and Table 23 show the corresponding precision, recall, and F1-scores. The Reddit models have an accuracy of 74% in-genre and 67% cross-genre. Table 25 and Table 27 show the corresponding precision, recall, and F1-scores.

Similarly to the previous classifiers, while both in-genre models perform better than the baseline models, the Twitter cross-genre model does not. The Twitter cross-genre model performs worse than the baseline by approximately 6 percentage points. This is because the prediction of users as female is quite poor, similarly to the male cross-genre models, and the prediction of non-female is worse than the baseline.


The Reddit cross-genre model does outperform the baseline (which is 65%) with its accuracy of 67%. However, this accuracy is similar to the accuracy of the baseline model which uses n-grams.

The top features, found in Appendix C, Table 35 and Table 38, are not all obviously connected to the gender. The top features for Twitter contain only two obviously related features, which both contain ♀. One of the negatively related features is the use of profanity. This is somewhat expected due to a study about cursing on Twitter by Wang et al. (2014), mentioned before in chapter 4, Pre-processing: Wang et al. (2014) state that men curse more frequently than women, which might explain the negative relation to profanity in the female classifier. The majority of the Reddit top features is as expected. Features related to female pronouns are positively related, while pronouns related to both the male and the non-binary gender are negatively related to the female gender. One surprising aspect of the Reddit top features is the top two features, which are both related to the use of newline characters in text. The fact that the use of newlines is positively related to men might be an explanation for this.

5.4 additional results

As explained in chapter 4, Additional research, further research is done on a sample of wrongly predicted users for every classifier, to see whether there is a noticeable pattern. For Twitter this mostly means looking at the following list of a user and seeing if there is a pattern, for example whether a female who is predicted male follows mostly males. For Reddit this means looking at the subreddits a user participates in, or has joined, since some subreddits are seen as typically male, or typically female. Analyzing the 'following' lists of the wrongly predicted Twitter users reveals only one trend. This trend can be found primarily among those users that identify as male. These users are wrongly predicted mainly as non-binary and occasionally as female. The common theme among these users is that almost all follow a large number of LGBTQ+ and drag queen related accounts.

Analyzing the subreddits in which wrongly predicted Reddit users participate, there is one outstanding trend. Both non-binary users and male users that participate primarily in gaming subreddits are classified as a gender other than their own. Looking at the female classifier reveals that those who are not female are often classified as such when they primarily participate in gaming and meme subreddits. This is quite remarkable, since especially the gaming subreddits are seen as typically male related subreddits.

For both the Twitter analysis and the Reddit analysis these are the only clear patterns that stand out. Beyond that, Twitter users seem to follow random users, or at least users without a clear link to why a user is wrongly predicted. Among Reddit users there also seems to be no other pattern in the subreddits they participate in.

6 C O N C L U S I O N

The research aimed to answer the following research question: Does gender influ-ence the language and writing style that is used on social media? Furthermore, this research aimed to answer a number of sub-questions: (1) How universal is the effect of gender on language? (2) Is there a noticeable pattern among wrongly predicted users?

The results in chapter 5 show that even though there are some features that contribute universally to the prediction of gender, the majority of the features do not. It is interesting to note that it is only with non-binary classification that a model trained on Reddit data and tested on Twitter data outperforms the model trained on Twitter data and tested on Reddit data; for male and female classification the opposite is true.

As stated in the Additional results, for both Twitter and Reddit there is one pattern that stands out among the wrongly predicted users. For Twitter this pattern consists of the 'following' list including a large number of LGBTQ+ and drag queen related accounts; it applies primarily to wrongly predicted men, who are mainly predicted as non-binary. For Reddit the pattern consists of both male and non-binary users participating in subreddits concerning gaming and memes. These subreddits are seen as typically male related, but the wrongly predicted users are primarily labeled as female.

Taking into account all the results in chapter 5 and Appendix C, it can be concluded that gender influences the language and writing styles used on social media to a certain extent. The influence of gender is made especially clear in the top contributing features for every gender and platform. The non-binary Twitter classifier has a number of typically non-binary related words as its top features. Even though this is not necessarily the case for the male and female classifiers, these still contain elements such as the male (♂) and female (♀) symbols, as well as profanity use, which can be linked to gender.

The influence of gender, and especially of gender related words, is evident in the top features of the Reddit classifiers. The majority of the top features can be classified as pronouns, and particularly in the case of the non-binary classifier there are features that are clearly related to the gender, such as 'binary'.

However, the influence of gender is limited and there are other factors that play a role in the language and writing style of users on social media. One of the factors that has been made clear from the results of the cross-genre classifications and the top features for the classifiers, is the social media platform itself. Whether a user creates a post on Twitter, or on Reddit does affect their language use.

6.1 discussion

There is a reasonable chance that the models, or their accuracy, would improve when context words are used as features instead of only their corresponding tags. However, that is not where the interest of this research lies. This research is based more on writing style and word order than on topics and context words.

There are a number of limitations to the current work. The first is a limitation of the data. To get a proper amount of text for predictions, it was necessary to combine multiple tweets for Twitter, and multiple posts and comments for Reddit. This might give a skewed view of the writing style of people, since these concatenated pieces of text are now seen as a single piece of text.

Second, the method by which the data is split into training, validation, and test sets might not be the best method to use. As a result of the method that is used, the same user can occur in all three sets. While not proven, or looked into, this may cause some bias in the classifiers (even though usernames are not used for the purpose of gender prediction).

Third, one of the data processing methods used is separating the emojis, with the main purpose of counting all emojis as separate tokens. The problem with this method is that emoji modifier sequences are separated as well. As a result, things such as skin colour are separated from the emoji the skin colour belongs to. This might give a wrong impression of both the number of emojis and the kind of emojis that are used.

Finally, regarding the Additional research: the small portion of the research that focuses on explaining why a subset of users is predicted incorrectly is based on impressions of which users Twitter users follow and which subreddits a Reddit user participates in, rather than on actual numbers.

6.2 future work

First, as mentioned in the Discussion above, the research into why users have been incorrectly predicted is based on impressions rather than numbers. In future work, more in-depth, quantitative research may shed a clearer light on the reasons for the wrong predictions.

Second, the original method of collecting Twitter data was different from the method that has actually been used for the current research, as explained in chapter 3. The original plan was to scrape random users from Twitter and then look at their username and bio for a self-proclaimed gender or pronouns. This method turned out to be quite unrealistic given the time available. In future work it might be an idea to try this method, since it ensures a more random group of people than what the results of a search query provide.

Third, as mentioned in the Discussion above, users can occur in more than one of the training, validation, and test sets. In future work it might be an idea to make sure that users occur in only one of the three sets.

Finally, it might be interesting for future work to test text similarity. This idea was originally dismissed for the current research, because the interest was not in the topics or specific words that are used, which might be similar to those of other users. However, with words replaced by their part-of-speech tags the texts are more generalized, and it might be interesting to use the similarity between texts. For example, maybe texts by non-binary users are very similar to each other, which would help the classifier identify non-binary users more easily.


B I B L I O G R A P H Y

Bamman, D., J. Eisenstein, and T. Schnoebelen (2014). Gender identity and lexical variation in social media. Journal of Sociolinguistics 18(2), 135–160.

Darics, E. (2013). Non-verbal signalling in digital discourse: The case of letter repetition. Discourse, Context & Media 2(3), 141–148.

Krüger, S. and B. Hermann (2019). Can an online service predict gender? On the state-of-the-art in gender identification from texts. In 2019 IEEE/ACM 2nd International Workshop on Gender Equality in Software Engineering (GE), pp. 13–16. IEEE.

Lee, K. (2020, Jun). The ultimate list of social media acronyms [ELI5, FTW]. https://buffer.com/library/social-media-acronyms-abbreviations/#index.

Lukács, B. (2018). One-word interjections as discourse markers in female and male speakers' academic talk: A case study based on the Michigan Corpus of Academic English.

Pinson, L. (2018, Nov). 101 social media acronyms and abbreviations (and what they mean!). https://stylecaster.com/social-media-acronyms-abbreviations-what-they-mean/.

Sehl, K. (2019, Apr). 101 social media acronyms and abbreviations for marketers. https://blog.hootsuite.com/social-media-acronyms-marketers-know/.

Wang, W., L. Chen, K. Thirunarayan, and A. P. Sheth (2014). Cursing in English on Twitter. In Proceedings of the 17th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp. 415–425.

Wiegmann, M., B. Stein, and M. Potthast (2019, September). Overview of the Celebrity Profiling Task at PAN 2019. In L. Cappellato, N. Ferro, D. Losada, and H. Müller (Eds.), CLEF 2019 Labs and Workshops, Notebook Papers. CEUR-WS.org.

A F E A T U R E S

Most-frequent baseline (validation accuracy): Non-Binary 0.7114, Male 0.6489, Female 0.6397

Feature                Twitter Non-Binary     Twitter Male           Twitter Female
                       LinSVC    Log Reg      LinSVC    Log Reg      LinSVC    Log Reg
Baseline (TFIDF 1,1)   0.7651    0.7621       0.6964    0.6886       0.7262    0.7205
TFIDF 1,2              0.7758    0.7742       0.7015    0.7122       0.7337    0.7326
TFIDF 1,3              0.7742    0.7718       0.7055    0.7117       0.7356    0.7399
Characters total       0.7745    0.7742       0.7063    0.7117       0.7356    0.7404
Characters avg         0.7742    0.7742       0.7063    0.7117       0.7350    0.7399
Length total           0.7745    0.7761       0.7055    0.7093       0.7361    0.7401
Length average         0.7742    0.7750       0.7058    0.7085       0.7356    0.7399
Emojis total           0.7756    0.7748       0.7050    0.7114       0.7356    0.7399
Emojis average         0.7756    0.7748       0.7050    0.7114       0.7358    0.7401
Uppercase total        0.7761    0.7769       0.7071    0.7098       0.7364    0.7399
Uppercase average      0.7758    0.7772       0.7071    0.7098       0.7364    0.7396
Lowercase total        0.7753    0.7748       0.7058    0.7122       0.7358    0.7404
Lowercase average      0.7740    0.7732       0.7060    0.7128       0.7358    0.7407
Punctuation total      0.7761    0.7748       0.7055    0.7111       0.7361    0.7401
Punctuation average    0.7753    0.7742       0.7060    0.7117       0.7358    0.7401
Newline total          0.7783    0.7758       0.7090    0.7133       0.7434    0.7417
Newline average        0.7772    0.7758       0.7042    0.7136       0.7345    0.7415
Repetition total       0.7758    0.7734       0.7055    0.7114       0.7369    0.7391
Repetition average     0.7758    0.7734       0.7017    0.7117       0.7323    0.7315

Table 29: Accuracy of the baseline and all features using LinearSVC and logistic regression classifiers on the Twitter validation set

Most-frequent baseline (validation accuracy): Non-Binary 0.6780, Male 0.6719, Female 0.6501

Feature                Reddit Non-Binary      Reddit Male            Reddit Female
                       LinSVC    Log Reg      LinSVC    Log Reg      LinSVC    Log Reg
Baseline (TFIDF 1,1)   0.6715    0.6756       0.6701    0.6742       0.6594    0.6566
TFIDF 1,2              0.6821    0.6961       0.6821    0.6798       0.6645    0.6687
TFIDF 1,3              0.6817    0.6942       0.6821    0.6863       0.6729    0.6715
Characters total       0.6849    0.6970       0.6826    0.6863       0.6724    0.6729
Characters avg         0.6849    0.6965       0.6821    0.6863       0.6729    0.6729
Length total           0.6849    0.6965       0.6826    0.6863       0.6719    0.6729
Length average         0.6845    0.6965       0.6821    0.6858       0.6719    0.6729
Emojis total           0.6821    0.6956       0.6821    0.6872       0.6729    0.6715
Emojis average         0.6821    0.6956       0.6821    0.6872       0.6729    0.6719
Uppercase total        0.6835    0.6961       0.6826    0.6863       0.6729    0.6719
Uppercase average      0.6835    0.6961       0.6826    0.6863       0.6729    0.6719
Lowercase total        0.6840    0.6970       0.6821    0.6863       0.6719    0.6729
Lowercase average      0.6835    0.6965       0.6826    0.6863       0.6719    0.6729
Punctuation total      0.6835    0.6956       0.6826    0.6863       0.6719    0.6729
Punctuation average    0.6835    0.6956       0.6821    0.6858       0.6715    0.6733
Newline total          0.6845    0.6951       0.6961    0.6882       0.7234    0.6914
Newline average        0.6840    0.6956       0.6979    0.6858       0.7429    0.6835
Repetition total       0.6812    0.6961       0.6826    0.6877       0.6719    0.6729
Repetition average     0.6812    0.6961       0.6826    0.6807       0.6654    0.6701

Table 30: Accuracy of the baseline and all features using LinearSVC and logistic regression classifiers on the Reddit validation set

B F I N A L M O D E L S

Gender       Classifier           C     Features                                             Val acc
Non-binary   LinearSVC            1.0   1,2 n-grams, total punctuation & newlines            0.7868
Male         Logistic Regression  5.0   1,2 n-grams, total & average newlines                0.7130
Female       Logistic Regression  5.0   1-3 n-grams, total newlines, emojis & punctuation    0.7431

Table 31: Final classification models used for Twitter

Gender       Classifier           C     Features                                             Val acc
Non-binary   Logistic Regression  1.0   1,2 n-grams, total length in characters              0.7151
Male         LinearSVC            0.5   1,2 n-grams, average newlines, total repetition      0.7118
Female       LinearSVC            1.0   1-3 n-grams, total & average newlines                0.7527

Table 32: Final classification models used for Reddit

C T O P F E A T U R E S

c.1 twitter

Feature           Coef
:d                3.242692
! <hashtag>      -2.582928
nonbinary         2.194955
out here          1.954505
trans             1.896555
here for          1.857288
abbrev            1.770052
<user> !         -1.736407
trans noun        1.733283
they /            1.717704

Table 33: Twitter NB top features

Feature           Coef
&                -5.661146
♀                -4.761105
. <user>         -4.441345
@ propn           4.341125
. <link>          4.192333
“ <user>          3.764484
! <hashtag>       3.730395
♂                 3.721109
<link> verb      -3.688682
, ,               3.655729

Table 34: Twitter male top features

Feature             Coef
profanity          -6.796087
.                   6.175827
<hashtag>           5.934581
♀                   4.642095
<user><hashtag>     4.498534
<user>              4.341125
!                   4.156563
&                   4.045302
♀                   3.985470
<hashtag><link>     3.835130

Table 35: Twitter female top features

c.2 reddit

Feature           Coef
they/             3.378002
/them             3.357326
binary            3.080699
!                 2.433277
non               2.360451
-binary           2.319083
or                1.926152
this noun        -1.900544
trans            -1.861915
♥                 1.733167

Table 36: Reddit NB top features

Feature           Coef
newavg            6.746484
/ him             2.656289
he /              2.554627
/ her            -1.336064
♥                 1.315096
propn .           1.305317
she /            -1.269831
they /           -1.265309
verb yourself    -1.252048
verb its          1.251203

Table 37: Reddit male top features

Feature           Coef
newtot          -13.184200
newavg          -12.985228
she /             2.118824
...               2.043446
/ her             1.999258
she / her         1.818652
/ them           -1.602672
he /             -1.570829
noun ...          1.541830
they / them      -1.539389

Table 38: Reddit female top features
