A more qualitative approach to personality profiling on Twitter

SUBMITTED IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE

Frank Houweling

10199969

MASTER INFORMATION STUDIES

HUMAN-CENTERED MULTIMEDIA

FACULTY OF SCIENCE

UNIVERSITY OF AMSTERDAM

June 20, 2015

1st Supervisor: dr. Maarten Marx

2nd Supervisor: MSc. Christophe van Gysel

[Master Thesis]

Frank Houweling

University of Amsterdam Master Information Studies Human Centered Multimedia

frank.houweling@student.uva.nl

ABSTRACT

In different fields, interest is growing in author profiling: determining demographic features of an author. Complex features without a ground truth, like (perceived) personality, require a different approach than traditional author profiling. In this research, a gold standard constructed using the personality test by Rammstedt and John (2007) is compared with a new author profiling method. This method consists of an analysis of all information available about a Twitter profile, using measures that are based on personality characteristics found in existing research. While the personalities that result from this method differ greatly from the gold standard, the method performs almost equally well in a secondary human evaluation, suggesting that future work is necessary to determine whether there is such a thing as an objective or mean personality.

General Terms

Author Profiling, Text Analysis, Personality, Psychometrics, Twitter

1. INTRODUCTION

In different fields the interest in author profiling is growing. Author profiling is an application of text analysis whose goal is to predict, given a piece of text written by a person, different descriptive features of this person (the author). These features can then be used to build a demographic description of the author (an author profile), which promises many possibilities for business intelligence, criminal law, computer forensics and more.

Previous author profiling research focused on predicting straightforward features like age and gender. Recently, however, a growing amount of research addresses more complex features like author personality.

Personality is an interesting descriptive feature, which opens new possibilities to apply author profiling technology. One example is politics, where party leaders often try to portray a certain type of personality to the voters in order to become likable candidates [5]. With author profiling systems, the process of determining personality can potentially turn out to be faster, easier and more reliable than existing (manual) methods.

Existing research on author profiling is characterized by a rather uniform methodology. A supervised classifier is trained using a ground truth data set of authors' personalities and a broad range of mostly textual features. What sets our approach apart from previous research is a strongly reduced number of features, each grounded in existing research. Because the features are based on observations in previous research, they have the potential to perform better than normal textual features. In addition, they have the potential for stronger external validity.

During this research the political case described earlier is used as the research setting, and the politicians' Twitter accounts as the available data source. During the design of the author profiling system, the following main question is posed: (1) "Can we develop a system that effectively determines perceived personality based on all information available about a Twitter profile via the API, by using personality characteristics found in existing research?".

In this paper, a literature review is first conducted to answer the following subquestions: How can personality be represented?, What are proven methods to determine personality?, Which characteristics can be used to describe personality? and How can we evaluate such a method of personality determination?

Based on this theoretical background, an experiment is designed in which an author profiling system is evaluated. This author profiling system is based on characteristics found in the literature review and the data available via the Twitter API. Finally, using the results of this experiment we will try to answer the main research question.

2. THEORETICAL BACKGROUND

The main focus of our author profiling system is to determine personality. But personality is not as precisely delimited a concept as might be expected. The meaning of the term personality is very much dependent on the context in which it is used. In general, there are three main types of personality that are related but not necessarily always equal.

2.1 Different types of personality

The first type of personality is the personality a subject assigns to him or herself, also called self-rated personality [16]. This is the type of personality that is most used in research, and it is often gathered with so-called personality markers. These are statements about a person's personality with which the person can agree or disagree on a scale.

The second type of personality is the type of personality an expert assigns to a person via observation, also called peer-reported personality. During an observation, the expert looks at the person's behavior and tries to find specific actions that in existing literature are linked to specific personality traits.

The third and last type of personality is the type of personality people around a person assign to him or her, also called peer-rated personality [16]. Just like with self-rated personality, peer-rated personality is often determined using personality markers. In the case of peer-rated personality, these statements are not rated by the person him or herself, but by people who know the target person.

With author profiling, the goal is to determine one of these types of personality by analyzing the media created by the person. In the political case introduced in this research, peer-rated personality is the most interesting type, because it represents how voters see the politician.

2.2 Representing personality with the big five

There are two main methods to represent personality. The first method is the description of a person's personality in free text, without any limitations on what terms can be used. This is a very extensive method that enables a researcher to give in-depth insights into the person's personality. But this method is not desirable for many cases like the one used in this research, as it is not well suited for comparison between users or for calculations. Because of that, multiple quantitative methods have been developed that are usable for calculations and comparison.

The big five is an example of such a quantitative representation, which is very often applied in both psychometric and author profiling settings [19, 28]. With the big five method, a person receives a score (often between 0 and 5) on five major personality traits, which together should give a good overall view of his or her personality. The five personality traits are openness to experience, conscientiousness, extraversion, agreeableness and neuroticism.

The big five representation is the standard method to describe personality, as it was found to work consistently across multiple ages, cultures and determination methods [8]. Radar plots are an effective way of visualizing a big five representation [25]. An example of such a radar plot representing two different personalities can be seen in figure 2.

2.3 Determining personality

To find these five scores that make up a big five representation, different methods can be used.

2.3.1 Computer science: Author Profiling

In the introduction, the methodology used in previous author profiling research was briefly discussed. As described, determining personality is seen as a normal author profiling task, just like age and gender.

Most existing author profiling research is performed to be submitted to the PAN congress author profiling task [28]. For this task, a training data set is given with authors, demographic features of these authors, and their tweets. There is only a limited set of metadata available, because the real Twitter account is obfuscated for privacy reasons.

This author profiling task focuses on determining several demographic features at the same time. To limit the complexity of the system, the same approach is used to predict all of these demographic features.

The common approach in author profiling consists of calculating a broad range of features from the author's set of tweets. Common types of features are: textual / content features (N-grams, TF-IDF, slang words, swear words, LIWC features [14]), stylistic features (punctuation marks), based features (the amount of tweets in a specific time frame) and social network features (network size, density [28, 13]). In most research, textual features are by far the most important.

After the calculation of these features, one classifier per to-be-determined demographic feature, often an SVM, is trained [28]. Because the big five representation consists of five values, five separate SVM classifiers are trained. One successful extension to this method is the second order attributes (SOA) method described by López-Monroy et al. [22]. This method does use an SVM, but does not restrict itself to the original classes for classification. Instead, subclasses are generated for each class using the training data and K-Means clustering. This results in more classes, which are then used in the classification.

2.3.2 Psychometrics: Personality tests

The field of psychometrics already had extensive experience with measuring personalities, long before anyone in the field of computer science tried to do the same. In psychometrics, a broad range of personality tests has been developed. Such tests often work with a set of descriptive sentences [11] or single adjectives [15] (also called markers), which are displayed to the rater. The rater can then agree or disagree with these descriptive sentences or terms using a Likert scale. When the rater agrees, he thinks the description suits the person well. These tests differ greatly in the time that is needed to complete them. Some tests based on descriptive sentences can take up to 40 minutes to complete [11], whereas the short set of markers by Rammstedt (2007) takes only a few minutes [27].

2.3.3 Psychology: observation and characteristics of personality traits

The questions that make up such a personality test are often founded in psychological research. In such research, observations are used to find new traits and characteristics that make up the big five.

There is a significant amount of research about the different personality characteristics that describe personality traits well. When a person is highly associated with these characteristics, the person also scores high for the corresponding personality trait and vice versa.

Because Twitter author profiling provides new ways of determining a personality, it is a good idea to look at all characteristics that are applied to describe a personality trait, instead of only focusing on the ones often found in personality tests. It might be the case that some of these characteristics are hard to measure in a test, but can be effectively applied in a Twitter-based author profiling system.

The following characteristics are found to describe specific personality traits well, and are applicable to Twitter data. These characteristics form the foundation of the Twitter analysis system that is introduced in this research.

Openness to experience

Openness to experience can be summarized into six dimensions, including imagination, aesthetic sensitivity, attentiveness to inner feelings, preference for variety and intellectual curiosity [9].

Butler (2000) notes that people who are more open to experiences are in general more open to different cultures and lifestyles [4]. McCrae (1996) describes that it is likely that this openness to different cultures and preference for variety can be seen in the friend network of the Twitter account of a person, but does not elaborate on how it can be seen [24]. It is however a reasonable hypothesis that a user who is open to different lifestyles and opinions is more likely to follow a diverse group of people.

Extraversion

Thompson (2008) described extraversion with terms like talkative, outgoing and energetic behavior, and introversion (the inverse) with quiet, reserved and shy [30]. In short, extraverts enjoy social interactions. This can be seen on social media, where people who are extraverted have a significantly larger group of friends and contacts [1].

Agreeableness

Agreeableness is a personality trait that is most often described with characteristics such as sympathetic, kind, warm and considerate [30]. People are often found to be sympathetic or kind because of the way they communicate, and in particular because of non-verbal cues. Which cues are considered polite and nice is very culturally dependent [20].

On Twitter, there are of course no real non-verbal cues. Instead of non-verbal cues, people on Twitter often use emoticons to support the transfer of emotions [21], or make emotions explicit by writing them down [17].

Neuroticism

Neuroticism is often described in terms of emotional stability: it describes how frequently a person has to cope with negative emotions like depression, anxiety and guilt [18]. Not only does a neurotic person experience more anger, he is also more likely to express this anger [23]. Someone who scores low on neuroticism is more emotionally stable and calm, and will experience less stress.

Conscientiousness

Conscientiousness is often described as the personality trait of being thorough and careful [30]. People who score high on conscientiousness have a desire to do a task well. They spend more time on perfecting a task, are more careful and have more regret about the things they have done wrong. On social media, they often put more time into what they post [29].

2.4 Evaluation

2.4.1 Evaluation in Author Profiling

For evaluating author profiling systems, the same approach is used for systems that determine personality as for systems that determine age and gender. The results from the author profiling system are compared with a ground truth, which is a set of results that were independently checked. The accuracy of the author profiling system is then used as the measure to evaluate a system and to perform between-system comparisons [28]. An author profiling system that has a higher accuracy on the test set than another author profiling system is thus considered better.

There are two reasons why this approach does not fit personality very well. Firstly, when using a boolean measure for correct and incorrect predictions by the author profiling system, we ignore the difference between completely wrong and only slightly off. Take the case of an author who scores 5 out of 5 on a specific big five dimension (ground truth), and two author profiling systems that give 1 and 4 as their respective results. Both are equally wrong according to the accuracy measure, whereas it would be better to use a measure that takes into account which one is more wrong. This can be fixed relatively easily by using the mean squared error.

Secondly, creating a ground truth for personality is not as straightforward as for age or gender. In the case of Verhoeven (2014), a normal big five personality test is used to determine this ground truth [32]. When evaluating a new method by comparing it to such a ground truth, we should keep in mind that there is no objective ground truth personality for any author. Every personality measure that is defined can be successfully argued to be better than the old one, simply because the definitions of the big five dimensions are not watertight [3]. Because of that, the ground truth can better be called a gold standard in the case of personality.
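The first objection can be made concrete with a minimal sketch. It uses the example from the text (a gold score of 5 and two systems answering 1 and 4); the function names are illustrative and not taken from any existing author profiling system.

```python
def accuracy(predicted, gold):
    """Fraction of predictions that exactly match the gold score."""
    hits = sum(1 for p, g in zip(predicted, gold) if p == g)
    return hits / len(gold)


def mean_squared_error(predicted, gold):
    """Average squared distance between predictions and gold scores."""
    return sum((p - g) ** 2 for p, g in zip(predicted, gold)) / len(gold)


gold = [5]      # the author scores 5 out of 5 on one big five dimension
system_a = [1]  # far off
system_b = [4]  # only slightly off

# Accuracy cannot tell the two systems apart: both are simply "wrong".
print(accuracy(system_a, gold), accuracy(system_b, gold))
# The mean squared error penalizes system A far more than system B.
print(mean_squared_error(system_a, gold), mean_squared_error(system_b, gold))
```

Under accuracy both systems score zero, while the squared error separates them by the size of their mistake.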

The result of this change for the evaluation is that it cannot simply be said that when a system gives results that are not equal to the gold standard, these results are wrong. This comparison can only be used to see if the results are comparable to the gold standard personality test; a different result can be either worse or better. To make the distinction between these two, some kind of secondary evaluation should be performed.

2.4.2 Evaluation in psychometrics

The difficulties of evaluating and validating personality of course also have implications in psychometrics. In psychometrics, a few personality tests are generally seen as well-validated methods of determining personality. These methods are then used as a 'gold standard', the most accurate test currently available. One of the personality tests which is often argued to be the best available is the Revised NEO Personality Inventory (NEO PI-R) [7].

NEO PI-R is seen as very reliable because of the extensive amount of research that went into validating it. Because there is no ground truth, NEO PI-R and other personality tests are often validated on how useful they are in research. NEO PI-R was found to be useful in research, where it scored well on internal consistency and retest reliability [7]. It also scored high on criterion validity [6], a measure specific to psychometrics which expresses to what extent the outcomes of the personality test relate to a behavioral criterion that is generally agreed upon by research [26].

Newer personality tests are evaluated by their agreement with a personality test that is already thoroughly validated. For example, the markers described by Rammstedt (2007), which are used in this research, are validated by the correlation between the results of the new test and the results of the proven NEO PI-R test [27].

Author profiling research that is focused on measuring personality should base its evaluation on this method: it should take into account that there is no objective personality measurement, and focus on how well a measurement fits its purpose. In this research, the purpose is to measure the public's perception of one's personality, and this perception should thus, next to the gold standard comparison, in some way be evaluated by the public.

3. METHOD

As described in the introduction, the goal of this research is determining the personality of specific politicians by analyzing their Twitter account using an author profiling system. To be able to evaluate the personality scores that result from that analysis, a gold standard has to be constructed against which the values can be compared.

Peer rating is currently the most frequently used method of retrieving personality, and because of that it is used to set the gold standard of personality scores.

3.1 Politician selection

Before these peer ratings were gathered, politicians had to be selected who were well-known, public figures. Otherwise, it would have been significantly harder to find enough raters who would be able to judge them.

[Figure 1: bar chart of the number of politicians (# of politicians) per party: VVD, CDA, D66, GroenLinks, PvdA, SP, PVV and Other parties.]

Figure 1: Distribution of selected politicians over Dutch political parties (see footnote 1).

The first selected politicians were politicians in leading and public roles. They are often in the media and as a result are very well known by the general public. This resulted in a slight over-representation of the ruling parties and ex-ruling parties, which provided the current and previous ministers. After these politicians, the party leaders of all major political parties were also added.

Parties that were very active on social media, like GroenLinks, were somewhat over-represented, because it was relatively easy to find politicians with an active Twitter account. Less often represented in the politician selection were parties like the SP and 50PLUS, which were generally more focused on older generations and less active on social media. PVV was also underrepresented compared to the party's size, because of the lack of well-known politicians that represent the party.

To compensate for some of the over- or under-representation, political figures who no longer carry out a public role but are still well known were added. Examples of these politicians are Boris van der Ham and Erica Terpstra. The final division of selected politicians over parties can be seen in figure 1.

3.2 Gold standard peer rating

As a gold standard to compare our system with, peer raters were asked to rate the personalities of 35 politicians. These ratings consisted of peer-rated markers, for which the raters had to indicate to what extent they agreed with them. To make sure that many ratings could be gathered in a relatively short time, the short markers by Rammstedt (2007) were used [27]. The markers were translated from German and English to Dutch, for the ease of the raters. The translations are displayed in table 1.

Footnote 1: 50PLUS, SGP, PvdD, Denk and ChristenUnie (one politician each) are combined in other parties.

The raters consisted of a random group of people who were asked to fill in an online questionnaire.

Not every rater had to rate all politicians. For each rater, a random selection of three politicians was made. Whenever a person indicated not knowing one of the politicians, another random one was given for the participant to rate. At the end of the survey, general demographics (age, gender, educational level, political involvement and political preference) of the rater were gathered.

3.3 Data selection and measures

The Twitter analysis consists of accessing the data available on the politicians' Twitter accounts, and using the characteristics described in the theoretical background to find measures to transform this data into meaningful personality scores.

Because Twitter is such an open system, a wide range of data is available using the public API. The data is divided between two main objects: users and tweets, which contain a broad range of attributes. Other objects (such as places and entities) were not used in this research. The full list of available attributes for both the user and the tweet objects can be found in the Twitter API Documentation [31].

It should be kept in mind that Twitter limits the number of API requests per system. These limits differ per type of request, and retrieving some types of data requires a lot of time. For this reason, measures that depended on Twitter following graphs were skipped. There was simply not enough time to retrieve all required data.

The characteristics found in the literature review were applied to the data available on Twitter, resulting in the following measures:

3.3.1 Openness to experience

As described in the theoretical background, people who score high for openness to experience are generally more open to different opinions, cultures and lifestyles and also often have a preference for variety in the people they talk to and the media they consume.

As a consequence of this, a person who has a preference for media and intellectual variety might also have a preference for more variety in the people they follow on Twitter. One way we can measure this variety in the accounts a user follows is by using metadata publicly available on Twitter. On Twitter, users are asked to write a short summary about their interests and occupation. It is a Twitter custom to fill this with either a short description, or a comma-separated list of keywords that describe these topics. This information, in combination with simple text analysis, can be used to describe the biography interest similarity of two users. The average biography interest similarity is calculated as follows:

\[
\frac{\sum_{f \in F_U} \mathrm{COSSIM}(B_f, B_U)}{|F_U|} \tag{1}
\]

where

• U is the target Twitter account

• F_U is the list of Twitter accounts that a given Twitter account follows

• and B_U is the biography feature vector for a given user account

The biography feature vector simply consists of all words that occur in the biography together with their relative frequencies. Words were split on spaces and punctuation characters.
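Equation (1) and the feature vector described above can be sketched as follows. This is a minimal illustration, not the implementation used in this research: lowercasing the words and returning 0 for an account that follows nobody are assumptions, and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def biography_vector(biography):
    """Relative word frequencies; words are split on spaces and punctuation."""
    # Assumption: words are lowercased before counting.
    words = [w for w in re.split(r"[\W_]+", biography.lower()) if w]
    counts = Counter(words)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def cossim(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def average_biography_similarity(target_bio, followed_bios):
    """Equation (1): mean COSSIM between the target and each followed account."""
    if not followed_bios:
        return 0.0  # assumption: no followed accounts means no similarity
    b_u = biography_vector(target_bio)
    total = sum(cossim(biography_vector(bio), b_u) for bio in followed_bios)
    return total / len(followed_bios)
```

For a target biography "politics, media" and two followed accounts with the biographies "politics, media" and "football", the result is about 0.5: one biography is identical and the other shares no words.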

3.3.2 Extraversion

Extraversion is the personality trait of being outgoing and energetic, and often associated with a higher amount of friends and contacts.

On Twitter, 'becoming a contact' is a one-way interaction. In contrast to other social networking sites, following a person does not require the permission of the other person. But the other person is also not obliged to follow back. For this reason, a politician's contacts are more likely to be the people he follows than the ones he is followed by. Also, the simple count of these friends is not a fair measure of his outgoingness. It is a given that a member's friend network on any social networking site grows over the time that he or she is a member. In the measure of extraversion, the number of friends should thus be corrected for the age of the Twitter account.

With this in mind, the normalized relationship quantity is defined as:

\[
\log\left(\frac{U_f}{\mathrm{EXP}(U_a)}\right) \tag{2}
\]

where

• U is the target Twitter account

• U_a is the account age in days

• U_f is the amount of users that account is following

• and EXP(x) is the expected amount of Twitter followers for a given age in days

The expected amount of Twitter followers was found with a short empirical study, in which a sample of 904 random Dutch Twitter profiles was analyzed to find a function that represents what amount of followers would be expected from a certain account age. From these Twitter accounts, the ones with more than 8000 following users were removed, because they are most likely spam accounts. A normal person is very unlikely to have more than 8000 contacts.

Table 1: Dutch translations of the 10 mini markers of Rammstedt and John (2007) used for gold standard peer rating.

| English | Dutch |
| --- | --- |
| ... is reserved | ... terughoudend en gereserveerd is |
| ... is generally trusting | ... mensen veel vertrouwen geeft, het goede ziet in mensen |
| ... tends to be lazy | ... comfortabel is, geneigd tot luiheid |
| ... is relaxed, handles stress well | ... relaxed is, goed met stress om kan gaan |
| ... has few artistic interests | ... weinig artistieke interesses heeft |
| ... is outgoing, sociable | ... uitgaand en gezellig is |
| ... tends to find fault with others | ... de neiging heeft om anderen te bekritiseren |
| ... does a thorough job | ... grondig te werk gaat |
| ... gets nervous easily | ... gemakkelijk nerveus en onzeker is |
| ... has an active imagination | ... een actieve verbeelding heeft en fantasierijk is |

On this dataset, a logarithmic function was fitted using regression to represent the amount of followers based on a Twitter account's age in days. The resulting function for the expected amount of Twitter followers, given the account age, is thus:

\[
\mathrm{EXP}(x) = 452.3626678331251 + 146.9562760301834 \cdot \log(x)
\]

with $R^2 = .032$.
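The normalized relationship quantity of equation (2) can be sketched with the fitted coefficients reported above. The base of the logarithm is not stated in the text, so the natural logarithm is assumed here; the function names are illustrative only.

```python
import math

# Coefficients of the fitted function EXP(x) as reported in the text.
INTERCEPT = 452.3626678331251
SLOPE = 146.9562760301834


def expected_following(account_age_days):
    """EXP(x): expected number of contacts for an account age in days.

    Assumption: log is the natural logarithm; the text does not state the base.
    """
    return INTERCEPT + SLOPE * math.log(account_age_days)


def normalized_relationship_quantity(following_count, account_age_days):
    """Equation (2): log of the observed over the expected number of contacts."""
    return math.log(following_count / expected_following(account_age_days))
```

An account that follows exactly the expected number of users scores 0; accounts that follow more users than expected for their age score above 0, and accounts that follow fewer score below 0. The sketch requires a following count above zero and an account age of at least one day.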

3.3.3 Agreeableness

Agreeableness is the trait of being kind. As described in the literature review, being considered kind or warm is mostly dependent on non-verbal communication. Because non-verbal communication is often replaced with emoticons on the internet, the assumption is made that agreeableness can be measured by the use of positive emoticons in communication with other persons. Another clue of kindness might be the use of words that indicate kindness, like "thank you".

Following this assumption, the direct tweet kindness index is the percentage of directed tweets (that is, tweets that are sent to a specific user) that contain at least one of the terms that indicate kindness. These terms are defined as the list of positive smileys (:), ;) etc.) in combination with "thanks", "thank you" etc.
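The direct tweet kindness index can be sketched as follows. The term list only contains the four examples named in the text and is not the full list used in this research; treating a tweet that starts with an @-mention as a directed tweet is also an assumption.

```python
# Illustrative subset of kindness terms; the full list is not given in the text.
KIND_TERMS = (":)", ";)", "thanks", "thank you")


def is_directed(tweet):
    """Assumption: a directed tweet starts with an @-mention."""
    return tweet.lstrip().startswith("@")


def direct_tweet_kindness_index(tweets, kind_terms=KIND_TERMS):
    """Percentage of directed tweets that contain at least one kindness term."""
    directed = [t for t in tweets if is_directed(t)]
    if not directed:
        return 0.0  # assumption: no directed tweets yields an index of 0
    kind = sum(1 for t in directed if any(term in t.lower() for term in kind_terms))
    return 100.0 * kind / len(directed)
```

Undirected tweets are ignored entirely, so a politician who only thanks people in general announcements would not score higher under this sketch.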

3.3.4 Neuroticism

On social media, people who score high on neuroticism will more often express negative emotions. Because of that, measuring how often negative emotions are reflected in posts will be a valid way to determine neuroticism.

To do that, we first need a list of words that describe such negative emotions. Because no extensive list in Dutch was found, an existing English list was translated (see footnote 2).

Using this list of negative emotion words, the following formula is used to describe neuroticism:

\[
\sum_{t \in T_U} \frac{\sum_{w \in W_t} \begin{cases} 1 & \text{if } w \in N \\ 0 & \text{otherwise} \end{cases}}{|W_t|} \tag{3}
\]

where

• U is the target Twitter account

• T_U is the list of tweets for a user

• W_t is the list of words in a tweet

• N is the list of negative words

Footnote 2: Negative Emotions list on negativeemotion-slist.com. Translated version can be found on https://gist.github.com/FrankHouweling/7fce4b89da4357744054
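Equation (3) can be sketched as follows. The tokenization (lowercasing and splitting on non-word characters) is an assumption, and the word list in the usage note is a made-up English stand-in for the translated Dutch list.

```python
import re


def tokenize(tweet):
    """Assumption: lowercase words, split on non-word characters."""
    return [w for w in re.split(r"\W+", tweet.lower()) if w]


def negative_emotion_score(tweets, negative_words):
    """Equation (3): per tweet, the share of words in N, summed over all tweets."""
    score = 0.0
    for tweet in tweets:
        words = tokenize(tweet)
        if not words:
            continue  # skip empty tweets to avoid dividing by zero
        score += sum(1 for w in words if w in negative_words) / len(words)
    return score
```

With the tweets "angry and sad today" and "nice weather" and the negative words "angry" and "sad", the first tweet contributes 2/4 and the second contributes nothing. As in equation (3), the sum is not divided by the number of tweets, so the score grows with the size of the timeline.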

3.3.5 Conscientiousness

People who score high on conscientiousness tend to spend more time on writing posts on social media. By spending more time on perfecting these posts, they will most likely make fewer spelling errors.

Measuring the amount of spelling errors in a text can thus be a way to estimate conscientiousness. The degree of spelling errors is defined as:

\[
\sum_{t \in T_U} \frac{\sum_{w \in W_t} \begin{cases} 1 & \text{if spelling error} \\ 0 & \text{otherwise} \end{cases}}{|W_t|} \tag{4}
\]

where

• U is the target Twitter account

• T_U is the list of tweets for a user

• W_t is the list of words in a tweet

Figure 2: Spider plot representing the big five personality scores for Twitter (blue) and the gold standard survey (red) for Lodewijk Asscher.

To check the spelling of a specific word, the complete tweet was used as the input for ASpell, an often-used open source spelling checker, using the default Dutch dictionary. To improve the results on tweets, @-mentions and hashtags were removed, because usernames and hashtagged terms often do not represent any real-life words. Because a lot of false positives occurred even with these adjustments, the 250 words most frequently flagged as misspelled were analyzed manually and transformed into an addition to the default dictionary. This additional dictionary consists mostly of words that are very specific to politics, and are therefore not part of any normal dictionary.
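The degree of spelling errors of equation (4), including the removal of @-mentions and hashtags, can be sketched as follows. A plain Python set stands in for the ASpell dictionary here, so this is not the spelling check used in this research; the regular expressions and function names are assumptions.

```python
import re

# Matches @-mentions and hashtags, which are removed before spell checking.
MENTION_OR_HASHTAG = re.compile(r"[@#]\w+")


def clean_tweet(tweet):
    """Remove @-mentions and hashtags from a tweet."""
    return MENTION_OR_HASHTAG.sub(" ", tweet)


def spelling_error_degree(tweets, dictionary, extra_words=frozenset()):
    """Equation (4): per tweet, the share of misspelled words, summed over tweets.

    dictionary stands in for the default dictionary, extra_words for the
    manually built addition with politics-specific terms.
    """
    known = set(dictionary) | set(extra_words)
    degree = 0.0
    for tweet in tweets:
        # Assumption: a word is a run of letters, compared in lowercase.
        words = re.findall(r"[^\W\d_]+", clean_tweet(tweet).lower())
        if not words:
            continue  # nothing left to check after cleaning
        degree += sum(1 for w in words if w not in known) / len(words)
    return degree
```

Passing the additional dictionary as a separate argument makes it easy to see how many flagged words were false positives caused by political jargon.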

3.4 Method of evaluation

In the theoretical background, two shortcomings in the evaluation of previous author profiling systems were discussed. In short, previous research used accuracy to evaluate the success of a method. But accuracy does not take into account that personality is measured on a continuous scale. It also does not take into account that the method is compared with a gold standard, not a ground truth, and that the new method may be better than the gold standard to which it is compared. To overcome both shortcomings, the system is evaluated in two ways.

Firstly, the author profiling results are compared to the gold-standard method for measuring personality. Because both methods contain a certain amount of measurement error, and both give results on a continuous scale, a Bland-Altman approach is argued to be the best evaluation method [10]. This method can show if two techniques are drastically different, even with only a small sample.
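A Bland-Altman comparison is commonly summarized by the mean difference between two methods (the bias) and the 95% limits of agreement around it. The sketch below shows that common form under the assumption that both methods score the same politicians on the same scale; it is not necessarily the exact variant used in this research.

```python
import statistics


def bland_altman(method_a, method_b):
    """Return the bias and the lower and upper 95% limits of agreement."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

A bias far from zero indicates a systematic difference between the two methods, while wide limits of agreement indicate that the methods disagree strongly on individual politicians.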

Secondly, when they do differ significantly, it is not directly concluded that the new system does not function. Instead, the results are evaluated further. A group of raters is introduced to the concept of the big five personality traits and told what these traits stand for. After that, they are shown all politicians with their personalities according to the two methods. For this, a spider plot as shown in figure 2 is used, because it enables easy comparison between the two personalities. Their assignment is to choose the best fitting personality for the politician.

4. RESULTS

4.1 Gold Standard data

For the gold standard, a survey was conducted in which raters used a short personality test to determine the perceived personality of different politicians. In total, 144 ratings were gathered. Each rating resulted in a score between -4 and 4 for all five personality traits. As can be seen in table 2, the results closely follow a normal distribution. No further transformations were applied to the data.

The mean values of the personality traits differ a lot from the midpoint of the complete range. Politicians in the survey score much higher than 0 (the midpoint of the full possible data range) for extraversion and conscientiousness, and much lower for neuroticism.

Table 2: Descriptives of the gold standard survey data.

Personality Trait    Mean    SD      Skewness   Kurtosis
Agreeableness        -0.3    2.079   -0.247     -0.690
Extraversion         0.76    1.900   -0.73      -0.409
Openness to Exp.     0.18    1.581   0.118      0.572
Conscientiousness    1.06    1.906   -0.283     -0.633
Neuroticism          -1.08   1.794   0.241      -0.198

5. DATA CLEANING

For a fair comparison between the gold standard method (survey) and the author profiling method introduced in this research, politicians have to be selected for whom enough data is available for both methods to base their results on. To make the selection process as fair as possible, a set of quality requirements was defined.

For a politician to be considered in the evaluation, he or she has to comply with the following:

• In the survey, at least three participants rated the politician.

• The survey result data for the politician approaches a normal distribution, with the mean of the standard deviations for the big five personality dimensions lower than 1.8. A low SD indicates that most values are near the mean, and thus that the raters generally agree.

• The Twitter account associated with the politician is active, which means at least one real tweet (no retweet) in the past week.

• The Twitter account follows at least 45 people.

These requirements make sure that the survey is based on enough opinions. Next to that, they make sure that the Twitter account contains enough tweets and befriended users to perform meaningful calculations. Table 4 in appendix B displays an overview of all politicians and their scores on these items. Politicians who are printed in bold passed the requirements, and are used in further analysis. These politicians are: Jeanine Hennis, Lodewijk Asscher, Arie Slob, Boris van der Ham, Emile Roemer, Marianne Thieme, Tofik Dibi, Fleur Agema and Alexander Pechtold. These nine politicians form the subset of data on which the following evaluation is based.
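The numeric requirements listed above can be sketched as a simple filter. This is only an illustration under assumptions: the `Candidate` record and its field names are hypothetical, the example rows are copied from Table 4, and the check that the survey data approaches a normal distribution is not modeled. Passing this filter is therefore necessary but not sufficient for selection.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record holding the Table 4 columns used for selection."""
    handle: str
    raters: int
    mean_sd: float   # mean of the five per-trait standard deviations
    following: int   # number of accounts the politician follows
    active: bool     # at least one original tweet in the past week


def passes_requirements(c, min_raters=3, max_mean_sd=1.8, min_following=45):
    """Apply the numeric thresholds; the normality check is left out."""
    return (
        c.raters >= min_raters
        and c.mean_sd < max_mean_sd
        and c.active
        and c.following >= min_following
    )


# A few rows taken from Table 4 (appendix B).
candidates = [
    Candidate("arieslob", 6, 1.434, 613, True),
    Candidate("henkkrol", 3, 1.875, 300, True),       # mean SD too high
    Candidate("jcdejager", 2, 1.273, 24, False),      # too few raters, inactive
    Candidate("sybrandbuma", 5, 1.863, 241, False),   # mean SD too high, inactive
    Candidate("fleuragemapvv", 7, 1.505, 45, True),
]

selected = [c.handle for c in candidates if passes_requirements(c)]
print(selected)
```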


6. EVALUATION

As described in the method section of this paper, the author profiling method introduced in this thesis is evaluated in two ways.

6.1 Testing for equality

Firstly, the new method is compared to the gold standard method. The gold standard method of personality determination, at this moment, consists of a survey with personality markers. From this survey, the mean of all raters was used as the resulting personality.

Before these two big five representations were compared, they were both transformed to a scale from 0 to 1. For most Twitter survey measures, this was already the case. The gold standard data all originally were on a scale from -4 to 4, and had to be transformed. To do this, the following simple transformation was used:

\[ \frac{v - \min}{\max - \min} \tag{5} \]

where v is the original gold standard score, min is -4 and max is 4.
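As a minimal sketch of this rescaling, the function below maps a survey score from the range -4 to 4 onto the range 0 to 1. The function name `rescale` and its parameter names are illustrative assumptions, not names from the thesis.

```python
def rescale(v, lo=-4.0, hi=4.0):
    """Min-max rescaling of a score v from [lo, hi] to [0, 1]."""
    if hi <= lo:
        raise ValueError("hi must be larger than lo")
    return (v - lo) / (hi - lo)


# The survey mean for conscientiousness from Table 2 (1.06) on the new scale.
print(rescale(1.06))
```

With this transformation the midpoint of the survey scale (0) maps to 0.5, so trait means above 0 in Table 2 end up above 0.5.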

After this, the difference between the two methods was calculated for each pair of politician and personality trait. Next to that, the mean of the values that come from the two methods was also calculated.

This information was then used to, for each personality trait, construct two plots in the way described by [10]. These plots are shown in figures 3 - 7 (appendix A).
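The quantities behind these plots can be sketched as follows. The input scores are hypothetical, and taking the difference as gold standard minus Twitter is an assumption, since the direction is not stated. The bias and the 1.96 SD limits of agreement are the usual Bland-Altman summary statistics; they are included for illustration and are not reported in this thesis.

```python
from statistics import mean, stdev


def bland_altman(gold, twitter):
    """Per-politician differences and means for one personality trait."""
    if len(gold) != len(twitter) or len(gold) < 2:
        raise ValueError("need two equally long lists with at least two scores")
    diffs = [g - t for g, t in zip(gold, twitter)]        # assumed direction
    means = [(g + t) / 2 for g, t in zip(gold, twitter)]
    bias = mean(diffs)      # systematic offset between the two methods
    spread = stdev(diffs)   # sample standard deviation of the differences
    limits = (bias - 1.96 * spread, bias + 1.96 * spread)
    return {"diffs": diffs, "means": means, "bias": bias, "limits": limits}


# Hypothetical scores on the 0 to 1 scale for three politicians.
result = bland_altman([0.5, 0.6, 0.7], [0.4, 0.6, 0.8])
print(result["bias"], result["limits"])
```

Plotting `means` on the x-axis against `diffs` on the y-axis gives the second plot of each pair; a bias far from zero would point to a systematic offset rather than random disagreement.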

From analyzing these plots, we can say that the two methods do not return the same results for the selected politicians. Next to that, the noise does not seem to follow a pattern, and further transformations will most likely not significantly improve the fit. The only possibility for a minor improvement would be to exclude outliers, as in the case of conscientiousness, where one politician clearly scores differently (higher on Twitter, lower on the survey) than the others.

6.2 Human evaluation

However, the fact that the Twitter-based measure differs from the gold standard does not necessarily mean that the method performs worse. This hypothesis is partly supported by the survey evaluation results.

The survey results (N = 12) were analyzed using a binomial test to find out whether one of the two methods was chosen significantly more often than we would expect by chance (= .5). The results are shown in table 3. Two politicians were found to be represented significantly better by the survey. These were Arie Slob (p = 0.002) and Tofik Dibi (p = 0.016). The other politicians were not represented significantly better by one of the two methods, but this can also be due to a lack of participants. Marianne Thieme and Alexander Pechtold, for example, have more votes for the Twitter analysis method's result than for the gold standard method.
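A two-sided exact binomial test of this kind can be sketched with the standard library alone. The function name is an illustrative assumption, and the thesis does not state which software was used. The vote counts in the example are taken from Table 3.

```python
from math import comb


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided p-value for k successes in n trials.

    Sums the probabilities of all outcomes that are at most as likely
    as the observed one under the null hypothesis.
    """
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small relative tolerance so that ties are not lost to rounding.
    total = sum(q for q in probs if q <= observed * (1 + 1e-9))
    return min(1.0, total)


# (votes for survey, votes for Twitter) per politician, from Table 3.
votes = {
    "Arie Slob": (10, 0),
    "Tofik Dibi": (7, 0),
    "Jeanine Hennis": (3, 8),
    "Lodewijk Asscher": (6, 4),
}
for name, (survey, twitter) in votes.items():
    print(name, round(binomial_two_sided_p(survey, survey + twitter), 3))
```

Under these assumptions the sketch reproduces the significance values of Table 3 for the four politicians shown (.002, .016, .227 and .754).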

Table 3: Binomial test results of survey evaluation of the gold standard (survey) method vs. the Twitter analysis method.

Politician           # survey   # Twitter   Sig.
Jeanine Hennis       3          8           .227
Lodewijk Asscher     6          4           .754
Arie Slob            10         0           .002
Boris van der Ham    4          3           1.000
Emile Roemer         7          4           .549
Marianne Thieme      2          7           .180
Tofik Dibi           7          0           .016
Fleur Agema          5          1           .219
Alexander Pechtold   5          7           .774

What catches the attention is the relatively low inter-rater agreement in the evaluation survey, even though this was a relatively simple task (two quite different personalities: which one fits the politician better?). For the inter-rater agreement calculation, Krippendorff's alpha was used, because it performs well with more than two raters and lots of missing data, as in our case [12]. Krippendorff's alpha for the evaluation survey was 0.3722, where .67 is generally seen as a minimum for meaningful results [12].
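For readers unfamiliar with the measure, the sketch below computes Krippendorff's alpha for nominal data with missing ratings, using the usual coincidence-matrix formulation. The function name and the toy data are assumptions; the actual ratings of the evaluation survey are not reproduced here.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal ratings.

    units: one list per rated item (here: per politician) with the choice
    of each rater; None marks a missing rating.
    """
    coincidences = Counter()
    for ratings in units:
        values = [r for r in ratings if r is not None]
        m = len(values)
        if m < 2:
            continue  # items with fewer than two ratings carry no information
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("not enough pairable ratings")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1.0 - observed / expected


# Toy example: two raters who disagree on both items.
print(krippendorff_alpha_nominal([["survey", "twitter"], ["survey", "twitter"]]))
```

An alpha of 1 means perfect agreement, 0 means agreement at chance level, and negative values indicate systematic disagreement.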

7. CONCLUSION

In this paper we investigate the possibilities of developing an author profiling system that determines personality by looking at characteristics of personality traits found in existing research.

During the research, such a system was constructed and evaluated. In the evaluation, the personality scores from the proposed system are found to be very different from the ones found by the personality test that serves as a gold standard. It would thus have been expected that the human evaluation, in which participants are asked to compare the personalities proposed by the gold standard personality test and by the system, would result in significantly more votes for one of the two methods. This is, however, not the case. The strong difference in resulting personalities between the two methods cannot be clearly seen in the human evaluations, where the personality scores from the author profiling system were only significantly worse than the personality scores from the personality test for two politicians. In the other cases, both methods performed equally well or the author profiling system performed slightly better.

Because of this, it is not possible to say if the system was successful, and no definitive answer can be given to the main research question.

7.1 Discussion

That the results are rather surprising is clear. When the results of two methods of personality determination are compared and found to be very different, one would expect one to be superior to the other. However, the human evaluation found that only two politicians were significantly better represented by the gold standard.


It should be mentioned that making a system that returns the same results as the gold standard personality test is not necessarily the final goal of this research. The goal is to make a system that shows the general view of a politician's personality by the public.

This can also be the reason why the human evaluation gives these results. Raters were found not to agree even in the simple task of choosing between two completely different personalities. Maybe there is no such thing as one average personality, and the view of politicians is instead highly polarized.

Next to the evaluation, we have found more interesting results, for example in the determination of the gold standard, where the traits did not always have a mean roughly equal to 0, but scored differently. Politicians were found to score relatively high on extraversion and conscientiousness, and much lower than expected on neuroticism.

This can of course be explained by the fact that politicians with certain personality characteristics have a higher likelihood of becoming well-known politicians [5]. These personality characteristics could very well result in a high score for extraversion (necessary in propagating the beliefs of the political party) and conscientiousness. Conscientiousness was found to positively correlate with overall job performance across multiple professions [2], so it could very well correlate positively with being successful as a politician as well.

7.2 Future work

That no clearly effective author profiling system was found does not necessarily mean that it is impossible to build one using personality characteristics found in research.

One way to make the proposed author profiling system more effective might be to use a different case. The image of politicians is highly shaped by the media, and may not really be representative of the real politician. Next to that, some politicians have spin doctors influencing their behavior on Twitter. This influences both the results of the survey and the results of the Twitter account, making the error larger. Analyzing normal people with a personal Twitter account, together with a survey among people who have a normal relationship with the target person in real life, requires more effort, but might give better results because of a more representative image on both the (personal) Twitter account and the survey.

Another way would be to use a larger set of measures. Because of the limited time available for this research, only one personality characteristic-based measure was used per personality trait. But such a measure can never give a complete view of the trait, since a trait is often identified using multiple characteristics. Expanding the set of measurements in such a way that multiple characteristics are combined to find all personality traits might improve the results significantly.

A final method to improve results would be to use a larger set of people to analyze. A larger sample would make it easier to find correlations between the gold standard and author profiling personalities, and reduce noise.

Next to improving the method in which personality is determined, the big five could also be improved. In the current implementation of the big five, there is no room for a description of how polarized the views on a certain person's personality are, even though this can be extremely interesting information. Because personality is not always objective, a polarization dimension would be able to predict how many people will agree.

Author profiling of personality is an interesting problem with many applications. But before we can accurately mine personalities, there needs to be more research on how personality can be identified on the internet. That is the only way in which a system can be designed that performs significantly better than the normal text-feature based classifiers. This research was a first step in this direction, with an introduction to this new approach. Follow-up research is necessary to improve our knowledge of and experience with the combination of personality mining and the internet.

8. REFERENCES

[1] Y. Amichai-Hamburger and G. Vinitzky. Social network use and personality. Computers in human behavior, 26(6):1289–1295, 2010.

[2] M. R. Barrick and M. K. Mount. The big five personality dimensions and job performance: A meta-analysis. 1991.

[3] J. Block. The five-factor framing of personality and beyond: Some ruminations. Psychological Inquiry, 21(1):2–25, 2010.

[4] J. C. Butler. Personality and emotional correlates of right-wing authoritarianism. Social Behavior and Personality: an international journal, 28(1):1–14, 2000.

[5] G. V. Caprara and P. G. Zimbardo. Personalizing politics: a congruency model of political preference. American Psychologist, 59(7):581, 2004.

[6] M. A. Conard. Aptitude is not enough: How personality and behavior predict academic performance. Journal of Research in Personality, 40(3):339–346, 2006.

[7] P. T. Costa and R. R. McCrae. Neo Personality Inventory-Revised (NEO PI-R). Psychological Assessment Resources, 1992.

[8] P. T. Costa and R. R. McCrae. Solid ground in the wetlands of personality: A reply to block. 1995.

[9] P. T. Costa and R. R. McCrae. The revised neo personality inventory (neo-pi-r). The SAGE handbook of personality theory and assessment, 2:179–198, 2008.

[10] G. E. Dallal. Comparing two measurement devices, 2000.

[11] F. De Fruyt, R. R. McCrae, Z. Szirmák, and J. Nagy. The five-factor personality inventory as a measure of the five-factor model: belgian, american, and hungarian comparisons with the neo-pi-r. Assessment, 11(3):207–215, 2004.

[12] K. De Swert. Calculating inter-coder reliability in media content analysis using krippendorff's alpha. Center for Politics and Communication, 2012.

[13] D. Ediger, K. Jiang, J. Riedy, D. Bader, C. Corley, R. Farber, W. N. Reynolds, et al. Massive social network analysis: Mining twitter for social good. In Parallel Processing (ICPP), 2010 39th International Conference on, pages 583–593. IEEE, 2010.


[14] G. Farnadi, S. Zoghbi, M.-F. Moens, and M. De Cock. Recognising personality traits using facebook status updates. Proc. of WCPR, pages 14–18, 2013.

[15] L. R. Goldberg. The development of markers for the big-five factor structure. Psychological assessment, 4(1):26, 1992.

[16] S. D. Gosling, P. J. Rentfrow, and W. B. Swann. A very brief measure of the big-five personality domains. Journal of Research in personality, 37(6):504–528, 2003.

[17] J. T. Hancock, C. Landrigan, and C. Silver. Expressing emotion in text-based communication. In Proceedings of the SIGCHI conference on human factors in computing systems, pages 929–932. ACM, 2007.

[18] B. F. Jeronimus, H. Riese, R. Sanderman, and J. Ormel. Mutual reinforcement between neuroticism and life experiences: A five-wave, 16-year study to test reciprocal causation. Journal of personality and social psychology, 107(4):751, 2014.

[19] O. P. John and S. Srivastava. The big five trait taxonomy: History, measurement, and theoretical perspectives. Handbook of personality: Theory and research, 2(1999):102–138, 1999.

[20] E. Leach. The influence of cultural context on non-verbal communication in man. Non-verbal communication, pages 315–349, 1972.

[21] S.-K. Lo. The nonverbal communication functions of emoticons in computer-mediated communication. CyberPsychology & Behavior, 11(5):595–597, 2008.

[22] A. P. López-Monroy, M. Montes-y-Gómez, H. J. Escalante, and L. Villaseñor-Pineda. Using intra-profile information for author profiling.

[23] R. Martin, D. Watson, and C. K. Wan. A three-factor model of trait anger: Dimensions of affect, behavior, and cognition. Journal of personality, 68(5):869–897, 2000.

[24] R. R. McCrae. Social consequences of experiential openness. Psychological bulletin, 120(3):323, 1996.

[25] M. Ogot and G. E. Okudan. The five-factor model personality assessment for improved student design team performance. European Journal of Engineering Education, 31(5):517–529, 2006.

[26] D. C. Pennington. Essential personality. Oxford University Press, 2003.

[27] B. Rammstedt and O. P. John. Measuring personality in one minute or less: A 10-item short version of the big five inventory in english and german. Journal of research in Personality, 41(1):203–212, 2007.

[28] F. Rangel, P. Rosso, M. Moshe Koppel, E. Stamatatos, and G. Inches. Overview of the author profiling task at pan 2013. In CLEF Conference on Multilingual and Multimodal Information Access Evaluation, pages 352–365. CELCT, 2013.

[29] G. Seidman. Self-presentation and belonging on facebook: How personality influences social media use and motivations. Personality and Individual Differences, 54(3):402–407, 2013.

[30] E. R. Thompson. Development and validation of an international english big-five mini-markers. Personality and Individual Differences, 45(6):542–548, 2008.

[31] Twitter. Twitter api documentation.

[32] B. Verhoeven and W. Daelemans. Clips stylometry investigation (csi) corpus: A dutch corpus for the detection of age, gender, personality, sentiment and deception in text. In Proc. of the 9th Int. Conf. on Language Resources and Evaluation, 2014.


APPENDIX

A. BLAND-ALTMAN PLOTS FOR EVALUATION

For each personality trait, the difference between the resulting scores for the gold standard method (y-axis) and the Twitter analysis method (x-axis) is shown on the left. On the right, we see a plot with the relation between the mean of the two methods and the difference between the results.

The plot on the left shows to what extent the results of the two methods agree with each other. When the dots follow the line in the center, the methods often agree. The plot on the right can then be used to analyze the differences. When all dots appear to be on one side of the line, we can conclude that there is most likely an error that can be resolved by a transformation of the data. When the dots seem to be all over the plot, it means that the error is random, and that no improvement is possible.

Figure 3: Agreeableness


Figure 5: Extraversion


Figure 6: Neuroticism

Figure 7: Openness to experience

B. DESCRIPTIVES FOR DATA CLEANING


Table 4: Politician selection based on survey data spread and Twitter data availability. Bold items (marked here with *) were used for further evaluation. The columns Agree. to Neur. give the standard deviation per trait (Agreeableness, Extraversion, Openness, Conscientiousness, Neuroticism); # Following and Active are the Twitter measures.

Politician         Raters  Agree.  Extrav.  Open.  Consc.  Neur.  Mean SD  # Following  Active
marijnissenl       1       0.000   0.000    0.000  0.000   0.000  0.000    139          1
SophieintVeld      1       0.000   0.000    0.000  0.000   0.000  0.000    1157         1
WassilaHachchi     1       0.000   0.000    0.000  0.000   0.000  0.000    2164         1
anouchkavm         2       0.707   0.707    0.707  2.121   0.000  0.848    5            1
bramvanojikgl      2       0.707   1.414    0.707  1.414   0.000  0.848    350          0
jcdejager          2       0.000   2.121    2.828  0.707   0.707  1.273    24           0
jolandesap         2       2.823   2.121    0.707  2.828   0.000  1.696    196          0
tunahankuzu        2       2.121   0.000    0.707  2.121   0.707  1.131    137          1
halbezijlstra      3       1.528   1.732    1.000  1.732   0.000  1.198    108          0
henkkrol           3       1.528   2.646    1.528  2.517   1.155  1.875    300          1
jaspervandijksp    3       1.528   1.000    0.577  1.000   1.000  1.021    325          1
keesvdstaaij       3       0.577   2.646    2.000  1.528   0.000  1.350    1387         1
MonaKeijzer        3       0.000   1.000    0.577  0.577   2.082  0.847    641          1
piadijkstra        3       4.041   2.517    2.309  2.517   1.000  2.477    996          1
hansspekman        4       3.416   1.500    1.258  1.915   0.957  1.809    418          1
*jeaninehennis     4       1.258   1.258    1.500  2.062   2.082  1.632    3489         1
jesseklaver        4       1.258   1.708    1.500  1.732   1.000  1.440    497          0
*lodewijka         4       1.258   0.816    0.000  1.732   1.000  0.961    1045         1
sybrandbuma        5       1.140   2.490    1.949  1.789   1.949  1.863    241          0
*arieslob          6       1.633   1.211    1.673  1.049   1.602  1.434    613          1
*borisham          6       0.753   1.033    1.378  1.835   1.366  1.273    4218         1
camieleurlings     6       1.506   2.168    1.975  1.329   1.751  1.746    76           0
diederiksamsom     6       0.816   1.722    1.366  2.422   2.041  1.673    476          0
*emileroemer       6       1.835   1.366    1.033  1.966   1.049  1.450    540          1
ericaterpstra      6       2.000   0.837    2.563  2.160   1.633  1.839    3            0
femkehalsema       6       0.753   1.169    1.549  1.366   1.366  1.241    660          0
jpbalkenende       6       2.041   2.608    1.835  2.530   2.229  2.249    2            1
*mariannethieme    6       2.258   2.000    1.265  0.983   1.862  1.674    3088         1
MinPres            6       1.643   1.673    0.516  1.095   1.633  1.312    0            1
*tofikdibi         6       1.366   0.753    1.033  1.789   1.789  1.346    445          1
*fleuragemapvv     7       2.000   1.574    0.787  1.272   1.890  1.505    45           1
geertwilderspvv    8       0.707   1.408    2.200  2.200   1.669  1.637    0            1
jdijsselbloem      8       2.066   1.996    1.309  2.134   1.512  1.803    353          1
*apechtold         9       2.128   1.658    1.716  1.323   2.028  1.771    450          1
Averages                   1.394   1.437    1.237  1.581   1.149  1.359