
Multi-Emotion Detection in Dutch Political Tweets

Vincent S. Erich
10384081
Bachelor thesis
Credits: 18 EC

Bachelor Opleiding Kunstmatige Intelligentie
University of Amsterdam
Faculty of Science
Science Park 904
1098 XH Amsterdam

Supervisors:
dhr. drs. Isaac Sijaranamual
dhr. dr. Evangelos Kanoulas

Information and Language Processing Systems
Faculty of Science
University of Amsterdam
Science Park 904
1098 XH Amsterdam


Abstract

This work describes the development of an emotion classification system (i.e., a classifier) for ThemeStreams that can detect and classify multiple emotions in Dutch political tweets. Using data from different installations of ThemeStreams, a hand-labelled dataset of 399 tweets covering eight emotion categories has been realised. Two problem reduction methods have been tested for performing multi-label classification: the binary relevance method and the random k-labelset (RAKEL) method. Using the binary relevance method as problem reduction method, a multi-label classifier has been developed that achieves an overall F1 score of 0.209 on the developed dataset. This F1 score has been achieved by testing different combinations of features and classifier parameters, of which the combination that uses unigrams, bigrams, and Part-of-Speech information as features results in the highest overall F1 score.


Contents

1 Introduction

2 Related Work

3 Method and Approach

3.1 Data and Data Reduction

3.2 Labelling the Dataset

3.3 Data Preprocessing

3.4 Feature Set

3.5 Classification Algorithms

3.5.1 The Binary Relevance Method

3.5.2 The Random k-Labelset Method

4 Results

5 Conclusion

6 Discussion and Future Research


1 Introduction

Emotions play an important role in everyday life: they influence decision-making, social interaction, perception, memory, and more. Emotions are not only communicated verbally and visually, but also textually in the form of microblog posts, blog posts, and forum discussions. Due to the increasing amount of such emotion-rich textual content, recognition of emotion in written text has received a great amount of attention in recent years. Much of this attention has focused on sentiment analysis and opinion mining, which deal with the automatic identification and extraction of a writer's attitude with respect to some topic (i.e., whether the expressed opinion is positive, negative, or neutral).

This work focuses on the automatic detection and classification of more fine-grained emotional states in the context of tweets. This work is the consequence of the development of ThemeStreams (de Rooij et al., 2013). ThemeStreams is an on-line web interface that visualizes the stream of themes discussed in politics. The system maps political discussions to themes and influencers and visualizes this mapping. The current system provides an overview of the topics discussed in Dutch politics by analysing the tweets from people from four influencer groups.

The goal of this work is to develop an emotion classification system (i.e., a classifier) for ThemeStreams that can be used to examine whether the emotions in the tweets affect the political discussions. The classifier must be able to detect multiple emotions in the tweets and classify these emotions. If a tweet expresses multiple emotions, the classifier must assign multiple emotion labels to the tweet (i.e., it is a multi-label classification task). Thus, the following research question is addressed in this work: How to develop a classifier that can detect multiple emotions in Dutch political tweets?

To answer this question, data dumps from different installations of ThemeStreams are used. Every data dump consists of a couple of hundred thousand tweets in JSON format. The label set that is used is the same as the one that has been used by Buitinck et al. (2015) and consists of the seven basic emotions identified by Shaver et al. (1987), with the emotion 'interest' added. This work explores how to develop a labelled dataset, what features to use, and what supervised machine learning algorithm to use.

This work proceeds in five parts. First, the related work is reviewed in Section 2; Section 3 then describes the approach that has been taken, followed by the results in Section 4. Section 5 summarises the work, and Section 6 concludes with the discussion and plans for future research.


2 Related Work

Recognition of emotion in written text has been the focus of much research in recent years; there has been a great deal of research on developing models that can automatically classify texts as expressing certain emotions. Much of the initial work on emotion recognition in written text has focused on polarity classification, that is, whether a text expresses positive, negative, or neutral emotions. Alm et al. (2005), for example, have explored the automatic classification of sentences in children’s fairy tales as expressing positive, negative, or no emotions (neutral).

However, the recent trend has been to go beyond polarity classification and detect a broader spectrum of emotions. This yields the question of what kind of emotions can be detected in written text. Most label sets that are used in emotion recognition tasks seem to be based on the six basic emotions identified in the hierarchical cluster analysis done by Shaver et al. (1987), or on the six basic emotions identified by Ekman (1992).

Danisman and Alpkocak (2008), for example, use a Vector Space Model (VSM) for the automatic classification of anger, disgust, fear, joy, and sad emotions at the sentence/snippet level. Their classifier achieves an overall F-measure value of 32.22% in a five-way multiclass prediction problem on the SemEval dataset. In a more recent study, Buitinck et al. (2015) describe and experimentally validate two algorithms that can detect multiple emotions in sentences of movie reviews. They do so by reducing the multi-label problem to either binary or multiclass learning.

More related to this work is that of Wang et al. (2012) and Balabantaray et al. (2012), who both perform emotion recognition on tweets. Wang et al. (2012) demonstrate that a combination of unigrams, bigrams, sentiment/emotion-bearing words, and Part-of-Speech information seems to be the most effective for extracting emotions from tweets. They achieve an accuracy of 65.57% in a seven-way multiclass prediction problem using training data consisting of about 2 million tweets (automatically collected and labelled). Balabantaray et al. (2012) use a hand-labelled dataset of 8150 tweets, and use a classifier based on multiclass SVM kernels to classify tweets into six categories of emotion. Their classifier achieves an accuracy of 73.24%.

This work differs from the work described above in two important ways. First, it is not assumed that emotions are mutually exclusive (i.e., a tweet can express multiple emotions). Second, the dataset that is used in this work consists of a few hundred hand-labelled Dutch tweets, whereas earlier studies have used datasets consisting of thousands of English tweets.


Table 1: Comparison of different approaches. *: anger, disgust, fear, sadness, joy. **: anger, fear, interest, joy, love, sadness, surprise. ***: happiness, sadness, anger, disgust, surprise, fear + neutral class. ****: joy, sadness, anger, love, fear, thankfulness, surprise.

Study | Dataset | Classifier | Performance

Danisman and Alpkocak (2008) | Domain: sentences/text snippets; Size: 7666 text snippets; five emotion categories* | VSM | F1: 32.22%

Buitinck et al. (2015) | Domain: sentences of English movie reviews; Size: 629 sentences; seven emotion categories**; multiple labels per sentence; manually annotated | One-vs.-Rest and Random k-labelset | Acc: 0.841, F1: 0.432 (One-vs.-Rest); Acc: 0.854, F1: 0.456 (Random k-labelset)

Balabantaray et al. (2012) | Domain: English tweets; Size: 8150 tweets; six emotion categories***; one label per tweet; manually annotated | SVMs | Acc: 73.24%

Wang et al. (2012) | Domain: English tweets; Size: ± 2.5M tweets; seven emotion categories****; one label per tweet; automatically annotated | LIBLINEAR | Acc: 65.57%

3 Method and Approach

As described in the introduction, the goal of this work is to develop an emotion classification system (i.e., a classifier) for ThemeStreams that can detect and classify multiple emotions in Dutch political tweets. This section describes the approach that has been taken to achieve this goal and proceeds in five parts. First, Section 3.1 describes the data and the steps that have been taken to reduce the data. The development of a labelled dataset is described in Section 3.2, followed by an overview of the data preprocessing in Section 3.3. Section 3.4 then describes the feature set. Finally, Section 3.5 describes the learning algorithms.

3.1 Data and Data Reduction

The data that has been used in this work was obtained from four data dumps from different installations of ThemeStreams. Every data dump consists of a couple of hundred thousand tweets in JSON format (JSON strings). Every JSON string in a data dump contains a tweet and metadata about the writer of the tweet (i.e., the influencer group to which the writer belongs, profile information) and the tweet itself (e.g., whether the tweet is a retweet, and whether there are links, hashtags, and mentions in the tweet). Every JSON string also has a unique identifier (id).


Table 2: An overview of the data reduction process. Combining the data from the four data dumps resulted in a dataset of 1100306 tweets. The final dataset contains 1500 tweets.

Step Number of tweets in the dataset

1. Combining the data from the four data dumps 1100306

2. Removing duplicate tweets 368460

3. Applying filtering heuristics 188399

4. Random sample selection 1500


Combining the data from the four data dumps resulted in a dataset of 1100306 tweets (JSON strings). However, since two of the data dumps were from two different installations of ThemeStreams that simultaneously collected data, the resulting dataset contained a lot of duplicate tweets. Removing duplicate tweets from the dataset resulted in a new dataset of 368460 tweets; a reduction of 66.51%. A set of filtering heuristics was developed to filter out irrelevant tweets (i.e., tweets that do not express any emotion). For this, a number of filtering heuristics described in Wang et al. (2012, pp. 588-589) were adopted and implemented. The filtering heuristics are as follows (a code sketch of these heuristics follows the list):

• Discard retweets (i.e., tweets starting with 'RT').

• Discard tweets which contain URLs.

• Discard tweets which contain quotations.

• Discard tweets which have fewer than five words; user mentions (i.e., '@person') are not counted as words.

• Discard tweets having more than three hashtags.
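As an illustration, a minimal sketch of these heuristics in Python is given below; the way quotations are detected and the simple whitespace tokenisation are assumptions, not details taken from the original implementation.

import re

def keep_tweet(text):
    # Discard retweets (tweets starting with 'RT').
    if text.startswith("RT"):
        return False
    # Discard tweets which contain URLs.
    if re.search(r"https?://\S+|www\.\S+", text):
        return False
    # Discard tweets which contain quotations (assumption: straight or curly double quotes).
    if any(q in text for q in ('"', '\u201c', '\u201d')):
        return False
    tokens = text.split()
    # Discard tweets with fewer than five words; user mentions do not count as words.
    if len([t for t in tokens if not t.startswith("@")]) < 5:
        return False
    # Discard tweets having more than three hashtags.
    if len([t for t in tokens if t.startswith("#")]) > 3:
        return False
    return True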

A random sample of 300 tweets was selected from the aforementioned dataset of 368460 tweets. For every tweet in this sample it was determined whether it was relevant or not, and, if the tweet was irrelevant, whether one of the filtering heuristics would discard it. Based on this analysis it was assumed that the filtering heuristics are effective.

Applying the filtering heuristics to the aforementioned dataset of 368460 tweets resulted in a new dataset of 188399 tweets; a reduction of 48.87%. However, since this new dataset still contained too many tweets to label manually, a random sample of 1500 tweets was selected as the final dataset. Table 2 summarises the data reduction process.


Table 3: The eight emotions (emotion categories) that are included in the label set (in English and in Dutch).

English              Dutch
1. Anger             Boosheid
2. Fear              Angst
3. Interest          Interesse
4. Joy               Vreugde
5. Sadness           Verdriet
6. Surprise          Verrassing
7. Disgust/contempt  Afschuw
8. Love              Liefde

3.2 Labelling the Dataset

Since the tweets in the final dataset did not have emotion labels associated with them (i.e., it was an unlabelled dataset), the tweets had to be labelled. The label set that is used in this work is the same as the one that has been used by Buitinck et al. (2015) and consists of the seven basic emotions identified by Shaver et al. (1987), with the emotion ‘interest’ added. Thus, the label set consists of the following eight emotions (emotion categories): anger, fear, interest, joy, sadness, surprise, disgust/contempt, and love.

CrowdFlower has been used to obtain a labelled dataset. CrowdFlower is an online crowdsourcing service that (amongst others) allows customers to put a labelling task online (tasks are called 'jobs' in the CrowdFlower platform) and let contributors label the data for a small fee.

For every emotion category, a CrowdFlower job has been set up which resulted in eight different jobs. Every job is a binary classification task. Another option would have been to set up a single CrowdFlower job and ask contributors to select all the emotion categories that are present in a tweet (from the eight emotion categories that are included in the label set). However, selecting all the emotion categories that are present in a tweet from a predefined set of emotion categories is more difficult than deciding whether a single emotion category is present in a tweet. It was therefore decided to use the aforementioned setup and set up one CrowdFlower job per emotion category. It is also assumed that this results in a higher-quality labelled dataset.

Every job used the same final dataset of 1500 tweets. However, since CrowdFlower only allows spreadsheets to be uploaded (.csv, .tsv, .xlsx, or .ods), the final dataset (with JSON strings) was converted to a .csv file with three columns: ‘tweet id’ (the id of the tweet), ‘name’ (the name of the author of the tweet), and ‘content’ (the tweet itself).
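A minimal sketch of this conversion is shown below; the JSON field names (id, user.name, text) are assumptions about the ThemeStreams dumps, not documented facts.

import csv, json

with open("tweets_final.json", encoding="utf-8") as src, \
        open("crowdflower_upload.csv", "w", newline="", encoding="utf-8") as dst:
    writer = csv.writer(dst)
    writer.writerow(["tweet id", "name", "content"])
    for line in src:                       # one JSON string (tweet) per line
        record = json.loads(line)
        writer.writerow([record["id"], record["user"]["name"], record["text"]])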


Figure 1: A question from the job for the emotion anger (Dutch: boosheid ). (A) The title of the job. (B) The ‘Instructions’ button. If clicked, the instructions are shown to the contributor. (C) The tweet in question. (D) The question and the choices/answer options.

In every job, contributors were asked whether the author of a tweet was upset about something (in case of the emotion anger), anxious for something (in case of the emotion fear), interested in something (in case of the emotion interest), etc. For the emotion love, contributors were asked whether the author of a tweet had affection/devotion for someone/something. Contributors always had three choices: 'Yes', 'No', or 'No idea'. Figure 1 shows a question from the job for the emotion anger.

Every job was provided with the same set of instructions. The instruc-tions gave an overview of the job, a description of the procedure, and a short summary. Below are some important points regarding the instructions:

• Since the tweets are in Dutch, it was explicitly stated that the jobs were only intended for contributors who can read (and understand) Dutch.

• Contributors were asked to determine whether the author of a tweet was upset about something (in case of the emotion anger ), anxious for something (in case of the emotion fear ), interested in something (in case of the emotion interest ), etc., not the person for whom the tweet is intended.


• It does not matter to what extent the emotion in question occurs in a tweet.

• Since emotional states are generally not mutually exclusive, contributors were asked to focus solely on the emotion in question.

• Contributors were asked not to overuse the choice ‘No Idea’ and to only select this choice if they were in doubt or had no idea.

Before the quality settings of the jobs are discussed, it is important to address the following two points. (1) Before launching a job, it is required to create a number of test questions. Test questions are used to train contributors and to remove contributors that do not perform well enough. When creating test questions, CrowdFlower randomly selects a row (tweet) from the uploaded dataset and allows you to give the correct answer (it is also possible to skip a row, whereupon CrowdFlower just selects another random row from the uploaded dataset). It is also possible to upload test questions. (2) When a job is launched, the uploaded dataset is divided into a number of pages. Each page shows a variable number of rows (tweets) that contributors can label, for example: if the uploaded dataset consists of 100 rows, and the number of rows per page is set to ten, then the dataset will be divided into ten pages.

In order to obtain a high-quality labelled dataset, the number of contributors who answer a given question (i.e., label a given tweet) was set to three throughout all the jobs. The number of rows (tweets) displayed on each page was set to ten, and only contributors from Dutch-speaking countries (i.e., Belgium, the Netherlands, Suriname) were admitted to the jobs. Furthermore, contributors had to maintain a minimum accuracy of 70% on the test questions, and the minimum time it should take a contributor to complete a page of work was set to 30 seconds (i.e., three seconds per tweet on average). Contributors who did not maintain a minimum accuracy of 70% on the test questions, or completed a page of work in less than 30 seconds, were removed from the jobs (including the data they had labelled). When a contributor has worked on a job, he/she can voluntarily take part in a contributor exit survey. In this exit survey, the job is assessed on five points: 'Overall', 'Instructions Clear', 'Test Questions Fair', 'Ease Of Job', and 'Pay'. In order to gain insight into the quality of the created jobs, a test run was done for every job in which the first 100 rows of the uploaded dataset were launched. Since the results of the contributor exit surveys showed that every job scored low on 'Test Questions Fair' (it turned out that in every job too many test questions were being contested by contributors), changes were made to the test questions: every test question that was being contested by contributors was reconsidered and, if the contributors' contentions were deemed justified, the test question was removed from the job. Furthermore, since CrowdFlower indicated that every job contained too few test questions, extra test questions were created as follows: for every job, the aggregated results were downloaded (these show the single, top answer for every row) and every row that was not a test question and where the confidence of the answer was 1.0 (i.e., all three contributors agreed upon the answer) was transformed into a test question.


Figure 2: The results of the contributor exit surveys of the jobs (overall, results of the test runs are included). In every exit survey, a job is assessed on five points: ‘Overall’, ‘Instructions Clear’, ‘Test Questions Fair’, ‘Ease Of Job’, and ‘Pay’. The maximum score for each category is five.


Due to the limited resources that were available to us, it was only possible to launch another 300 rows for every job (i.e., rows 101 to 400 of the uploaded dataset). Thus, only 400 rows of the uploaded dataset have been labelled in every job.

Figure 2 shows the results of the contributor exit surveys of the jobs. Since the overall score for every job is ≥ 4, it is assumed that the jobs are of good quality. Although every job was provided with the same set of instructions, the score for 'Instructions Clear' varies over the jobs, with the job for the emotion joy having the lowest score (4.1) and the job for the emotion love having the highest score (5). Since the score for 'Instructions Clear' is ≥ 4.1 for every job, it can be assumed that contributors knew what was expected of them.


Table 4: The label distribution in the annotated final dataset (399 tweets).

Label              Absolute label frequency   Percentage of total number of labels (%)
Anger              68                         14.91
Fear               9                          1.97
Interest           130                        28.51
Joy                51                         11.18
Sadness            14                         3.07
Surprise           41                         8.99
Disgust/contempt   113                        24.78
Love               30                         6.58
Total              456                        100.00

Even though test questions were created by selecting tweets where three contributors agreed upon the answer (see above), the scores for 'Test Questions Fair' are not as high as was expected: the job for the emotion interest has the lowest score for 'Test Questions Fair' (2.6) and the job for the emotion love has the highest score for 'Test Questions Fair' (3.8). The low scores for 'Test Questions Fair' might be due to the subjective character of emotions or the difficulty of detecting an emotion. Nevertheless, the created test questions were assumed to be fair and to represent what was expected of contributors.

The score for ‘Ease Of Job’ also varies over the jobs, with the job for the emotion joy having the lowest score (3.8) and the job for the emotion love having the highest score (4.8). Since the score for ‘Ease Of Job’ is ≥ 3.8 for every job, it is assumed that contributors found the jobs easy (which is why eight different jobs were set up). Further analysis of the results in Figure 2 is beyond the scope of this work.

The final dataset of 1500 tweets (JSON strings) was labelled using the aggregated results from every job. The aggregated results show the single, top answer for every row (i.e., the answer that most contributors agree upon). Only rows (tweets) that have been labelled in all eight jobs are included in the labelled final dataset. However, this resulted in 399 labelled tweets instead of 400 labelled tweets2. The label distribution is given in Table 4.

Of the 399 tweets, 127 have no label(s), showing that expression of emotions is not prevalent in the tweets. Table 5 maps the number of labels per tweet to the number of tweets, showing that the maximum number of labels per tweet is five (the combination 'Anger-Fear-Interest-Sadness-Disgust/contempt', which occurs once). Table 6 shows the text of the tweet that has five labels associated with it, together with four other examples from the annotated final dataset.

2 The reason for this is unknown. It is assumed that something went wrong in one of the CrowdFlower jobs.

Table 5: Mapping of the number of labels per tweet to the number of tweets.

Number of labels per tweet   Number of tweets   Percentage of total number of tweets (%)
0                            127                31.83
1                            140                35.09
2                            89                 22.31
3                            35                 8.77
4                            7                  1.75
5                            1                  0.25
6                            0                  0.00
7                            0                  0.00
8                            0                  0.00
Total                        399                100.00

Table 6: Five tweets from the annotated final dataset with their associated label(s).

Tweet | Associated label(s)

"@cornaldm @freekvonk @DWDDUniversity Infotainment wint van coverpulp hoera!" | Joy

"@peterkwint nou dat is niet best. Al kun je dit verwachten van de Privé. Daar kun je nauwelijks intelligentie verwachten? Alleen gossip..." | Anger, Disgust/contempt

"@jorisluyendijk ligt ie al in de winkel? Ik heb zondag een boekenbon gevonden! :D" | Interest, Joy

"@Bertine83 Ja, hè? En wat een leuke, frisse CU-dame! Stak gunstig af tegen de tikje onsympathiek debatterende De Rouwe." | Joy, Surprise, Love

"@JohnKerstens Het is maar wat je leest, feit is dat steeds meer mensen in mijn omgeving hun baan kwijt raken of salaris in moeten leveren" | Anger, Fear, Interest, Sadness, Disgust/contempt

3.3 Data Preprocessing

Before features were extracted to train and test the classifier, the tweets in the annotated final dataset had to be preprocessed. The data preprocessing involves four steps that are adopted from Wang et al. (2012, p. 589). First, all the words were lower-cased. Second, user mentions (e.g., '@person') were replaced with '@user'. Third, letters and punctuation marks that are repeated more than twice were replaced with the same two letters or punctuation marks. Fourth, and finally, some predefined informal expressions were normalized. Table 7 gives an overview of the predefined informal expressions and their normalized form. Wang et al. also performed a fifth preprocessing step: they stripped hash symbols (since they used the hashtags as the source for their labels). However, since hashtags are used as a feature for the classifier (see Section 3.4), this preprocessing step was not implemented.

Table 7: An overview of the predefined informal expressions and their normalized form.

Informal expression(s)   Normalized form
mn, m'n                  mijn
zn, z'n                  zijn
zo'n                     zo een
n, 'n                    een
m, 'm                    hem
t, 't                    het
idd                      inderdaad

As an example, Figure 3 shows the result of applying the preprocessing steps described above to the (example) tweet “@persoon JAAA ik heb zo’n zin in de vakantie!!!!”.

Tweet before preprocessing: "@persoon JAAA ik heb zo'n zin in de vakantie!!!!"
Tweet after preprocessing: "@user ja ik heb zo een zin in de vakantie!!"

Figure 3: An example of a tweet before and after preprocessing.
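A minimal sketch of the four preprocessing steps in Python is given below; the informal expressions come from Table 7, while the regular expressions are an assumed reading of 'repeated more than twice'.

import re

INFORMAL = {"mn": "mijn", "m'n": "mijn", "zn": "zijn", "z'n": "zijn", "zo'n": "zo een",
            "n": "een", "'n": "een", "m": "hem", "'m": "hem", "t": "het", "'t": "het",
            "idd": "inderdaad"}

def preprocess(tweet):
    tweet = tweet.lower()                          # 1. lower-case all words
    tweet = re.sub(r"@\w+", "@user", tweet)        # 2. replace user mentions with '@user'
    tweet = re.sub(r"(.)\1{2,}", r"\1\1", tweet)   # 3. collapse characters repeated more than twice
    tokens = [INFORMAL.get(tok, tok) for tok in tweet.split()]
    return " ".join(tokens)                        # 4. normalize predefined informal expressions

print(preprocess("@persoon JAAA ik heb zo'n zin in de vakantie!!!!"))
# prints: @user jaa ik heb zo een zin in de vakantie!!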

3.4 Feature Set

Based on the related work described in Section 2, a feature set was realised that includes the following six features: unigrams, bigrams, hashtags, emoticons, unigrams over the Part-of-Speech (POS) tags, and bigrams over the POS tags.

Unigrams and bigrams: N-gram features are used in the studies of Wang et al. (2012) and Balabantaray et al. (2012). We have worked with unigrams (N = 1) and bigrams (N = 2), which were extracted by looping over all the tweets in the annotated final dataset. Punctuation and emoticons were also included, and neither stemming nor stop word removal was applied (as in Wang et al. (2012, p. 591)). A tf-idf weighted feature was used for each unigram and bigram, using a sublinear term frequency (tf):

tf(t, d) = 1 + log(f_{t,d})    (1)

and a smoothed inverse document frequency (idf):

idf(t, D) = 1 + log((N + 1) / (1 + |{d ∈ D : t ∈ d}|))    (2)

In these equations, t is a unigram or bigram, d is a tweet in the annotated final dataset D, f_{t,d} is the number of times t occurs in d, N is the total number of tweets in the annotated final dataset, and |{d ∈ D : t ∈ d}| is the number of tweets in the annotated final dataset that contain unigram or bigram t.

Hashtags: Hashtags were extracted by looping over all the tweets in the annotated final dataset. A tf-idf weighted feature was used for each hashtag.

Emoticons: Since emoticons are often used to express emotions, it was decided to include them in the feature set. A predefined set of emoticons was constructed and a tf-idf weighted feature was used for each emoticon in this set. The predefined set of emoticons is:

:) (: :( ): :)) ((: :(( )): :-) (-: :-( )-: :-)) ((-: :-(( ))-: :/ :\ /: \: ;) ;-) ;)) ;-)) :p ;-p

Unigrams and bigrams over the Part-of-Speech (POS): POS features have proven effective in the studies of Wang et al. (2012) and Balabantaray et al. (2012). Frog3 (Bosch et al., 2007) was used for POS tagging. Though the tagger that is used in Frog is not trained on Dutch tweets (it is trained on a broad selection of manually annotated Part-of-Speech tagged corpora for Dutch), it is assumed that the tagger is effective for Dutch tweets. The unigrams and bigrams over the POS tags were extracted by looping over all the tagged tweets in the annotated final dataset. Again, a tf-idf weighted feature was used for each POS-unigram and POS-bigram.

All the features in the feature set have a variable threshold t (except for the emoticons). The value of this threshold determines in how many different tweets an extracted feature must appear, for example: if the threshold for unigrams is set to three, then all the extracted unigrams must appear in at least three different tweets; extracted unigrams that appear in fewer than three different tweets are discarded. The thresholds were used to improve the performance of the classifier (more on this in Section 3.5).
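As an illustration, the sketch below shows how such tf-idf weighted n-gram features with a document-frequency threshold could be extracted with scikit-learn; the min_df parameter plays the role of the threshold t, and the parameter values are illustrative rather than the settings used in this work.

from sklearn.feature_extraction.text import TfidfVectorizer

tweets = ["@user jaa ik heb zo een zin in de vakantie!!",
          "@user wat een slecht debat vanavond",
          "@user mooi debat vanavond over onderwijs"]

# Unigrams and bigrams with sublinear tf and smoothed idf (cf. Equations 1 and 2);
# min_df=2 keeps only n-grams that occur in at least two different tweets.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, smooth_idf=True, min_df=2)
X = vectorizer.fit_transform(tweets)
print(X.shape)
print(vectorizer.get_feature_names_out())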

3.5 Classification Algorithms

There exist multiple problem transformation methods for multi-label classification. Two problem transformation methods were tested: the binary relevance method (Tsoumakas et al., 2007) and the random k-labelset (RAKEL) method (Tsoumakas and Vlahavas, 2007). The binary relevance method was implemented and tested using scikit-learn (Pedregosa et al., 2011) and the RAKEL method was tested using MEKA4.

3 http://ilk.uvt.nl/frog/
4 http://meka.sourceforge.net/


Table 8: The initial search grid for testing different combinations of features and classifier parameters for each linear Support Vector Machine (SVM).

Feature Value

Unigrams [yes, no]

Threshold for the unigrams [2, 3, 4, 5]

Bigrams [yes, no]

Threshold for the bigrams [2, 3, 4, 5]

Hashtags [yes, no]

Threshold for the hashtags [2, 3, 4, 5]

Emoticons [yes, no]

POS-unigrams [yes, no]

Threshold for the POS-unigrams [2, 3, 4, 5]

POS-bigrams [yes, no]

Threshold for the POS-bigrams [2, 3, 4, 5]

Classifier parameter Value

Regularization/penalty [L1, L2]

α [0.0001, 0.001, 0.01]

3.5.1 The Binary Relevance Method

The binary relevance method reduces the multi-label classification problem to a single binary classifier per emotion category (i.e., each classifier is trained to distinguish one emotion category from all others). A linear Support Vector Machine (SVM) with stochastic gradient descent (SGD) learning was trained for every emotion category.
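A minimal sketch of this setup with scikit-learn is given below; the feature extraction is simplified to a single tf-idf vectorizer per pipeline, and the parameter values are illustrative, not the tuned values reported later.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline

EMOTIONS = ["anger", "fear", "interest", "joy", "sadness", "surprise", "disgust/contempt", "love"]

def train_binary_relevance(tweets, labels):
    # labels maps each emotion to a list of 0/1 values, one per tweet.
    classifiers = {}
    for emotion in EMOTIONS:
        clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True, min_df=5),
                            SGDClassifier(loss="hinge", penalty="l1", alpha=0.01))
        clf.fit(tweets, labels[emotion])     # one binary linear SVM (SGD) per emotion
        classifiers[emotion] = clf
    return classifiers

def predict_emotions(classifiers, tweet):
    # A tweet receives every label whose classifier predicts the positive class.
    return {emotion for emotion, clf in classifiers.items() if clf.predict([tweet])[0] == 1}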

Different combinations of features and classifier parameters were tested for each SVM. Table 8 shows the initial search grid. Since it was not possible to test all the combinations defined by this search grid for each SVM5, a different approach was taken which involved testing seven sets of combinations.

The first set of combinations that were tested for each SVM included all the unigrams and bigrams as features (i.e., the threshold for unigrams and bigrams was set to one), regularization ∈ {L1, L2}, and α ∈ {0.0001, 0.001, 0.01}. For every combination, an overall F1 score was computed by averaging over the F1 scores of the SVMs, which in turn were computed by averaging over ten repeats of five-fold cross-validation on the annotated final dataset.
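The evaluation protocol described above could be sketched as follows; the untrained per-emotion pipelines are assumed to be built as in the previous sketch, and the scoring is the positive-class F1.

import numpy as np
from sklearn.model_selection import RepeatedKFold, cross_val_score

def overall_f1(pipelines, tweets, labels):
    # Ten repeats of five-fold cross-validation per emotion, averaged into one overall score.
    cv = RepeatedKFold(n_splits=5, n_repeats=10, random_state=0)
    per_emotion = [cross_val_score(clf, tweets, labels[emotion], cv=cv, scoring="f1").mean()
                   for emotion, clf in pipelines.items()]
    return float(np.mean(per_emotion))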

5: 2 x 5 x 2 x 5 x 2 x 5 x 2 x 2 x 5 x 2 x 5 x 2 x 3 = 1.2e6 combinations for each linear Support Vector Machine (SVM). Of course, there are lots of redundant combinations, for example: when a combination has been tested where 'Unigrams' = 'no' and 'Threshold for the unigrams' = 1, then a combination where 'Unigrams' = 'no' and all other features/parameters are the same except 'Threshold for the unigrams' is redundant, since unigrams are excluded from the feature set when 'Unigrams' = 'no'.


The second set of combinations that were tested for each SVM included all the unigrams, bigrams, and hashtags as features, the best regularization parameter from the first set of combinations (i.e., the regularization parameter used in the combination from set 1 with the highest overall F1 score), and α ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1}.

The third set of combinations included all the unigrams, bigrams, hashtags, and emoticons as features, the best regularization parameter from the first set of combinations, and α ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1}.

The fourth set of combinations included all the unigrams, bigrams, POS-unigrams, and POS-bigrams as features, the best regularization parameter from the first set of combinations, and α ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1}.

The fifth set of combinations included all the features (i.e., unigrams, bigrams, hashtags, emoticons, POS-unigrams, and POS-bigrams), the best regularization parameter from the first set of combinations, and α ∈ {0.00001, 0.0001, 0.001, 0.01, 0.1}.

For the sixth set of combinations, the best combination from the fourth set was chosen (i.e., the one with the highest overall F1 score) and the threshold for all the features in this combination was ranged over {2, 3, 4, 5} (i.e., the same threshold for every feature).

Finally, for the seventh set of combinations, the best combination from the fifth set was chosen and the threshold for all the features in this combination was ranged over {2, 3, 4, 5}.

Table 9 gives an overview of the seven sets of combinations that were tested for each SVM. The results are presented in Section 4.

3.5.2 The Random k-Labelset Method

The random k-labelset (RAKEL) method is a problem reduction method that allows for learning dependencies/correlations between labels. Given a set of labels L, RAKEL first creates an ensemble of m k-labelsets. A k-labelset is a subset of L with cardinality k. Given the label set that is used in this work, the subset {Joy, Love} is a k-labelset with k = 2. The m k-labelsets are randomly selected from all the k-labelsets on L without replacement. m and k are user-specified parameters.

For every selected k-labelset, a label powerset classifier is realised. A label powerset classifier is a binary classifier that is trained on data where instances that contain the labels in the k-labelset are positive, and all the other instances are negative. RAKEL thus creates an ensemble of m label powerset classifiers.

A new instance is classified using a voting scheme. Every label powerset classifier provides a binary decision for the labels in its corresponding k-labelset, for example: if the label powerset classifier associated with the k-labelset {Joy, Love} classifies the new instance as belonging to that k-labelset, the labels 'Joy' and 'Love' each get a vote. RAKEL then computes the average decision for each label in L. If the average decision for a label is above some threshold t, the label is predicted for the new instance, thus allowing the prediction of multiple labels. t is a user-specified parameter.
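A minimal sketch of this scheme, following the description above (a 'label powerset classifier' is represented here by any binary classifier with fit/predict; k, m, and the threshold t are the user-specified parameters):

import random
from itertools import combinations

def train_rakel(X, Y, labels, make_classifier, k=2, m=14, seed=0):
    # Y maps each label to a list of 0/1 values, one per instance.
    rng = random.Random(seed)
    ensemble = []
    for labelset in rng.sample(list(combinations(labels, k)), m):   # m random k-labelsets, no replacement
        y = [int(all(Y[lab][i] for lab in labelset)) for i in range(len(X))]
        clf = make_classifier()
        clf.fit(X, y)
        ensemble.append((labelset, clf))
    return ensemble

def predict_rakel(ensemble, x, labels, t=0.5):
    votes = {lab: [] for lab in labels}
    for labelset, clf in ensemble:
        decision = int(clf.predict([x])[0])
        for lab in labelset:
            votes[lab].append(decision)          # each label in the k-labelset gets a vote
    return {lab for lab, v in votes.items() if v and sum(v) / len(v) > t}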


Table 9: An overview of the seven sets of combinations that were tested for each linear Support Vector Machine in the binary relevance method. *: is (transitively) the best from set 1.

Set 1: unigrams, bigrams; threshold for all features: {1}; regularization: {L1, L2}; α: {0.0001, 0.001, 0.01}
Set 2: unigrams, bigrams, hashtags; threshold: {1}; regularization: best from set 1; α: {0.00001, 0.0001, 0.001, 0.01, 0.1}
Set 3: unigrams, bigrams, hashtags, emoticons; threshold: {1}; regularization: best from set 1; α: {0.00001, 0.0001, 0.001, 0.01, 0.1}
Set 4: unigrams, bigrams, POS-unigrams, POS-bigrams; threshold: {1}; regularization: best from set 1; α: {0.00001, 0.0001, 0.001, 0.01, 0.1}
Set 5: unigrams, bigrams, hashtags, emoticons, POS-unigrams, POS-bigrams; threshold: {1}; regularization: best from set 1; α: {0.00001, 0.0001, 0.001, 0.01, 0.1}
Set 6: unigrams, bigrams, POS-unigrams, POS-bigrams; threshold: {2, 3, 4, 5}; regularization: best from set 4*; α: best from set 4
Set 7: unigrams, bigrams, hashtags, emoticons, POS-unigrams, POS-bigrams; threshold: {2, 3, 4, 5}; regularization: best from set 5*; α: best from set 5


Before the random k-labelset method was tested, the correlations between the labels in the annotated final dataset were computed, which are shown in Table 10. Since there are labels that positively correlate with one another, it was decided to test the random k-labelset method with k = 2. Furthermore, it was decided to set m to fourteen: the number of 2-labelsets in the annotated final dataset. It is not possible to set the value of the t parameter in MEKA; the threshold is automatically calibrated (default option), with the possibility to automatically calibrate a threshold per labelset. The default option was chosen. Since it was not possible to use a linear Support Vector Machine with stochastic gradient descent learning as RAKEL's base learner, it was decided to use a linear Support Vector Machine with sequential minimal optimization with C = 1.0.


Table 10: The correlations between the labels in the annotated final dataset.

Label                  (1)   (2)     (3)     (4)     (5)     (6)     (7)     (8)
(1) Disgust/contempt   -     0.017   0.514   -0.033  -0.158  0.001   -0.030  -0.224
(2) Fear                     -       0.111   0.002   -0.043  0.338   0.060   -0.058
(3) Anger                            -       -0.016  -0.129  0.131   0.044   -0.174
(4) Interest                                 -       -0.016  0.071   0.152   -0.010
(5) Love                                             -       -0.054  0.029   0.460
(6) Sadness                                                  -       0.115   -0.073
(7) Surprise                                                         -       0.068
(8) Joy                                                                      -
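As an illustration, pairwise label correlations such as those in Table 10 can be computed from the 0/1 label indicator vectors; the sketch below uses the Pearson correlation, which is an assumption since the correlation measure used in this work is not stated.

import numpy as np

def label_correlations(Y):
    # Y maps each label to a 0/1 numpy array over the tweets.
    labels = list(Y)
    corr = np.corrcoef(np.vstack([Y[lab] for lab in labels]))
    return labels, corr

# Tiny illustrative example with three labels over five tweets.
Y = {"Joy": np.array([1, 0, 1, 0, 1]),
     "Love": np.array([1, 0, 1, 0, 0]),
     "Anger": np.array([0, 1, 0, 1, 0])}
labels, corr = label_correlations(Y)
print(labels)
print(np.round(corr, 3))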


The best combination from the binary relevance method was selected (i.e., the one with the highest overall F1 score) and the features that are included in this combination are the ones that were used by RAKEL. Due to time constraints, it was not possible to test different combinations of features and classifier parameters (as with the binary relevance method). RAKEL's F1 score was computed using five-fold cross-validation on the annotated final dataset. The results are presented in Section 4.

4 Results

The goal of this work is to develop an emotion classification system (i.e., a classifier) for ThemeStreams that can detect and classify multiple emotions in Dutch political tweets. The previous section described the approach that has been taken to achieve this goal. This section describes the results of applying the two algorithms described in Section 3.5 to the annotated final dataset described in Section 3.2.

The F1 scores that are reported in this section are traditional F1 scores (unless stated otherwise), that is, they are computed as the harmonic mean of precision and recall: F1 = 2 · (precision · recall) / (precision + recall), where precision = true positives / (true positives + false positives) and recall = true positives / (true positives + false negatives). A tweet belongs to the positive class if it has the emotion label in question associated with it.

Seven sets of combinations of features and classifier parameters were tested for the binary relevance method, which are shown in Table 9. An overall accuracy and F1 score was computed for every combination by averaging over the accuracies and F1 scores of the Support Vector Machines (SVMs), which in turn were computed by averaging over ten repeats of five-fold cross-validation on the annotated final dataset. The best combination from every set is reported in Tables 11-13. The best combination from every set is the one with the highest overall F1 score. Accuracy scores are not considered the main evaluation metric, since the annotated final dataset is (highly) unbalanced, resulting in accuracy scores that tend to overestimate performance.

Tables 11-13 show that set 6 includes the best combination of features and classifier parameters. This combination includes unigrams, bigrams, POS-unigrams, and POS-bigrams, all occurring in at least five different tweets, as features, L1 regularization, and an α of 0.01. The combination results in an overall F1 score of 0.209, with a standard deviation of 0.147. The standard deviation is rather high, since the F1 score per emotion category varies a lot.

Tables 11-13 also show that the binary relevance method does not learn the 'Fear' label at all. The problem with the 'Fear' label is that it only occurs nine times in the annotated final dataset (399 tweets), which is too little for the corresponding SVM to learn the label. As a result, the SVM only predicts the absence of the label (with very few exceptions), resulting in a high accuracy score that overestimates the performance of the SVM.

Figure 4 shows a scatter plot of the F1 score per emotion category of the best combination from the binary relevance method against the percentage label frequency of that emotion category (i.e., (absolute label frequency of the emotion category / total number of labels) × 100, see Table 4). The figure shows that the F1 score increases with the percentage label frequency.

Table 11: The (overall) accuracies and F1 scores for the best combinations from sets 1-3. *: regularization = L1, α = 0.01. **: regularization = L1, α = 0.001. ***: regularization = L1, α = 0.001. See Table 9 for the features that are included in each combination.

                   Best combination set 1.*          Best combination set 2.**         Best combination set 3.***
                   Accuracy        F1                Accuracy        F1                Accuracy        F1
Anger              0.734 ± 0.012   0.194 ± 0.033     0.746 ± 0.008   0.194 ± 0.037     0.749 ± 0.010   0.180 ± 0.044
Fear               0.959 ± 0.006   0.000 ± 0.000     0.945 ± 0.007   0.008 ± 0.024     0.947 ± 0.009   0.000 ± 0.000
Interest           0.634 ± 0.016   0.419 ± 0.029     0.652 ± 0.012   0.439 ± 0.026     0.645 ± 0.026   0.427 ± 0.033
Joy                0.822 ± 0.012   0.220 ± 0.053     0.818 ± 0.011   0.214 ± 0.038     0.816 ± 0.013   0.224 ± 0.026
Sadness            0.941 ± 0.008   0.122 ± 0.067     0.936 ± 0.007   0.147 ± 0.080     0.936 ± 0.008   0.131 ± 0.082
Surprise           0.832 ± 0.011   0.139 ± 0.043     0.835 ± 0.012   0.116 ± 0.031     0.841 ± 0.012   0.127 ± 0.025
Disgust/contempt   0.634 ± 0.017   0.358 ± 0.032     0.621 ± 0.023   0.351 ± 0.035     0.630 ± 0.015   0.361 ± 0.033
Love               0.880 ± 0.012   0.089 ± 0.030     0.877 ± 0.008   0.084 ± 0.047     0.874 ± 0.005   0.087 ± 0.056
Overall            0.804 ± 0.119   0.193 ± 0.130     0.804 ± 0.114   0.194 ± 0.132     0.805 ± 0.114   0.192 ± 0.133


Table 12: The (overall) accuracies and F1 scores for the best combinations from sets 4-6. *: regularization = L1, α = 0.01. **: regularization = L1, α = 0.01. ***: threshold for all the features = 5, regularization = L1, α = 0.01. See Table 9 for the features that are included in each combination.

                   Best combination set 4.*          Best combination set 5.**         Best combination set 6.***
                   Accuracy        F1                Accuracy        F1                Accuracy        F1
Anger              0.713 ± 0.011   0.214 ± 0.038     0.710 ± 0.012   0.220 ± 0.033     0.711 ± 0.019   0.224 ± 0.033
Fear               0.945 ± 0.009   0.000 ± 0.000     0.954 ± 0.012   0.000 ± 0.000     0.946 ± 0.008   0.000 ± 0.000
Interest           0.632 ± 0.020   0.441 ± 0.024     0.629 ± 0.018   0.441 ± 0.033     0.652 ± 0.015   0.478 ± 0.025
Joy                0.805 ± 0.013   0.242 ± 0.041     0.812 ± 0.011   0.276 ± 0.045     0.809 ± 0.016   0.296 ± 0.039
Sadness            0.930 ± 0.013   0.140 ± 0.057     0.928 ± 0.008   0.094 ± 0.035     0.924 ± 0.014   0.069 ± 0.074
Surprise           0.810 ± 0.018   0.136 ± 0.027     0.813 ± 0.014   0.142 ± 0.042     0.817 ± 0.014   0.151 ± 0.050
Disgust/contempt   0.597 ± 0.020   0.321 ± 0.026     0.604 ± 0.020   0.327 ± 0.037     0.588 ± 0.012   0.339 ± 0.017
Love               0.854 ± 0.010   0.069 ± 0.030     0.857 ± 0.010   0.102 ± 0.054     0.857 ± 0.010   0.111 ± 0.047
Overall            0.786 ± 0.121   0.195 ± 0.132     0.788 ± 0.122   0.200 ± 0.134     0.788 ± 0.119   0.209 ± 0.147

Table 13: The (overall) accuracy and F1 score for the best combination from set 7. *: threshold for all the features = 2, regularization = L1, α = 0.01. See Table 9 for the features that are included in the combination.

                   Best combination set 7.*
                   Accuracy        F1
Anger              0.692 ± 0.012   0.195 ± 0.031
Fear               0.942 ± 0.010   0.000 ± 0.000
Interest           0.635 ± 0.022   0.456 ± 0.023
Joy                0.796 ± 0.016   0.250 ± 0.049
Sadness            0.922 ± 0.012   0.097 ± 0.073
Surprise           0.823 ± 0.015   0.157 ± 0.070
Disgust/contempt   0.593 ± 0.019   0.332 ± 0.036
Love               0.852 ± 0.012   0.117 ± 0.063
Overall            0.782 ± 0.121   0.200 ± 0.134

The random k-labelset (RAKEL) method was tested using the features that are included in the best combination from the binary relevance method. k was set to two and m was set to fourteen. A linear SVM with sequential minimal optimization was used as RAKEL's base learner with C = 1.0. RAKEL's F1 score was computed using five-fold cross-validation on the annotated final dataset.

Figure 4: A scatter plot of the F1 score per emotion category of the best combination from the binary relevance method against the percentage label frequency of that emotion category.

There are two important points regarding RAKEL's F1 score. First, MEKA outputs micro- and macro-averaged F1 scores. For the micro-averaged F1 score, precision and recall are calculated by counting the total true positives, false positives, and false negatives over all classes (the micro-averaged F1 score is the harmonic mean of these two figures). For the macro-averaged F1 score, precision and recall are calculated per class, and then averaged over the classes (the macro-averaged F1 score is the harmonic mean of the average precision and recall). However, for the binary relevance method, the F1 score was computed over the positive class only (i.e., the emotion label is present), since incorporating the negative class (i.e., the emotion label is not present) results in F1 scores that tend to overestimate performance (since the annotated final dataset is (highly) unbalanced).
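The difference between the positive-class F1 used for the binary relevance method and a micro-averaged F1 can be illustrated with scikit-learn; the gold and predicted labels below are made up.

from sklearn.metrics import f1_score

y_true = [0, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 0, 1, 0, 0, 1, 0, 1]

print(f1_score(y_true, y_pred, pos_label=1))        # F1 over the positive class only: about 0.667
print(f1_score(y_true, y_pred, average="micro"))    # micro-averaged F1 over both classes: 0.75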

Second, MEKA does not output the F1 score per emotion category. It does output precision and recall per emotion category, but since it is not known whether these are computed according to the micro- or macro-averaged method, it is not possible to reliably compute the F1 score per emotion category as the harmonic mean of precision and recall.

In order to compare the binary relevance method to RAKEL, an overall micro-averaged F1 score was computed for the best combination from the binary relevance method. This micro-averaged F1 score was computed by averaging over the micro-averaged F1 scores of the SVMs, which in turn were computed by averaging over ten repeats of five-fold cross-validation on the annotated final dataset. Table 14 compares the binary relevance method to RAKEL.


Table 14: The (overall) micro-averaged F1 score for the binary relevance method and RAKEL. The micro-averaged F1 score per emotion category is missing for RAKEL.

                   Binary relevance method          RAKEL
                   Micro-averaged F1 score          Micro-averaged F1 score
Anger              0.701 ± 0.022                    -
Fear               0.947 ± 0.001                    -
Interest           0.634 ± 0.026                    -
Joy                0.794 ± 0.011                    -
Sadness            0.921 ± 0.007                    -
Surprise           0.808 ± 0.019                    -
Disgust/contempt   0.571 ± 0.016                    -
Love               0.856 ± 0.009                    -
Overall            0.779 ± 0.125                    0.912 ± 0.004

5 Conclusion

This work has described the development of an emotion classification system (i.e., a classifier) for ThemeStreams that can detect and classify multiple emotions in Dutch political tweets. Using data from different installations of ThemeStreams, a hand-labelled dataset of 399 tweets covering eight emotion categories has been realised. Using the binary relevance method as problem reduction method, the developed multi-label classifier achieves an overall F1 score of 0.209 on this dataset. This F1 score has been achieved by testing different combinations of features and classifier parameters, of which the combination that uses unigrams, bigrams, Part-of-Speech unigrams, and Part-of-Speech bigrams, all occurring in at least five different tweets, as features, L1 regularization, and an α of 0.01, results in the highest overall F1 score.

Although Table 14 suggests that the random k-labelset (RAKEL) method outperforms the binary relevance method, too many settings/options in MEKA remain unknown to reliably conclude this. Furthermore, it is not known how the overall micro-averaged F1 score of RAKEL (which has a significantly smaller standard deviation than that of the binary relevance method) is computed.

The number of positive instances per emotion category tends to affect the performance of the developed multi-label classifier. Figure 4 shows that the F1 score per emotion category increases with the percentage label frequency (i.e., the higher the number of positive instances for an emotion category, the higher the F1 score for that emotion category). However, this is true only up to a certain value, since the presence of the emotion category will otherwise prevail.


6 Discussion and Future Research

There are a number of ways in which the results of this work could be improved. First, the distribution of emotions in the annotated final dataset is (highly) unbalanced, e.g., the percentage label frequency of the emotion fear is 1.97% (nine positive instances) and the percentage label frequency of the emotion sadness is 3.07% (fourteen positive instances). As a result, the classifier does not perform well on these emotions. The performance could be improved by oversampling tweets from minority emotions.
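A minimal sketch of the suggested oversampling (naive random duplication of tweets carrying a minority emotion; the target count and function name are illustrative):

import random

def oversample_positives(tweets, labels, target_positives, seed=0):
    # labels is a 0/1 list for one emotion category; duplicate random positive tweets
    # until the positive count reaches target_positives.
    rng = random.Random(seed)
    pairs = list(zip(tweets, labels))
    positives = [p for p in pairs if p[1] == 1]
    extra = [rng.choice(positives) for _ in range(max(0, target_positives - len(positives)))]
    augmented = pairs + extra
    rng.shuffle(augmented)
    return augmented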

Second, for the extraction of unigrams and bigrams over the Part-of-Speech (i.e., POS-unigrams and POS-bigrams), a tagger has been used that is trained on a broad selection of manually annotated Part-of-Speech tagged corpora for Dutch. The use of a tagger that is trained on an annotated (Dutch) tweet corpus could result in higher-quality features that can be used to better predict emotions in the tweets.

Third, for the binary relevance method, seven sets of combinations of features and classifier parameters have been tested. Testing more combinations of features and classifier parameters could result in a combination that improves the performance of the classifier. Furthermore, incorporating more/other features in the feature set could also improve performance (e.g., lexicon-based features, sentiment features).

Finally, it was not possible to reliably compare the results of the random k-labelset (RAKEL) method to the results of the binary relevance method, since too many settings/options in MEKA were unknown. However, the results of RAKEL look promising and more research into MEKA could result in a classifier that outperforms the one that uses the binary relevance method. In addition, implementing RAKEL ourselves might provide an alternative approach.

As future work, we intend to use the developed emotion classification system (i.e., the classifier) to examine how the emotions in the tweets affect the political discussions that are visualized by ThemeStreams, e.g., whether the emotions in the tweets propagate through the political discussions and whether certain topics are more loaded than others. This should provide further insight into the political discourse and the Dutch political landscape.

References

Alm, C. O., Roth, D., and Sproat, R. (2005). Emotions from text: Machine learning for text-based emotion prediction. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pages 579–586, Stroudsburg, PA, USA. Association for Computational Linguistics.


Balabantaray, R. C. et al. (2012). Multi-class twitter emotion classification: A new approach. International Journal of Applied Information Systems, 4(1):48–53.

Bosch, A. v. d., Busser, B., Canisius, S., and Daelemans, W. (2007). An efficient memory-based morphosyntactic tagger and parser for Dutch. LOT Occasional Series, 7:191–206.

Buitinck, L., van Amerongen, J., Tan, E., and de Rijke, M. (2015). Multi-emotion detection in user-generated reviews. In Hanbury, A., Kazai, G., Rauber, A., and Fuhr, N., editors, Advances in Information Retrieval, volume 9022 of Lecture Notes in Computer Science, pages 43–48. Springer International Publishing.

Danisman, T. and Alpkocak, A. (2008). Feeler: Emotion classification of text using vector space model. In AISB 2008 Convention Communication, Interaction and Social Intelligence, volume 1, page 53.

de Rooij, O., Odijk, D., and de Rijke, M. (2013). Themestreams: Visualizing the stream of themes discussed in politics. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’13, pages 1077–1078, New York, NY, USA. ACM.

Ekman, P. (1992). An argument for basic emotions. Cognition & emotion, 6(3-4):169–200.

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., et al. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825–2830.

Shaver, P., Schwartz, J., Kirson, D., and O'Connor, C. (1987). Emotion knowledge: Further exploration of a prototype approach. Journal of Personality and Social Psychology, 52(6):1061.

Tsoumakas, G. et al. (2007). Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1–13.

Tsoumakas, G. and Vlahavas, I. (2007). Random k-labelsets: An ensemble method for multilabel classification. In Machine learning: ECML 2007, pages 406–417. Springer.

Wang, W., Chen, L., Thirunarayan, K., and Sheth, A. P. (2012). Harnessing Twitter "big data" for automatic emotion identification. In Privacy, Security, Risk and Trust (PASSAT), 2012 International Conference on and 2012 International Conference on Social Computing (SocialCom), pages 587–592. IEEE.
