
A Sentiment Analysis of Civil Tweets for Crisis Management Evaluation

Author: Leonardo Losno Velozo

Study: Information Science

University of Groningen


A Sentiment Analysis of Civil Tweets for Crisis Management Evaluation

Author: Leonardo Losno Velozo

Study: Information Science

University of Groningen

Course: Master Thesis

Subject: Sentiment Analysis for Crisis Management Evaluation

Supervisor: L.M. Bosveld-de Smet

Date: 12 January 2018


Preface

This paper is my Master's thesis in Information Science at the University of Groningen. The research subject originates from my internship at 'Veiligheidsregio Fryslân'. At the time of my internship they were researching how to incorporate the citizens' perception into the evaluation of crises. Because of my experience with digital data and communication, they asked me to research the possibilities of using social media to map the perception of citizens. Because of my special interest in Machine Learning, I decided to focus on a Sentiment Analysis using Supervised Machine Learning.

I would like to thank my thesis supervisor Leonie Bosveld-de Smet for the support and feedback during the preparation and creation of this thesis. I would also like to thank Malvina Nissim and Rob van der Goot for their support during the preparation of my thesis proposal. Finally, I would like to thank 'Veiligheidsregio Fryslân', in particular Rika Leijstra, for providing the subject of my thesis and for the support during the preparation of my thesis proposal.


Abstract

After the occurrence of a crisis, Dutch Security Regions evaluate it in order to improve the performance of crisis management. Currently, this is done by only taking the expertise of the responsible crisis organizations into account. However, the Security Regions acknowledge the importance of adding the perception of citizens to this phase. Therefore, they are interested in the usefulness of social media for mapping the perception of citizens. Mapping this information manually is practically impossible because of the substantial number of crisis messages that are sent during crises. That is why we decided to focus on a Supervised Machine Learning solution. A Sentiment Analysis (SA) fits this need perfectly, since it is the automatic process of mining people's expressions (e.g. opinions, emotions and attitudes) from text through Natural Language Processing (NLP). On this basis, we decided to perform a SA on tweets using a Support Vector Machine (SVM) algorithm with term frequency-inverse document frequency (TF-IDF) n-grams (i.e. unigrams, bigrams, trigrams) as feature set, to find out whether the citizens' perceptions during crises can be mapped in this way.

Our SA is performed on sentence-level and consists of two tasks (i.e. subjectivity and polarity classification) which are performed separately. The subjectivity classes are subjective (sub) and objective (obj), and the polarity classes are positive (pos), neutral (neu) and negative (neg). Combining these tasks results in six final output classes (i.e. obj-pos, obj-neu, obj-neg, sub-pos, sub-neu, sub-neg), which are assigned in a one-to-one relation per sentence. Even though in most SA studies the objectivity compounds are discarded when a neutral class is added, we preserve both subjectivity compounds. In this way we aim to distinguish between sentences in which a sentiment is expressed explicitly (i.e. subjective compounds) and sentences that attempt to provoke a certain sentiment without it being expressed explicitly in the message (i.e. objective compounds). This classification approach aligns with the primary objectives of crisis communication, which are focused on the information needs of citizens during crises. That is why we assume it also aligns with the perception of citizens during crises. Moreover, merging all sentences' compound classes per tweet results in a classification on tweet-level. After applying certain rules (e.g. all objective compounds are discarded if a tweet includes a subjective compound), this results in 14 final compound tweet-classes.

We aim to influence the performance of the system by varying specificity and corpus size. Since there is a trade-off between the two, we intend to overcome this phenomenon by normalizing the data in different ways. We found different performance trends which confirm the trade-off statement: performance increases with more specific and larger corpora; normalization improves the performance for all data compositions; and the largest improvement (i.e. between raw and normalized data) appears in data compositions that train with data of previous crises, excluding the data of the actual crisis. A logical conclusion is that large, specific, normalized corpora result in the best performing systems. For example, the best performing subjectivity system has 87% of its predictions classified correctly (i.e. Precision = 0.87). At the same time, on average, 86% of all sentence classes are predicted correctly (i.e. Recall = 0.86). Finally, the F1-score, which is the weighted harmonic mean of precision and recall, is 0.86. The performances of the other best performing systems are: sentence-level polarity (P = 0.74, R = 0.72, F = 0.70), sentence-level compound class (P = 0.65, R = 0.63, F = 0.62), and tweet-level compound class (P = 0.60, R = 0.57, F = 0.56).


Contents

Preface
Abstract
1. Introduction
2. Background
2.1 Dutch Crisis Management
2.1.1 Organization
2.1.2 Crisis communication
2.1.3 Crisis Definition: GRIP
2.1.4 Crisis Types
2.2 Twitter
2.3 Machine Learning
2.3.1 Basic principle and terms
2.3.2 ML Techniques
2.3.3 Supervised learning algorithms
2.3.4 Evaluation
2.3.5 Trade-off between specificity and corpus size
2.4 Sentiment Analysis
2.4.1 Classification tasks: subjectivity and polarity
2.4.2 Annotation Process
2.4.3 Topic-focus structure
2.4.4 Preprocessing
2.4.5 Feature selection
2.4.6 Algorithms and Performance
3. Data
3.1 Data domain: Crisis tweets
3.2 Data collection process
3.2.1 Filter relevant documents
3.2.2 Filter unique tweets
3.2.3 Sentence split
4. Method
4.1 General Overview
4.2 Operationalization and Annotation Process
4.2.1 Sentence-level
4.2.2 Tweet-level
4.3 Tokenization
4.4 Normalization
4.5 Evaluation
4.5.1 Data compositions
4.5.2 Five-fold cross validation
5. Results
5.1 Sentence-level classification
5.1.1 Subjectivity
5.1.2 Polarity
5.1.3 Compound: subjectivity and polarity
5.2 Tweet-level Classification
6. Conclusion & Discussion
7. Future Work
8. Recommendations for Security Regions
9. References
Appendix 1: Codebook
Appendix 2: Annotation Complexities on Sentence-Level
Appendix 3: Data Distribution
A.3.1 Sentences per Tweet
A.3.2 Tweet-level: Gold-standard Classes per Tweet
A.3.3 Sentence-level: Gold-standard Classes per Sentence
A.3.4 Classes per Tweet in Detail
Appendix 4: Sentence-Level: Objectivity Results
Appendix 5: Sentence-Level: Polarity Results
Appendix 6: Sentence-Level: Compound Class Results


1. Introduction

The massive increase of user-generated data on the internet provides opportunities for both private and public organizations to detect public perceptions. Extracting attitudes, opinions and sentiments is one way to achieve this. This can be especially useful for information analysis in government, commercial and political domains. For public organizations the main interest concerns civil experience. How do people feel about a specific event? What is the range of feelings and opinions being expressed? When do these feelings and opinions appear and intensify? Which kinds of actions are taken by the public? What are their information needs? Which sources do they trust? Discovering answers to these kinds of questions in a large amount of digital data can be very time-consuming when done by hand. That is why there has been an increased interest in the automatic identification and extraction of this kind of information from text.

One public domain where the automatic classification of attitudes, opinions and sentiments can be very useful is crisis management. The main entities responsible for crisis management in the Netherlands are the so-called security regions. They are engaged in the strategic and operational effectiveness of crisis management. Part of their responsibility is to evaluate the effectiveness of crisis management after each crisis, aiming to improve it. At this moment this is done by only taking the expertise of the responsible crisis organizations into account. However, the civil experience during a crisis can provide very useful additional information.

One way of detecting civil experience is by looking at social media messages. Security regions are interested in finding out whether these messages contain relevant information that can be used in the evaluation phase, and if so, how to automatically identify and extract this information. A Sentiment Analysis (SA) seems to fit these needs perfectly, as it involves the automatic process of mining opinions, emotions, views and attitudes from speech and text through Natural Language Processing (NLP). On this basis the following research question is formulated:

RQ: To what extent can civil experience during a crisis be automatically detected by performing a SA on Twitter data, so that it can be used as input for crisis evaluation by Dutch security regions?

In the following paper we will first discuss related work on crisis communication and sentiment analysis. Second, we will describe our data and explain some pre-processing we performed on it. Next, we will describe our method, including the operationalization process, model choice, normalization process, feature choice and evaluation approach. Finally, after reviewing our results, we will give our conclusion and propose some future work.


2. Background

In this study we will apply a Sentiment Analysis to different compositions of Twitter data, aiming to represent the experience of citizens during (Dutch) crises. Since a lot of data is shared by citizens during crises, a manual analysis is very time-consuming and therefore almost impossible for crisis management to perform in practice. That is why we will perform an automatic analysis based on a Supervised Machine Learning system. In doing so, we will use data compositions of different sizes and specificity to investigate the possibilities of reusing the data of other crises to improve the performance of the analysis of a new crisis.

In short, this means that our study can be subdivided into the following parts, which will be explained in more detail in the following subsections:

- Crisis management and crisis communication in the Netherlands (2.1)
- Twitter main characteristics (2.2)
- Machine Learning (2.3)
- Sentiment Analysis (2.4)

2.1 Dutch Crisis Management

2.1.1 Organization

The main entities responsible for crisis management in the Netherlands are the so-called Security Regions (SR). A SR is a geographically based network organization in which municipalities and emergency services work together with other organizations (e.g. vital public and private organizations that can be crucial partners during crisis) to guarantee public health, emergency response and crisis management.


SRs are engaged in both the strategic and operational effectiveness of crisis management. Their role is mainly facilitative, and their responsibilities are diverse:

- serve as a center of expertise on crisis management
- facilitate preparation before a crisis (e.g. train crisis officials, compose crisis plans)
- support cooperation between organizations in all crisis phases (i.e. before, during, after)

2.1.2 Crisis communication

One of the crucial responsibilities during a crisis is crisis communication. Crisis communication is concerned with the public's information needs. These information needs are related to the three primary objectives of crisis communication as described by Regtvoort and Siepel (2009): information provision, harm reduction and sense making. Information provision is focused on general factual information related to a crisis. Information concerning harm reduction consists of instructions to limit damage. Finally, sense making is related to people's sentiments, emotions and concerns. In short, the first two are focused on objective information about a crisis, and the last one on subjective information related to personal expression.

2.1.3 Crisis Definition: GRIP

Table 1: Scaling levels (Regtvoort & Siepel, 2009)

GRIP-level | Scope of incident              | Responsible organization(s)
GRIP 0     | Local incident, daily routine  | Monodisciplinary consultation
GRIP 1     | Source control                 | Commando Source Area (COPI)
GRIP 2     | Source and effect control      | Regional Operational Team (ROT)
GRIP 3     | Civil health threat            | Municipal Policy Team (GBT)
GRIP 4     | Municipality-crossing incident | Regional Policy Team (RBT)
GRIP 5     | Superregional incident         | Interregional Policy Team (IRBT)
GRIP RIJK  | Threat to national security    | Ministerial Commission Crisis Control (MCCB)

Crisis is a term that is used in a wide range of subjects (e.g. psychology, economics, politics, environment). In our case, however, it is strongly related to the work of emergency services. To be more specific, crisis communication experts Regtvoort and Siepel (2009) define the term as the disturbance of daily life due to an unexpected event which threatens the health, safety or wellbeing of citizens. For our study this definition is still too general: we need a more specific definition that covers the extent of the (civil) impact of a crisis. This is why we will use GRIP (i.e. Gecoördineerde Regionale Incidentbestrijdings Procedure) as a starting point.


GRIP is a Dutch method to optimize the multidisciplinary cooperation between government, emergency services, companies and other involved organizations during crises. That is to say, when multidisciplinary cooperation is needed, these organizations have to switch from their own daily work to a coordinated organization that can act quickly and decisively. GRIP facilitates this switch by defining several scaling levels which are linked to administrative and operational authorities and responsibilities during crises (Regtvoort & Siepel, 2009). Table 1 shows an overview of all GRIP levels, the corresponding scope of the incident and the responsible organizations. To make this more concrete, consider the different GRIP levels for a fire incident. A regular fire which is mainly managed by the fire department is a GRIP 0 incident. When a more intense cooperation between the emergency services is needed, the GRIP level is scaled up to 1. This can be the case, for example, because of the size of the fire or the kind of location (e.g. a prison, train station, school, asylum center). If a larger area than the source area is affected by the fire and more tactical measures are needed, GRIP is scaled up to level 2, for example if people of surrounding areas need to be informed because of the spread of toxic smoke. GRIP 3 applies when there are drastic consequences for society, for example many casualties or lots of damage. If a fire affects other municipalities there is a scale-up to GRIP 4, and if it affects more provinces there is a scale-up to GRIP 5; in those cases there is a need for multidisciplinary and administrative coordination. The final level is GRIP RIJK. This applies when involvement of the state is needed because of a threat to national security (e.g. if the fire is a consequence of a terrorist attack).

2.1.4 Crisis Types

Now that we have defined the term crisis, we need to distinguish different types of crises. In this way we aim to group crises whose experiences are described with a similar vocabulary. After all, we seek to reproduce the citizens' perception during a crisis based on a text-based analysis. Regtvoort & Siepel (2009) make such a distinction, summarizing 19 types of disasters which are similar to each other in terms of impact and development:

- aviation accidents
- accidents on water
- traffic accidents on land
- accidents with combustible/explosive substances in the open air
- accidents with toxic substances in the open air
- nuclear accidents
- threats to public health
- epidemics
- accidents in tunnels
- fires in big buildings
- collapses of big buildings
- failure of utilities
- panic in crowds
- large-scale public disturbances
- floods
- wildfires
- extreme weather
- disasters at a distance
- terrorist threats

2.2 Twitter

Table 2: Twitter characteristics (Stieglitz & Dang-Xuan, 2012)

Characteristic      | Main characteristics
Platform            | single-author messages, 140 characters per message, no facility for post-editing, following relations, real-time network
Usage               | sharing news, promoting political views, marketing, tracking real-time events
Content             | phrases, quick comments, images, videos, links
User-accepted norms | @username: mark addressee; Retweet: resend tweets using 'RT'/'via'/'by' followed by @username; #word: tag and search tweets of specific topics

Our text-based analysis will be based on Twitter data. Twitter is a social media micro-blogging platform that is mainly characterized by documents (i.e. tweets) of up to 140 characters in length which are not editable and are composed by single authors. The platform consists of a real-time network with follower-following relations between users, where mostly topics of common interest are shared in a quick way. Stieglitz & Dang-Xuan (2012) point out that it is said to be the most popular microblogging platform, and that it is mainly used for sharing news, promoting political views, marketing and tracking real-time events. Twitter is also characterized by some popular user-accepted norms, like marking an addressee with the @-character, resending tweets (i.e. retweets) and tagging tweets with a hashtag (#). Table 2 summarizes these main characteristics, based on the work of Stieglitz & Dang-Xuan (2012).


Another general but important characteristic concerns the noisiness of tweets. Baldwin et al. (2013) investigated the noisiness of different social media messages (including Twitter). Their main findings about Twitter were: sentence length is between 6 and 9 words on average, half that of more formal social media (e.g. Wikipedia); informal text and 'ad hoc' spellings are more prevalent on Twitter than on other platforms; Twitter is the most likely to contain ungrammatical sentences; and Twitter includes reasonably homogeneous data (i.e. intrinsically similar style and content) across time, despite its real-time nature.

Additionally, Rao et al. (2010) identify several challenges arising from the noisiness of Twitter posts. First of all they mention the limited size (i.e. sparsity and limited post length) of tweets, which makes automatic classification more difficult because of the lack of context. Secondly, tweets are more difficult for a machine to process because of the informal nature of the writing (e.g. slang, abbreviations, spelling mistakes). A third challenge is the lack of prosodic cues (e.g. intonation, tone, stress and rhythm) combined with the short amount of text, which makes messages difficult to interpret (e.g. humour and sarcasm). The final challenge has to do with the departure from traditional sociolinguistic cues (e.g. utterances like 'uh-huh' and 'umm', and back-channel responses like laughter), which probably do not hold for the relatively new communication genre of micro-blogging. In this case other sociolinguistic cues may apply (e.g. the presence of exclamation-mark sequences or emoticons).

2.3 Machine Learning

2.3.1 Basic principle and terms

Machine learning (ML) is about creating automatically learning computer programs that can improve their performance through experience (Mitchell, 1997). The experience is gained from a collection of real examples (i.e. instances), which are characterized by properties that describe an instance (i.e. a set of features) and a result (i.e. classes). These examples need to be turned into a machine-readable format, called vectors of feature values. Based on a specific ML technique (i.e. algorithm or model), the machine tries to recognize feature patterns in a subset of instances (i.e. the training set) to create prediction models that can describe unseen examples (i.e. the test set) with a certain reliability.

Example: Is there a high risk of a fire outbreak?
Instances: data of fires per day for 5 years
Features: date, place, temperature, humidity
Result: Yes / No
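To make the notion of feature vectors concrete, here is a minimal sketch (our illustration, not part of the thesis) that encodes instances like the fire-risk example above into machine-readable vectors with scikit-learn; all names and values are invented:

# Minimal sketch (our illustration): turning instances into feature vectors.
from sklearn.feature_extraction import DictVectorizer

instances = [  # invented toy data
    {"month": "jul", "place": "forest", "temperature": 31, "humidity": 20},
    {"month": "nov", "place": "city", "temperature": 8, "humidity": 85},
]
labels = ["yes", "no"]  # the classes a system should learn to predict

vec = DictVectorizer()
X = vec.fit_transform(instances)    # numeric vectors of feature values
print(vec.get_feature_names_out())  # e.g. ['humidity', 'month=jul', ...]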


Mitchell (1997) emphasizes the great practical value of ML algorithms for a variety of application domains. According to him they are especially useful in three specific cases:

1. Data mining problems in big databases, in which valuable implicit regularities can be discovered automatically (e.g. analysis of medical treatment outcomes, based on patient databases).

2. Complex domains where humans are incapable of developing effective algorithms because of a lack of knowledge (e.g. image/audio recognition).

3. Dynamic domains in which programs have to adapt to changing conditions (e.g. manufacturing process control under changing supply stocks).

Our research is focused on the first domain. To be more specific, it covers an analysis that is performed on big text-based databases (i.e. Twitter data). This focus requires a more precise use of certain terms that are often used interchangeably. Therefore we need to clarify some important terms related to our research, namely: channel, message, corpus and document.

First of all we use the model of Berlo (1960, in Pieterson 2009) to define channel and message: "... Berlo (1960) developed the 'S-M-C-R' model, which describes the information flow (message) from a sender via a channel to a receiver. He viewed the channel as the pipeline through which the information is pushed."

In the context of our study this means that the Twitter medium will be referred to as the channel, and the content of a tweet as a message. Secondly, to determine the meaning of document and corpus we will use the definitions of Manning et al. (2008). They describe a document as a unit over which a retrieval system is built (e.g. books, products, e-mails), and a corpus as the group of documents over which retrieval is performed. So in our case the documents are the raw single tweets and the corpus is the group of tweets which will be analyzed.

2.3.2 ML Techniques

ML can be divided into two main techniques: unsupervised learning and supervised learning. In the following subsections we will briefly describe these techniques, emphasizing the main characteristics that distinguish them from each other.

2.3.2.1 Unsupervised Learning

In unsupervised learning there are no documents assigned to predetermined classes. In other words, it tries to find structure in unlabeled data. This technique therefore mainly relies on clustering, that is: grouping a set of documents into subsets (i.e. clusters) which are internally as similar as possible but different from each other. The membership of a cluster is determined by the distribution and makeup of the data, and its key input is the distance measure (e.g. Euclidean distance). As a consequence this technique can be influenced by the features, the number of clusters and the distance measure. Figure 2 shows an example of a data set with three clear clusters, in which the distance measure is the distance in a document space represented in a two-dimensional coordinate system (Manning et al., 2009). The axes could, for example, be a polarity and a subjectivity score, calculated based on a word lexicon (i.e. each word is assigned a subjectivity and polarity score) and the presence of those words in a document. In that case the upper-left cluster is subjective-negative, the upper-right cluster subjective-positive and the lower-middle cluster objective-neutral.

[Figure 2: Data set example with 3 clear clusters]
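As a minimal illustration of this clustering idea (ours, with invented scores; the thesis itself applies supervised learning instead), documents described by two lexicon-based scores can be grouped with k-means, which relies on Euclidean distance:

# Minimal sketch (our illustration): k-means clustering of documents
# represented by two invented scores (polarity, subjectivity).
import numpy as np
from sklearn.cluster import KMeans

docs = np.array([
    [-0.8, 0.9], [-0.7, 0.8],  # subjective-negative region
    [0.8, 0.9], [0.9, 0.7],    # subjective-positive region
    [0.0, 0.1], [0.1, 0.0],    # objective-neutral region
])

clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(docs)
print(clusters)  # cluster membership per document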

2.3.2.2 Supervised Learning

On the other hand, in supervised learning the distribution of classes (i.e. labels or categories) is known. These classes are defined by humans and assigned to documents. That is why this technique is called supervised: the learning process is directed by a human supervisor. The required output and the dependent variables define the application of a supervised learning method. In general it can be subdivided into two kinds of problems: regression and classification. For regression problems the dependent variable is numerical and the output is continuous (e.g. predicting house prices depending on the distance to the city center). Classification problems, on the other hand, consist of categorical dependent variables and classes as output (e.g. predicting the road traffic accident risk depending on place, time and weather).

2.3.3 Supervised learning algorithms

A machine learning algorithm defines how exactly instances are observed and how common patterns are found in the data. The selection of an algorithm depends on several factors, among others: data domain, domain knowledge, dependent variables, output variables, and the amount of available data (Manning et al., 2009). In our study we aim to classify tweets into a sentiment class (2.4.1) depending on text-based features. Based on 2.3.2 we can conclude that we are dealing with a

supervised learning problem. The model that implements classification is called a classifier. Examples of classifiers are Naive Bayes (NB), Support Vector Machines (SVM), Neural Networks (NN) and Maximum Entropy (ME). Since the data domain and domain knowledge are important factors for the selection of an algorithm, we will base our choice on the outcome of previous research about Sentiment Analysis (2.4.6).

2.3.4 Evaluation

Each Machine Learning method requires a training set of documents as input and returns a classification function that can be applied to an unseen test set to check its performance. The main statistics that are often used for evaluating machine learning systems are accuracy, precision, recall and F1-score (Manning et al., 2009).

Accuracy. Fraction of correct classifications.
A = correctly predicted instances / total instances

Precision (P_x). Fraction of the predicted results that is correct, for instances assigned to class x.
P_x = correctly predicted instances of class x / predicted instances of class x

Recall (R_x). Fraction of instances that is predicted correctly, for instances belonging to class x.
R_x = correctly predicted instances of class x / total instances of class x

F1-score (F1_x). Weighted harmonic mean of precision and recall.
F1_x = 2 * P_x * R_x / (P_x + R_x)
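These metrics are readily available in standard libraries; the following sketch (ours, with toy labels) computes them with scikit-learn:

# Minimal sketch (our illustration): evaluation metrics on toy labels.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

gold = ["sub", "obj", "sub", "sub", "obj"]
pred = ["sub", "sub", "sub", "obj", "obj"]

print(accuracy_score(gold, pred))  # fraction of correct classifications
p, r, f1, _ = precision_recall_fscore_support(gold, pred, average="weighted")
print(p, r, f1)  # per-class scores, weighted by class frequency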

2.3.5 Trade-off between specificity and corpus size

Finally, in the context of our study it is important to mention the trade-off between the amount of labelled data and specificity. On the one hand, having enough classified examples is crucial for an ML system to be able to learn. On the other hand, the more specific the data are, the better a machine learning system is able to recognize common patterns that explain the data, since the feature vectors will be more similar to each other. After all, the meaning of a word or sentence can depend on the domain (Kharde and Sonawane, 2016). For example, in the crisis domain 'explosive' is a negative word, but in the context of sports it can be a positive word to describe an action. So both specificity and a larger amount of data contribute to a higher performance of an automatic classification system.

But more of one means less of the other. That is to say, if we focus on specificity we need more labelled data about a specific topic, and therefore we cannot reuse labelled data about other topics. At the same time, including labelled data on different topics results in larger, less specific corpora. The necessary minimum size of ML corpora can be very large and the annotation process very time-consuming. Therefore, it would be beneficial to be able to reuse previously annotated data, overcoming the trade-off between the amount of labelled data and specificity. Pennacchiotti and Popescu (2011) show that class-specific models outperform generic topic models. However, researchers like Barbosa and Feng (2010) show that with more generic features (like meta-information and syntactic features) a more abstract representation can be captured, resulting in an effective and robust system for automatic text-analysis tasks. The main advantages of this second approach (in comparison with the specific model) are that it is less sensitive to bias and noise, and needs far less labelled data. Besides, a wider range of topics can be used to train a generic model, saving a huge time cost in manually annotating data.

2.4 Sentiment Analysis

Kharde and Sonawane (2016) define sentiment analysis (SA) as the automatic process of mining opinions, emotions, views and attitudes from speech and text through Natural Language Processing (NLP). They point out that it includes several tasks, such as subjectivity and sentiment classification, sentiment extraction, summarization of opinions, and opinion spam detection, among others. When it comes to Twitter, previous research has even emphasized its capacity for mapping the public's real sentiment (Stieglitz & Dang-Xuan, 2012; Anjaria & Guddeti, 2014). In our paper we will combine two SA tasks: subjectivity and polarity classification.

2.4.1 Classification tasks: subjectivity and polarity

In 2.1.2 we described the primary objectives of crisis communication, and concluded that they can be divided into a binary classification with a subjective (i.e. sense making, involving people's expressions) and an objective (i.e. information provision and harm reduction) class. This is what is called a subjectivity classification: discriminating opinions and other forms of subjectivity from objective factual information (Kharde & Sonawane, 2016). Many SA studies add a polarity classification (i.e. positive, neutral, negative) of the subjective class to the subjectivity classification. This approach inspired us to perform an additional classification to provide a more detailed classification of the citizens' perception.

SA studies which combine the subjectivity and polarity classification tasks differ mainly in the order in which these tasks are implemented and in their choice of output classes. For example, some classified their data in a three-way model with positive, negative and neutral (or non-opinion, or objective) classes (Pak and Paroubek, 2010; Agarwal et al., 2011; Po-Wei Liang et al., 2014; and Pablo et al., 2014, in Kharde & Sonawane, 2016). Others performed a two-phase automatic SA, by first classifying subjectivity, and then classifying the subjective class as positive or negative (Barbosa et al., 2010). Finally, another method we found is that of Pablo et al. (2014, in Kharde & Sonawane, 2016), who built a binary classifier of positive and negative classes, rejecting the data which are considered neutral.

A striking finding concerning the choice of output classes is the relation between the neutral and the objective class. Tsytsarau & Palpanas (2012) mention that, if a neutral class is added, the subjectivity classification is discarded, as the neutral class is treated as the objective class. The research examples mentioned above affirm this statement, showing that no polarity classification is performed on the objective class; this class is treated as fully neutral. This is remarkable, as we can imagine that an objective class can include an implicit polarity. For example, a message about a 'fire taking place' can be interpreted as objective with an implicitly negative polarity, and one mentioning 'no casualties' as objective with an implicitly positive polarity. In this paper we will apply this extended SA classification, since it results in an even more detailed classification of the citizens' perception. More on this will be explained in 4.2.1.

2.4.2 Annotation Process

In 2.3.2.2 we saw that supervised learning needs manually annotated data to train the system in creating prediction models that allow a test set of unseen examples to be classified with a certain reliability. In the context of a SA on crisis tweets, the annotation process needs special attention, because we expect to encounter tweets and sentences with multiple sentiments. In this section we will explain two approaches that can help to overcome this problem: levels of granularity and the topic-focus structure.

2.4.2.1 Levels of granularity

Classification tasks can be performed on different levels of granularity. Kharde & Sonawane (2016) describe four levels of granularity for SA: document level, sentence level, feature level and word level. Document-level classification assigns a class to the whole document; in our case this means that a whole tweet is assigned one class. The classification is performed on sentence level if each sentence is tagged with a class; for us this would require splitting each tweet into sentences and assigning a class to each tweet-sentence. Feature-level classification deals with the extraction of features representing a class (e.g. a sentiment) and the identification of the entity towards which the class is directed. For example, the sentence 'Great work of the firefighters' would result in a subjective-positive class that is directed at the fire department entity. Finally, word-level classification aims to assign a class to each word, to create vocabularies which can be used as input for other tasks. In the context of polarity, adjectives and adverbs are mostly used as features; for example, 'beautiful' and 'gently' can be classified as positive words. But other parts of speech can also be assigned a polarity: for example, the nouns 'fire' and 'party' can be classified as negative and positive respectively.


Our approach in this regard will mainly depend on our desired outcome and the composition of our data. Our classification is aimed at getting an overview of the perception of citizens during crises by performing a SA on civil tweets. Therefore we need a total overview of the sentiments which are expressed during a specific crisis. A logical approach would be to classify on tweet level, as a tweet expresses the sentiment of a user in the context of that tweet. But what if a tweet includes multiple sentiments? Consider, for example, 'Terrible accident. Luckily there are no casualties'. In this case you would like to assign two sentiments to one tweet; otherwise you would miss a sentiment that is expressed. The granularity approach could solve this, for example by classifying on sentence level and combining the sentence classes to represent the sentiment per tweet. A pilot study could give us more insight into the composition of crisis tweets, in order to select an appropriate granularity approach.

2.4.3 Topic-focus structure

Classifying on sentence level could overcome the problem of multiple sentiments per tweet, since most sentences contain one single sentiment. But how should we deal with those sentences that contain multiple sentiments? Take, for example, the sentence 'Strength to the victims of the terrible accident'. In such cases it would be desirable to assign an overall predominant class which represents the intended sentiment expressed in the sentence. The topic-focus structure of Gundel & Fretheim (2004) offers a solution to this problem.

They propose that information structure is essential for the information-processing function of language, as it is an important element of the semantic/conceptual representations that the grammar associates with sentences. The topic-focus structure is a semantic-conceptual representation of a sentence which makes a distinction between given and new information. The topic in this sense is what the sentence is about, and the focus is the predicate about the topic. In other words, the topic is independent in relation to the focus, and the focus is new (information) in relation to the topic. For our annotation process we will consider the topic as the key indicator of a sentiment, since it is the core of the sentence.

Because of the lack of context in tweets we will apply a generic interpretation of a sentence's topic and focus. One of the most generic interpretations is the one where topic and focus coincide with the grammatical subject and predicate. Another general interpretation is the one where the topic is to the extreme left or right of the sentence focus. In the example below the latter interpretation is applied (i.e. in sentence 1 the topic is to the extreme left of the focus, and in sentence 2 to the extreme right). The example also shows how the polarity can change in a similar discourse because of a different topic-focus structure.

Example of a different topic and focus in a similar discourse

Sentence 1: All victims were saved in a terrible accident
Topic: all victims were saved
Focus: terrible accident
Polarity: positive

Sentence 2: Victims were saved in this terrible accident
Topic: terrible accident
Focus: victims were saved
Polarity: negative


2.4.4 Preprocessing

The noisiness of Twitter posts (2.2) and the involvement of data compositions of different crisis types (2.1.4) make the classification of the raw data susceptible to inconsistency and redundancy (Kharde & Sonawane, 2016), which typically results in lower performance of state-of-the-art approaches (Kornek & Šimko, 2014). Preprocessing can overcome these problems, as it cleans the data by extracting relevant elements and making it more uniform.

2.4.4.1 Tokenization

One of the main preprocessing steps is tokenization, which is the process of chopping the character stream of each tweet into relevant tokens (Manning et al., 2008). This process can take place on different levels of granularity, for instance on sentence and word level. Kornek & Šimko (2014) applied both levels of tokenization in their research about sentiment analysis on microblogs, and showed that sentiments are identified more accurately when microblog posts are split into sentences. In formal language, sentence splitting is a straightforward process, since sentences have a clear and uniform pattern (e.g. sentences start with a capital and end with specific punctuation such as a full stop, exclamation mark or question mark). For Twitter it is a more complex process because of its alternative and informal language. In their paper about potential features on Twitter, Rao et al. (2010) summarize some lexical cues which deviate from formal language, some of which are relevant for splitting sentences: for example repeated punctuation ('!!!!!' and '?????'), puzzled punctuation ('!?!?!?') and smileys (e.g. ';)' and ':P'). These examples show that an alternative sentence-split technique must be applied for Twitter: sentences containing one of these examples should not be split after single punctuation marks, but after the whole group of such tokens.
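As a minimal illustration of this alternative splitting rule (our sketch, not the implementation used in the thesis), the regular expression below splits after a whole run of sentence-final punctuation instead of after every single mark:

import re

# Minimal sketch (our illustration): split only after a *group* of
# sentence-final punctuation marks, including mixed runs like '?!'.
SPLIT_RE = re.compile(r"(?<=[.!?])(?![.!?])\s*")

def split_sentences(text: str) -> list[str]:
    return [s for s in SPLIT_RE.split(text) if s.strip()]

print(split_sentences("Oei!!! Gelukkig geen gewonden. Huis afgebrand?!"))
# ['Oei!!!', 'Gelukkig geen gewonden.', 'Huis afgebrand?!']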


Grouping relevant sequences of tokens is what word-level tokenization is about. The difference with formal language here mainly has to do with microblog-specific features like emoticons, slang expressions (e.g. 'w8' for 'wait'), hashtag words (i.e. #word), user mentions (i.e. @user), URLs (e.g. https://www.twitter.nl), repeated characters (e.g. '!!!!'), puzzled punctuation (e.g. '!!!???') and abbreviations (Kornek & Šimko, 2014; Rao et al., 2010). The tokenization process of microblog posts should take these features into account. Another important tokenization strategy has to do with compound sequences which are separated by a specific token but typically belong together (Manning et al., 2008), for example words including an apostrophe (e.g. 'auto's'), compound words containing a compound splitter (e.g. 'Noord-Holland', 'km/h'), and compound numbers containing a compound splitter (e.g. times, dates and telephone numbers).
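A word-level tokenizer respecting these microblog-specific features could look like the following sketch (ours; the token classes mirror the examples above):

import re

# Minimal sketch (our illustration): a microblog-aware word tokenizer that
# keeps URLs, mentions, hashtag words, simple emoticons and compounds intact.
TOKEN_RE = re.compile(r"""
      https?://\S+              # URLs
    | @\w+                      # user mentions
    | \#\w+                     # hashtag words
    | [:;=][\-~]?[)(DPp]        # simple emoticons like :) ;) :P
    | \w+(?:['/\-]\w+)*         # words incl. compounds: auto's, km/h, Noord-Holland
    | [!?.]+                    # (repeated) punctuation as one token
""", re.VERBOSE)

print(TOKEN_RE.findall("Galerij #Leeuwarden ingestort!!! :( https://t.co/vb via @user"))
# ['Galerij', '#Leeuwarden', 'ingestort', '!!!', ':(', 'https://t.co/vb', 'via', '@user']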

2.4.4.2 Normalization

Normalization is the classing of terms. More specifically, Manning et al. (2008) describe it as "canonicalizing tokens so that term matches occur despite superficial differences in the character sequences of tokens". They state that the most standard way of normalizing is by implicitly creating equivalence classes or by maintaining relations between unnormalized tokens (e.g. setting up a synonym lexicon). Besides these, there are some more sophisticated normalization rules. Below we summarize the main normalization rules applied in different SA studies (Kharde & Sonawane, 2016; Kornek & Šimko, 2014; Mourad & Darwish, 2013); a few of them are illustrated in the sketch after the list:

- Emoticons. Replace emoticons by their sentiment (e.g. ':)' = 'emoticon_positive')
- URLs. Replace URLs by a tag (e.g. 'https://www.crisis.nl' = 'URL')
- User mentions. Replace user mentions by a tag (e.g. '@brandweer' = '@user')
- Hashtag removal. Remove hashtags from hashtag words (e.g. '#fire' = 'fire')
- Slang removal. Replace slang words by a uniform translation in formal language (e.g. 'OMG' = 'oh my god')
- Expand acronyms. Expand acronyms to whole words (e.g. 'Lwd' = 'Leeuwarden')
- Repeated tokens. Limit sequences of repeated characters (e.g. 'wooooooow' = 'wooow')
- Repeated words. Limit sequences of repeated words (e.g. 'oh oh oh oh oh oh' = 'oh oh oh')
- Lexicon translation. Replace document words by a class based on a lexicon containing words with a prior class (e.g. 'accident' = 'negative')
- Stemming / Lemmatization. Reduce words to a common base form (e.g. 'collapsing' and 'collapsed' = 'collapse')
- Part-of-speech tagging (POS). Distinguish similar words with different POS (e.g. 'drama' = 'drama_noun' or 'drama_adjective')
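Several of these rules reduce to simple substitutions. The sketch below (ours; the thesis does not specify its implementation) applies the URL, user-mention, hashtag-removal and repeated-character rules:

import re

# Minimal sketch (our illustration): four of the normalization rules above.
def normalize(text: str) -> str:
    text = re.sub(r"https?://\S+", "URL", text)   # URLs -> tag
    text = re.sub(r"@\w+", "@user", text)         # user mentions -> tag
    text = re.sub(r"#(\w+)", r"\1", text)         # '#fire' -> 'fire'
    text = re.sub(r"(.)\1{3,}", r"\1\1\1", text)  # 'wooooooow' -> 'wooow'
    return text

print(normalize("Wooooooow #brand bij @brandweer https://t.co/vb"))
# 'Wooow brand bij @user URL'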


2.4.5 Feature selection

The success of a machine learning method depends largely on feature selection, which determines how documents are represented (2.3.1). Kharde and Sonawane (2016) mention some commonly used features in sentiment classification:

- Term presence and term frequencies. The most standard application of this feature is based on raw unigram, bigram and n-gram models. A more sophisticated approach transforms each term into a weighted vector representing the term's relevance. One commonly used weighted feature is tf-idf, which reflects the importance of a term in relation to its document in a corpus. To be more specific, it is a composite weight of term frequency (tf) and inverse document frequency (idf). Term frequency is the number of occurrences of a term in a document. Inverse document frequency is a score which depends on the number of documents in the corpus that contain a specific term: it is high if a term is rare and low if a term is frequent. This results in the following tf-idf formula, which assigns a weight to term t in document d (Manning et al., 2008); see the sketch after this list:

tf-idf(t,d) = tf(t,d) × idf(t)

- POS tags. Some POS (e.g. adverbs, adjectives, nouns and verbs) have proven to be good indicators of subjectivity and sentiment.

- Opinion words and phrases. Lexicons consisting of specific words and phrases with prior polarity or subjectivity classes can be applied as features.

- Position of terms. This feature can define the relevance of a term for a whole sentence or document. The topic-focus structure (2.4.3) is a good example of this.

- Negation. The presence of this feature usually reverses the polarity (e.g. 'niet goed', 'not good').

- Syntax. Some syntactic patterns (e.g. collocations) can indicate the presence of subjectivity in a sentence. For example, Kornek & Šimko (2014) use different kinds of syntactic patterns as features to extract subjectivity. One of these patterns is the "if, then, else" pattern, which according to them includes an opinion in the consequence part (e.g. "Als de brandweer niet op tijd kwam, dan was de hele straat afgebrand", i.e. 'If the fire brigade had not arrived in time, the whole street would have burned down').
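As referenced in the first item of this list, here is a minimal sketch (ours, with toy Dutch sentences) of extracting TF-IDF-weighted n-gram features with scikit-learn:

# Minimal sketch (our illustration): TF-IDF-weighted n-gram feature vectors.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "gelukkig geen gewonden",
    "het huis is afgebrand",
    "geen gewonden bij de brand",
]

vec = TfidfVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
X = vec.fit_transform(corpus)              # sparse document-term matrix
print(X.shape)                             # (3, number of distinct n-grams)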


2.4.6 Algorithms and Performance


Our main goal is to classify crisis tweets into six classes based on two tasks (i.e. subjectivity and polarity classification) using a supervised ML algorithm. In 2.3.3 we discussed some important factors which influence the selection of an appropriate algorithm, among which data domain and domain knowledge. For this reason we will review some previous SA research and compare the performance of different supervised learning algorithms to select the best algorithm for our SA. In a survey about opinion mining and sentiment analysis, Kumar Ravi and Vadlamani Ravi (2015) discuss, among others, subjectivity and polarity classification results of different studies. According to them, subjectivity classification is often more difficult than polarity classification. With regard to subjectivity classification they emphasize the study of Pang and Lee (2004, in Ravi & Ravi, 2015). This study is based on physical proximity between the sentences of a document (i.e. movie reviews), where an SVM and an NB algorithm yielded accuracies of 0.90 and 0.92 respectively.

Kharde & Sonawane (2016) perform a survey of techniques for a SA on Twitter data. Based on a literature review they selected three algorithms to carry out a polarity classification consisting of merely a positive and a negative class. These algorithms are Naive Bayes (NB), Maximum Entropy (MaxEnt) and Support Vector Machine (SVM). Their accuracy results show that SVM (0.7668) outperforms NB (0.7456) and MaxEnt (0.7493). All models were run with a feature set of unigrams, bigrams and stopword removal.

In the paper of Barbosa et al. (2010) a two-step SA method is applied with four different algorithms: NB, SVM, MaxEnt and Artificial Neural Networks (ANN). The two-step method consists of a subjectivity classification which is performed first, after which a polarity classification is performed on the remaining subjective documents. Moreover, for each algorithm they included a unigram, a bigram and a hybrid (i.e. unigram + bigram) model. The comparison of all models showed the hybrid model performing best for all algorithms. The accuracy results also showed SVM (0.88) again outperforming all other algorithms: NB (0.84), MaxEnt (0.83) and ANN (0.77). Table 3 summarizes the results of the studies mentioned above.

Table 3: Overview of the accuracy of different algorithms and tasks

Supervised algorithm   | Subjectivity accuracy (Pang & Lee, 2004) | Polarity accuracy (Kharde & Sonawane, 2016) | Two-step SA accuracy (Barbosa et al., 2010)
Naive Bayes            | 0.92 | 0.75 | 0.84
Support Vector Machine | 0.90 | 0.77 | 0.88
Maximum Entropy        | -    | 0.75 | 0.83


3. Data

The methodological approach of our study is split into two parts: the data collection method and the data classification method. In this chapter we will cover the data collection method by discussing our data domain and corpus selection and explaining our data filtering process. The output of this method forms the input data of the classification method.

3.1 Data domain: Crisis tweets

Table 4: Dutch crises details

Dutch crisis                | Start of crisis  | Casualties            | GRIP | Type | Tweets
Geleen – Chemelot factory   | 09-11-2015 12:00 | no casualties         | 4    | F    | 3,541
Leeuwarden – Centre         | 19-10-2013 17:20 | 1 dead                | 2    | F    | 17,040
Nijmegen – Senior apartment | 20-02-2015 05:29 | 4 dead, 16 injured    | 3    | F    | 1,527
Total – Fires               |                  |                       |      |      | 22,108
Alphen – Bridge             | 03-08-2015 16:15 | 1 dead dog, 1 injured | 3    | C    | 14,980
Leeuwarden – Balconies      | 21-05-2011 12:41 | no casualties         | 3    | C    | 1,099
Twente – Stadium            | 07-07-2011 12:17 | 1 dead, 15 injured    | 3    | C    | 34,123
Total – Collapses           |                  |                       |      |      | 50,202
Raard – New Year            | 01-01-2013 01:07 | 1 dead, 17 injured    | 2    | T    | 1,671
Haaksbergen – Monster truck | 28-09-2014 16:15 | 2 dead, 6 injured     | 3    | T    | 9,694
Winsum – Train derailment   | 18-11-2016 12:04 | 20 injured            | 2    | T    | 2,601
Total – Traffic accidents   |                  |                       |      |      | 13,966
Total – All crises          |                  |                       |      |      | 86,276

Our main goal is to compare the classification performance of different kinds of crisis data compositions, differentiating in the amount of data and in specificity. Therefore we decided to focus on different types of crises, based on the disaster typology and the GRIP method explained by Regtvoort and Siepel (2009). We restrict ourselves to three types of disasters: fires in big buildings (F), collapses of buildings (C) and traffic accidents (T). For each type we collect the tweets of three Dutch crises with a GRIP level of 2 or higher. In this way we aim to compose crisis data compositions of different sizes, differentiating three levels of specificity: a single crisis, all crises of one type together, and all crises together. Table 4 shows the Dutch crises per crisis type in more detail.


3.2 Data collection process

3.2.1 Filter relevant documents

The relevant documents about the Dutch crises are collected from the Dutch Twitter database of the RUG, by taking the following steps:

1. Extract all tweets that were sent within the timestamp (i.e. 3 days from the start) of the crisis.
2. Filter the relevant (crisis) tweets using a standard query: <place> AND ( <kind of crisis> OR <crisis object> OR <crisis verb> )
3. Filter civil-user tweets (based on a previously composed non-civil user list).

Step 1 consists of collecting all tweets that fall within the timestamp of the crisis. The timestamp of each crisis is set to a fixed period of three days from the beginning of the crisis, considering that most of the citizens' perception is expressed during this period. This finding is illustrated by the citizens' tweet frequency distribution graphs of three crisis types in Figures 3, 4 and 5.

[Figure 3: Fire Leeuwarden Centre – citizens' tweet frequency distribution over 3 days]

[Figure 4: Collapse Leeuwarden Balconies – citizens' tweet frequency distribution over 3 days]

[Figure 5: Traffic Accident Raard New Year – citizens' tweet frequency distribution over 3 days]

In step 2 the relevant tweets about a crisis are filtered using standard Boolean query compositions of different elements: place, kind of crisis, crisis object and crisis verbs. All elements in the queries are based on the information of (live)blogs about the crises. In this way we attempt to cover the majority of the relevant tweets. Finally, in step 3 we filter the tweets of civil users, since we are only interested in the perception of citizens. For this we used different existing online user lists (e.g. http://twittergids.nl/, which includes a guide with different categories of Twitter users) as a starting point to create a non-civil user list. Additionally, we extended this list manually, including frequently tweeting non-civil users and adding string patterns of usernames that typically belong to non-civil users (e.g. 'p2000', 'media', 'pol_', 'brw_'). All documents from a user matching one of these usernames or patterns were excluded from our corpora. The number of tweets remaining after step 2 is shown in Table 4.
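The thesis does not show the filter implementation; the sketch below is our illustration of how the Boolean query template of step 2 could be applied, with hypothetical keyword sets:

# Minimal sketch (our illustration, hypothetical keyword sets): the query
# template <place> AND (<kind of crisis> OR <crisis object> OR <crisis verb>).
PLACES = {"leeuwarden", "lwd"}
CRISIS_TERMS = {"brand", "vuur", "afgebrand"}  # kind of crisis / object / verb

def is_relevant(tweet: str) -> bool:
    words = set(tweet.lower().split())
    return bool(words & PLACES) and bool(words & CRISIS_TERMS)

print(is_relevant("Grote brand in centrum Leeuwarden"))  # True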

3.2.2 Filter unique tweets

After collecting the relevant civil tweets for each crisis, we were left with nine datasets of different sizes including many duplicate tweets. The presence of duplicate tweets is mainly a consequence of the commonly used retweet (RT) feature on Twitter; after all, one of the most important information needs during a crisis is information sharing. The content of duplicate tweets does not add any information about the presence of certain sentiments in tweets. However, the frequency of duplicate tweets and the sender information do: for example, they can give information about the impact of a certain sentiment or the influence of a sender. Since we only focus on the influence of the specificity of the data and the size of the corpora, we filter out all repeated tweets without keeping track of the original user and the number of repetitions. Nevertheless, for future work these are possible features to consider. By filtering out all duplicate tweets we retain only unique tweets, from which we randomly selected 700 tweets per crisis for manual annotation.
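A sketch of this deduplicate-and-sample step (ours; the thesis does not show code for it) could be:

import random

# Minimal sketch (our illustration): keep unique tweet texts, then sample 700.
def sample_unique(tweets: list[str], n: int = 700, seed: int = 0) -> list[str]:
    unique = list(dict.fromkeys(tweets))  # drop duplicates, keep order
    return random.Random(seed).sample(unique, min(n, len(unique)))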



3.2.3 Sentence split

Table 5: Overview of crisis corpora

Dutch crisis                                  | Tweets | Total sentences | Sentences per tweet (min) | (max) | (avg)
Fire: Geleen – Chemelot                       | 700    | 1102  | 1 | 7 | 1.5
Fire: Leeuwarden – Centre                     | 700    | 1181  | 1 | 6 | 1.7
Fire: Nijmegen – Senior apartment             | 700    | 1004  | 1 | 5 | 1.4
Fire: Total                                   | 2100   | 3287  | 1 | 7 | 1.5
Collapse: Alphen – Bridge                     | 700    | 1113  | 1 | 7 | 1.3
Collapse: Leeuwarden – Balconies              | 700    | 1206  | 1 | 5 | 1.7
Collapse: Twente – Stadium                    | 700    | 1094  | 1 | 5 | 1.6
Collapse: Total                               | 2100   | 3413  | 1 | 7 | 1.6
Traffic Accident: Raard – New Year            | 700    | 1160  | 1 | 5 | 1.6
Traffic Accident: Haaksbergen – Monster truck | 700    | 1144  | 1 | 6 | 1.6
Traffic Accident: Winsum – Train derailment   | 700    | 1038  | 1 | 6 | 1.5
Traffic Accident: Total                       | 2100   | 3342  | 1 | 6 | 1.6
Crises: Total                                 | 6300   | 10042 | 1 | 7 | 1.6

The decision to split tweets into sentences derives from preliminary research on our data for data operationalization purposes (4.2). One of our findings is that crisis tweets frequently contain multiple sentiments, most of which are expressed in single sentences. In other words, many tweets consist of multiple sentences, and most sentences include a single sentiment. Because of these findings we decided to apply a sentence-level classification instead of a document-level classification. This requires splitting all tweets into sentences as an additional pre-processing task, which is a complex task that deserves further research of its own.

For now we defined some rules which proved sufficient to split most of the tweets correctly into single-class sentences. These rules were applied manually during the data annotation, aiming to split the sentences as consistently and precisely as possible. Table 5 shows an overview of the crisis corpora used as input for our classification system. The main rules for splitting tweets into sentences are summarized below, distinguishing between sentence splitters and exceptions.

Sentence splitters:

1. A sentence ends after one of the following end-of-sentence indicators:
- end-of-sentence punctuation ('.', ';', '!', '?')
- a URL
- an emoticon
- a comma (',') followed by 'maar' ('but') or 'daarentegen' ('on the other hand')
- 'via' followed by a mention (i.e. @user) or a hashtag word (i.e. #word)
- 'by' followed by a URL
- a mention or hashtag word
- a tweet number indicator at the beginning or end of a tweet (e.g. '(1/2)' or '1/2')

2. An end-of-sentence indicator that is subsequently followed by another end-of-sentence indicator, 'cc', '(cont)', a hashtag word or a mention, is part of the current sentence.

3. A new sentence begins with:
- 'RT' (i.e. the start of a retweet)
- an alphanumeric character after an end-of-sentence indicator
- a capital alphanumeric character (except named entities) after a hashtag word
- a mention followed by a colon (i.e. '@user :')

Exceptions:

4. A hashtag or URL at the beginning of the (re)tweet is part of the following sentence.
5. Sentences within quotation marks are split separately.
6. Single words within quotation marks are treated as part of the sentence.
7. Words within brackets are treated as part of the sentence.

Examples: sentence splitters

Tweet: Oei! :( Gelukkig geen gewonden, maar het huis is afgebrand! #brand #plaats ('Oh no! :( Luckily no casualties, but the house burned down! #fire #place')
Sentence 1: Oei! :(
Sentence 2: Gelukkig geen gewonden,
Sentence 3: maar het huis is afgebrand! #brand #plaats

Examples: exceptions

Tweet: "http://t.co/vb Galerij #Leeuwarden ingestort" niet best! ('"Gallery #Leeuwarden collapsed" not good!')
Sentence 1: "http://t.co/vb Galerij #Leeuwarden ingestort"
Sentence 2: niet best!


4. Method

Let us recapitulate the focus of our study to better understand our methodological approach. Our main goal is to give a representation of the citizens' experience during crises using Twitter posts. Because of the large number of tweets sent during crises, we aim to achieve this by performing an automatic classification. In doing so, we compare the effects of specificity and corpus size to find the best balance between them, since there is a trade-off between the two. We first give a general overview of our method, and then explain the most important elements in more detail in the following sections. An illustration of our method is depicted in Figure 6.

4.1 General Overview

We implement a two-step Sentiment Analysis (SA) consisting of an objectivity (i.e. objective, subjective) and a polarity (i.e. positive, neutral, negative) classification task on civil crisis tweets. Both tasks are performed on sentence-level, meaning that each tweet had to be split into sentences (3.2.3) before being classified. Consequently, each sentence is assigned two classes: one objectivity and one polarity class.

Figure 6: Method General Overview. The diagram shows the pipeline: civil crisis tweets are tokenized and manually annotated (sentence split and annotation); data compositions are formed (train sets: a single (new) crisis; crises of one type excluding/including the new crisis; all crises excluding/including the new crisis; test set: the new crisis); normalization (textual, crisis-info, grammar) is applied; TF-IDF n-gram (unigram, bigram, trigram) feature vectors feed an SVM algorithm that yields a predictive model; classification tasks and labels: objectivity (objective/subjective), polarity (positive/neutral/negative) and their SA compounds (obj-pos, obj-neu, obj-neg, sub-pos, sub-neu, sub-neg); evaluation: average precision, recall and f1-score over a 5-fold cross validation, split on tweet-level (train: train set plus 80% of the new crisis; test: the remaining 20% of the new crisis).


Our SA differs from other studies in that we perform the polarity classification on both of the subjectivity classes. The union of both tasks results in six unique compounds that are considered the final output classes6 on sentence-level.
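To make the compounding concrete: a sentence’s final class is simply the pair of its two task labels. The small sketch below uses the abbreviations from Figure 6 (obj/sub, pos/neu/neg); the function name is our own.

```python
def compound_class(objectivity, polarity):
    """Join the two task labels into one of the six compound classes."""
    assert objectivity in {"obj", "sub"} and polarity in {"pos", "neu", "neg"}
    return f"{objectivity}-{polarity}"

print(compound_class("sub", "neg"))  # -> sub-neg
```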

These classes also serve as the starting point for the manual annotation of the data and its operationalization (4.2). The process of manual annotation turned out to be complex and very time-consuming. The main issues arise from sentences containing multiple sentiments and from the lack of context and prosodic cues. To overcome these issues, we set up several additional operationalization and annotation rules. Moreover, a simpler operationalization process was performed for the tweet-level classification, since the classes of whole tweets are derived from the sentence-level classification.

After the annotation process, the data undergoes a preprocessing phase (i.e. tokenization and normalization). The normalization process (4.4) is of particular importance for our study, because we focus on the effect of specificity and corpus size and the trade-off between the two. After all, we expect normalization to improve the performance of larger, less specific corpora. That is why we compare the effect of normalization on different data compositions (varying in specificity and corpus size).

The preprocessed data eventually forms the input of our system. A restricted review of SA studies showed a decent performance of SVM models on both of our SA tasks (2.4.6). It also showed that hybrid feature models combining different n-grams yield good results. The use of these features is in line with our study, as we perform a classification based on textual cues. For these reasons we apply an SVM algorithm with TF-IDF n-grams (i.e. unigrams, bigrams and trigrams) as feature set.
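As a concrete illustration, the sketch below builds such a model with scikit-learn. The library choice and the default hyperparameters are our assumptions; only the algorithm (SVM) and the feature set (TF-IDF uni-, bi- and trigrams) are fixed by our design.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

def build_model() -> Pipeline:
    # TF-IDF weighted unigrams, bigrams and trigrams feeding a linear SVM.
    return Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 3))),
        ("svm", LinearSVC()),
    ])

# One model is trained per task, e.g.:
# objectivity_model = build_model().fit(train_sentences, objectivity_labels)
# polarity_model = build_model().fit(train_sentences, polarity_labels)
```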

Finally, the system is evaluated by taking the average performance (precision, recall and f1-score) over a five-fold cross validation in which the data is split on tweet-level but trained and tested on sentence-level (4.5).
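A sketch of this evaluation loop is given below, using scikit-learn’s GroupKFold so that all sentences of one tweet land in the same fold. The macro averaging and the build_model() helper from the previous sketch are our assumptions, and the 80%/20% handling of the new crisis shown in Figure 6 is omitted for brevity.

```python
import numpy as np
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import GroupKFold

def cross_validate(sentences, labels, tweet_ids, n_folds=5):
    # Folds are formed over tweet ids (split on tweet-level), while
    # training and testing happen on the individual sentences.
    X, y, groups = np.array(sentences), np.array(labels), np.array(tweet_ids)
    scores = []
    for train_idx, test_idx in GroupKFold(n_splits=n_folds).split(X, y, groups):
        model = build_model()  # pipeline from the previous sketch
        model.fit(X[train_idx], y[train_idx])
        predictions = model.predict(X[test_idx])
        scores.append(precision_recall_fscore_support(
            y[test_idx], predictions, average="macro", zero_division=0)[:3])
    return np.mean(scores, axis=0)  # average precision, recall and f1-score
```

In the following sections we explain this general overview of our method in more detail.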

6 Output classes: objective-positive, objective-neutral, objective-negative, subjective-positive, subjective-neutral, subjective-negative.


4.2 Operationalization and Annotation Process

We already mentioned the importance of the operationalization and annotation process (i.e. its output forms the input of our system). In this context it is crucial to define the classes as clearly and unambiguously as possible, so that the annotation can be performed as objectively and precisely as possible. We mainly focus on the operationalization and manual annotation process on sentence-level, as our system is based on this level of granularity. Our tweet-level classification is subsequently based on the compound of this output, by applying some operationalization rules. Consequently, in this section we first discuss the operationalization and manual annotation process on sentence-level, and then explain how the tweet-level classification is derived from the sentence-level classification.
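As a minimal sketch of this derivation, the function below merges a tweet’s sentence-level compounds into one tweet-level class, assuming for illustration a precedence rule in which subjective compounds override objective ones; the actual operationalization rules are discussed later in this section.

```python
def tweet_level_class(sentence_classes):
    """Merge sentence compounds into one tweet class,
    e.g. ['obj-neu', 'sub-neg', 'sub-neg'] -> 'sub-neg'."""
    subjective = [c for c in sentence_classes if c.startswith("sub")]
    # assumed precedence rule: keep only subjective compounds if any exist
    kept = subjective if subjective else sentence_classes
    # deduplicate and join the remaining compounds (sorted for determinism)
    return "/".join(sorted(set(kept)))

print(tweet_level_class(["obj-neu", "sub-neg", "sub-neg"]))  # -> sub-neg
```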

4.2.1 Sentence-level

Table 6: Six-way classification sentence-level examples

Positive

Objective:
S1: RT @volkskrant: Geen slachtoffers na instorten brug in Alphen http://t.co/uCU63jBsI

Subjective:
S1: De brandweer had de brand snel onder controle
S1: RT @PvdS: Respect voor de brandweer! #kelders http://t.co/ZCGUFbGGfu
S1: Goede communicatie door @gemeenteaadr over kraan incident. S2: Snel en informatief!

Neutral

Objective:
S1: Video: drone filmt ravage van bovenaf http://t.co/ItqJlUyH5C
S1: RT @gemeenteaadr: Zorgen over de betrokkenheid van geliefden? S2: Bel: 088 269 00 00.
S1: RT @BRW_HM: Instorting kraan #Alphen. S2: Er wordt gesproken over slachtoffers. S3: Woordvoerder is onderweg [S1, S3]

Subjective:
S1: WOW wat een verschrikkelijke foto’s! S2: Sterkte voor de betrokkenen. [S2]
S1: Jezus, wat een foto’s uit #Alphen S2: Hopelijk zijn er geen gewonden. [S2]

Negative

Objective:
S1: RT @BRW_HM: Instorting kraan #Alphen. S2: Er wordt gesproken over slachtoffers. S3: Woordvoerder is onderweg [S2]
S1: Brand Leeuwarden eist een slachtoffer op

Subjective:
S1: WOW wat een verschrikkelijke foto’s! S2: Sterkte voor de betrokkenen. [S1]
S1: Jezus, wat een foto’s uit #Alphen S2: Hopelijk zijn er geen gewonden. [S1]

The sentence-level operationalization and annotation process turned out to be very complex and time-consuming for both tasks of the SA (i.e. the objectivity and polarity task). The main challenge was to define a comprehensive and complementary set of distinct rules for all classes, in which each sentence would fit exactly one class (i.e. a one-to-one relation). Defining such operationalization rules required many subjective, overlapping and complex choices to be made.


In the following subsections we will discuss the main rules and associated complexities of these processes. Appendix 1 summarizes all rules in the form of a codebook. The most common complexities are summarized in Appendix 2. Notice the distinction in the multiple sentiments category in this appendix: underlined and bold elements in the Cases column correspond to the underlined and bold words in the Examples. However, before explaining all rules we will summarize the main classes of our study.

4.2.1.1 Main classes

In this paper we try to provide a classification that represents the public experience during a crisis and that can be used for evaluation afterwards. A distinction between objective and subjective documents already seems useful, since it yields a classification in terms of the primary objectives of crisis communication. However, an additional sub-classification provides a more detailed classification of the citizens’ perception that goes beyond these objectives. Much Sentiment Analysis research has successfully combined objectivity with polarity (i.e. positive, neutral, negative) classification to reflect public sentiments. This inspired us to extend our objectivity classification with a polarity classification.

Even though in most Sentiment Analysis research the objectivity class is discarded when a neutral class is added, we preserve both classifications. In this way we aim to distinguish between documents in which a sentiment is expressed explicitly in the message itself, and documents that tend to provoke a certain sentiment without it being expressed explicitly. As far as we know, this results in an innovative six-way SA classification, which is performed in two steps combining two classification tasks: an objectivity classification and a three-way polarity classification. The union of these classes is considered the final classification per sentence. Table 6 shows a couple of tweet examples for each class. In this table tweet sentences are depicted with an S followed by the sentence number (e.g. S1, S2, ...); where relevant, the sentence(s) determinative for the class assignment are indicated in square brackets. Finally, a summary of the main classes and corresponding descriptions is given below:

Subjective: uncertainty, judgement, opinion or emotion explicitly expressed by the sender.

- Positive: positive sentiment

- Neutral: uncertainty or neutral sentiment

- Negative: negative sentiment

Objective: sharing facts, possibly including the author’s sentiment implicitly.

- Positive: fact that (eventually) causes a positive sentiment to the author
- Neutral: neutral fact
- Negative: fact that (eventually) causes a negative sentiment to the author
