
MSc Artificial Intelligence

Track: Intelligent Systems

Master Thesis

Harnessing disagreement in event text classification using CrowdTruth annotation

by

Maurits van Bellen

6148085

June 3, 2016

42 EC


Supervisors:

Dr. M. van Someren (UvA)

Dr. L.M. Aroyo (VU)

Dr. Z. Szlávik (IBM)

Assessor:

Dr. L. Dorst (UvA)


Abstract

Gathering new expert-annotated data for any machine learning task is a costly and time-consuming approach. Using crowd-sourcing to gather human interpretations of text, video and audio is a cost- and time-efficient alternative, and many crowd-sourcing platforms are available, for instance CrowdFlower or Amazon Mechanical Turk. However, while experts have specific domain knowledge and an incentive to annotate following a strict set of rules, this does not apply to the human annotators, or Workers, on crowd-sourcing platforms. In "CrowdTruth methodology, a novel approach for gathering annotated data from the crowd", the authors state that human interpretation is subjective, and thus having multiple people perform the same task on a crowd-sourcing platform will lead to multiple interpretations of that task. A common way of dealing with this problem is Majority Voting: take the annotation with the highest frequency and disregard all other annotations. However, this removes the disagreement, which is a signal, not noise. In this thesis we introduce a novel way to incorporate the disagreement between annotators as a way to weigh the features for textual data. We propose multiple novel algorithms to create the crowd-based distribution weights and discuss the pros and cons of each one. We test our weighted features with a modified Multinomial Naive Bayes algorithm. We show that using crowd-based distribution weights we are able to perform as well as or better than expert-labeled data while learning on only 10% of the training data. Furthermore, our approach also has a much faster learning rate.


Contents

1 Introduction
   1.1 Introduction
       1.1.1 Event recognition in Natural Language Processing
       1.1.2 Datasets
       1.1.3 Crowd-sourcing
   1.2 Problem statement
   1.3 Research Questions and Objectives
       1.3.1 Approach
   1.4 Outline

2 Related Work
   2.1 Crowd-sourcing
   2.2 CrowdTruth
       2.2.1 Metrics
       2.2.2 Sentence Relation Score
       2.2.3 Sentence Clarity
       2.2.4 Spam Detection
   2.3 Event Classification
       2.3.1 Features
       2.3.2 Algorithms
       2.3.3 Classifier evaluation
       2.3.4 Feature Selection

3 Method and Approach
   3.1 Data Acquisition
       3.1.1 Crowd Task
       3.1.2 Segmentation
       3.1.3 Spam detection
       3.1.4 Weighting
       3.1.5 Feature Selection
   3.2 Feature Selection
       3.2.1 Algorithm
       3.2.2 Multinomial Naive Bayes

4 Dataset
   4.1 TimeML
       4.1.1 Crowd Annotated Dataset

5 Experimental Setup and Results
   5.1 CrowdTruth
       5.1.1 Spam detection
       5.1.2 Labeling
   5.2 Weighting
       5.2.1 Sentence Clarity
       5.2.2 Event-Sentence Clarity
       5.2.3 Crowd-TFIDF
       5.2.4 Majority Voting
   5.3 Naive Bayes
       5.3.1 Performance
       5.3.2 Majority Voting Naive Bayes
       5.3.3 Crowd Weighted Multinomial Naive Bayes

6 Discussion
   6.1 Future work
   6.2 Conclusion

Bibliography


List of Figures

3.1.1 Crowd Annotation Task
3.1.2 Crowd Annotation Task Event Selection
3.1.3 Crowd Annotation Task Event Attributes Selection
5.1.1 Worker Metrics Distributions, showing the Worker-Worker Agreement (WWA), the Worker-Sentence Score (WSS) and the Average Annotations per Sentence per Worker (A/S)
5.1.2 Metrics Distribution with spammers
5.1.3 Metrics Distribution without spammers
5.2.1 Histogram of Sentence Clarity Score on train set
5.2.2 Histogram of Sentence-Event-Clarity Score on train set
5.2.3 Histogram of Crowd TF-IDF Score on train set
5.3.1 Learning Curves
5.3.2 Learning Curves with weights
5.3.3 Spammer Learning Curves with weights


List of Tables

3.1.1 Vectorized annotations of sentence
5.1.1 Ten crowd annotations for "The quarter was terrible, and the future looks anything but encouraging."
5.1.2 Spammers in pilot
5.1.3 Context Words POS Tag Frequencies
5.1.4 Context word Weights
5.2.1 Sentence Clarity Statistics
5.2.2 MeanShift Clusters and words
5.2.3 Sentence-Event-Clarity Statistics
5.2.4 Crowd TF-IDF Statistics
5.2.5 Events annotated
5.3.1 Majority Voting Accuracy comparison
5.3.2 Accuracy of MNB trained with CrowdLabels and weights


Chapter 1

Introduction

1.1 Introduction

1.1.1 Event recognition in Natural Language Processing

Event recognition is one of the major tasks within Information Extraction. Named entity recognition [11] has opened up the field to language-independent content, allowing the identification of locations and names. Event recognition is a core Natural Language Processing (NLP) task, used in many applications such as summarization [38] and Question Answering [33]. Many models have been proposed for the analysis and recognition of the different forms events can take in natural language [40].

In NLP the definition of an event is often dependent on the target application. The aim of this kind of task is to cluster documents on the same topic or event. Information Extraction tasks take a different approach. This field proposes standard schemes to annotate individual events in a document according to a predefined set of rules for a given domain (e.g. Political Conflict) [13]. More recent schemas also capture the temporal ordering of events [41].

As research in Information Extraction has shown, machine learning can be a valid approach to extract information such as persons, entities or events from text [14]. Typical machine learning Information Extraction problems are solved by gathering annotations on a given corpus, transforming this corpus into features using various methods such as bag-of-words, tree parsing (which evaluates the grammar in a given text) and POS tagging (which assigns a part of speech, such as noun or verb, to each word), and then training a classifier to extract the information from text by supplying a training dataset to learn on.

Frequently used algorithms for event text recognition include the Naive Bayes classifier [24], SVM [7] and CRF [22]. McCallum et al. show that the Multinomial Naive Bayes Model is one of the better algorithms for NLP event classification. A Multinomial Naive Bayes Model captures word frequency information in sentences by modeling the distribution of words in a document as a multinomial [24].

1.1.2 Datasets

One of the well-known bottlenecks in data-driven NLP research is the lack of sufficiently large datasets, as classification algorithms depend on having enough information to extract from them. Datasets need to be labeled by one or multiple experts (in the domain of the dataset) following a set of rules; in the case of multiple experts with different labels, a consensus about the label to be used must be reached [39].

1.1.3 Crowd-sourcing

Using crowd-sourcing to gather human interpretations of text, video and audio is a cost- and time-efficient alternative, allowing non-expert humans to annotate a given text, video or audio task for monetary gain. A crowd-sourced approach usually runs on a crowd-sourcing platform such as CrowdFlower or Amazon Mechanical Turk. However, while experts have specific domain knowledge and an incentive to annotate following a strict set of rules, this does not apply to the human annotators, or Workers, on the crowd-sourcing platforms. While one could demand that workers annotate following the same set of rules, this is often too time-consuming for a worker to consider doing the task, leading to a lower participation rate and removing the benefits of using crowd-sourcing as an alternative to expert annotations. The different expertise levels of experts and crowd cannot be ignored: expert-labeled data will be of higher quality, but takes more time and is more costly to obtain. How well the intended message in a text is recognized depends on the background knowledge of the domain [25]. Crowd-annotated data is more likely to have different annotations for the same natural language elements, due to multiple interpretations. In [4] the authors state that human interpretation is subjective and thus having multiple people perform the same task on a crowd-sourcing platform will lead to multiple interpretations of this task. They call this methodology CrowdTruth.

A common way of dealing with multiple annotations for the same task is Majority Voting: take the annotation with the highest frequency and disregard all other annotations; the same can be applied to crowd-sourcing [16]. Alternatively, annotators can be forced to form a consensus [21] [2]. In Truth is a Lie [5] the hypothesis is that annotator disagreement is signal, not noise, and the authors propose to use the "disagreement" distribution in learning. Our goal is to build a classifier on the crowd-annotated noisy data which can perform as well as or better than a classifier trained on an expert-annotated dataset with little to no noise.

1.2 Problem statement

The data that can be gathered with CrowdTruth contains numerous examples of disagreement. This disagreement is often disregarded in machine learning, but may contain information that can be used to improve training. Removing this disagreement also removes information from the data: when learning on single-label data only, we are forced to disregard that the data could also serve as an example of another label. We are looking to explore the trade-off between many noisy annotations and relatively few annotations with little to no noise. Our hypothesis is that crowd-sourced annotation returns annotations with ambiguity, but that after preprocessing this ambiguity can be used in training a classifier, leading to less data being required to train a model. However, as we are dealing with a crowd-sourced approach that does not guarantee high-quality labels, we need some measure of reliability of each label for every example. So before we can use these crowd annotations we need to tackle the following problems:

• Spammers in crowd annotations give a false representation of the labels.

• Metrics used in CrowdTruth approach can provide a measure of reliability of each label but are designed for relations, not for sentences (or events).

• The distribution of labels needs to be incorporated in learning.

1.3 Research Questions and Objectives

• Can we gather a distribution over labels using crowd-sourcing?

• How can we measure the reliability of the labels?

• How can we prepare a distribution of labels for learning without losing important information?

• Does using this distribution over labels improve the accuracy or learning speed of a classifier?

1.3.1 Approach

We will show how Multinomial Naive Bayes can be weighted using crowd-annotated data. To gather the crowd annotations, we apply the CrowdTruth methodology to create a crowd-sourced task for event detection in NLP where there is maximal potential for disagreement. A way to capture the reliability of each label is to use the CrowdTruth metrics: we store the annotations of the crowd in an annotation vector, which allows us to use the cosine similarity of these vectors as a similarity measure that serves as the basis of the reliability of each label. We propose and implement changes to the CrowdTruth metrics so that sentences longer than seven words are no longer penalized. To create the weights for the crowd-annotated data, we propose a number of distribution-based learning approaches and discuss the pros and cons of each one. As our hypothesis is that we can use these weighted crowd annotations as a cheaper and more time-efficient alternative to expert annotations, we evaluate our approach by comparing these methods to Majority Voting, and by comparing Crowd Majority Voting to Expert Majority Voting based on accuracy and convergence speed while keeping the sizes of the datasets equal. Lastly we compare Expert Majority Voting to the crowd distribution methods.

1.4 Outline

Chapter 2 gives an overview of the related research regarding crowd annotations and related algorithms. Chapter 3 explains the theory behind these algorithms, the adaptations that have to be made to use the new data, and the approach taken in this thesis. Chapter 4 discusses the dataset. In chapter 5 the experiments are explained and their results presented. In the final chapter, chapter 6, the results are discussed, the conclusion of this thesis is presented and additional research paths are proposed. Any additional information can be found in the Appendix.


Chapter 2

Related Work

In order to provide context to the Research Questions and Objectives we give a review of the important concepts. As our approach requires crowd-annotated data, we start with an overview of crowd-sourcing approaches. Then we discuss the related work on preprocessing crowd-annotated data to give a weight to each individual annotation and to remove spammers. Lastly we discuss Event Classification in NLP: features, algorithms, evaluation and feature selection.

2.1 Crowd-sourcing

Crowd-sourcing is defined by Howe et al. [17] as outsourcing a job that was previously done by an expert to a large group of people of various, unknown skill levels.

Crowdsourcing as a Model for Problem Solving [9] describes how a group of people can be more intelligent than the smartest among them and calls this Crowd Wisdom. This Crowd Wisdom allows crowds to perform as well as or better than experts in a given domain. Crowd-sourcing platforms aim to provide a way of extracting this Crowd Wisdom.

There are many examples of crowd-sourcing platforms today. For example, Amazon Mechanical Turk (www.mturk.com) is a marketplace for work where developers can write Human Intelligence Tasks (HITs) that are fulfilled by numerous workers for monetary gain. In [19] researchers find that 90% of the HITs pay $0.10 or less.

Rosenberg et al. [32] take a different approach, relying on online groups working together to make a decision. While this approach is able to predict real-world outcomes better and with fewer users than a poll-based approach, it demands that the possible options are known in advance, making it difficult to use for open-ended questions. Another example is CrowdFlower (www.crowdflower.com), which allows data gathering from different channels such as MTurk [8]. While crowd-sourcing is a cost-effective alternative for annotation gathering, the skill level of workers cannot be determined beforehand. This means that while the speed of annotation gathering goes up and the cost goes down, the quality of the annotations is not guaranteed. The trade-off between crowd- and expert-gathered annotations is between many noisy but cheap annotations and relatively few more expensive but less noisy annotations.

2.2 CrowdTruth

The CrowdTruth framework provides a way of measuring the reliability of (crowd) annotated labels. This framework proposes a set of metrics that can capture different aspects of this reliability.

2.2.1 Metrics

The CrowdTruth papers [5] [18] [4] describe various metrics that are designed for relation extraction in written text. These metrics aim to capture the reliability of a label. Annotations for a given sentence gathered using the CrowdTruth approach are stored in a vector representation V_s. This representation allows calculation of the cosine similarity between two vectors. The cosine similarity returns how similar two vectors are on a scale of 0 to 1, where 0 means that the vectors are independent. Furthermore, by using crowd-sourcing we ensure that each word has an event or non-event label.

\cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\lVert v_1 \rVert \, \lVert v_2 \rVert} \qquad (2.1)

where v_1 \cdot v_2 is the dot product between the two vectors:

v_1 \cdot v_2 = \sum_{i=1}^{n} v_{1i} \, v_{2i} \qquad (2.2)

However, these metrics are designed for relation correlations, thus only having one-to-one relationships. In this section we touch on the metrics that we use in our approach.

Worker-Sentence Score

The worker-sentence score is a measure of the quality of the annotations of a given worker for a given sentence (2.3), where V_{s,i} is the vector of annotations of worker i on sentence s and V_s is the sum of the annotation vectors of all workers for sentence s:

\mathrm{wws}(w_i, s) = \cos(V_s - V_{s,i}, \; V_{s,i}) \qquad (2.3)
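These vector-based metrics are straightforward to compute. Below is a minimal sketch, assuming binary word-level annotation vectors per worker; the function and variable names are illustrative and not taken from the CrowdTruth codebase.

```python
import numpy as np

def cosine(v1, v2):
    # Equations (2.1)-(2.2): dot product normalised by the vector magnitudes.
    denom = np.linalg.norm(v1) * np.linalg.norm(v2)
    return float(np.dot(v1, v2) / denom) if denom else 0.0

def worker_sentence_score(worker_vectors, i):
    # Equation (2.3): compare worker i against the aggregated annotations
    # of the remaining workers on the same sentence.
    v_s = worker_vectors.sum(axis=0)
    v_si = worker_vectors[i]
    return cosine(v_s - v_si, v_si)

# Toy annotation vectors for one four-word sentence (1 = word marked as event).
workers = np.array([[0, 1, 1, 0],
                    [0, 1, 1, 1],
                    [1, 0, 0, 0]])
print([round(worker_sentence_score(workers, i), 2) for i in range(len(workers))])
```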

Worker-Worker Agreement

Worker-worker agreement is a measure of the average agreement between two workers over all sentences they share. It measures the agreement in the number of annotations they have in common for sentences annotated by both (2.4), where U_{w_i,w_j} is the set of units annotated by both worker w_i and worker w_j. For our annotation we took each word in a span as a separate annotation, as comparing two similar spans would give a lower than intended score.

\mathrm{wwa}(w_i, w_j) = \frac{\sum_{u \in U_{w_i,w_j}} V_{u,w_i} \cdot V_{u,w_j}}{\sum_{u \in U_{w_i,w_j}} V_{u,w_i}} \qquad (2.4)

Number of annotations per sentence

This worker metric indicates the average number of events a worker annotates per sentence. For our implementation we took both a single word and a single span as an event, because our crowd task allows segmentation in spans, not in words.

\text{a/s}(w_i) = \frac{\text{nr of annotations of worker } w_i}{\text{nr of sentences annotated by worker } w_i} \qquad (2.5)

2.2.2 Sentence Relation Score

The Sentence Relation Score is the probability that a given sentence expresses a given relation (2.6), where R_r is the unit vector for a given relation r:

\mathrm{srs}(s, r) = \cos(V_s, R_r) \qquad (2.6)

2.2.3 Sentence Clarity

Sentence Clarity is the maximum relation score expressed in a sentence, as defined in (2.7):

\mathrm{sc}(s) = \max_{r}\bigl(\mathrm{srs}(s, r)\bigr) \qquad (2.7)

These metrics fail to give a correct score to a complex sentence with multiple events, where annotators might select different spans for the same event but are in agreement about the events in the sentence. We propose a change in the implementation of these metrics to ensure that longer sentences are not punished for having more options for disagreement.

2.2.4 Spam Detection

A problem with crowd-sourced approaches is the possibility of spammers. Spammers introduce noise by purposely adding false labels; this can be mitigated by removing the spammers or by discouraging workers from spamming in the task. Hirth et al. [16] propose a spam detection method based on Majority Voting, but this removes the possibility of using the disagreement. Dawid et al. [12] show that Expectation Maximization can be used to estimate the error rate of individual annotators, but it assumes that there is one correct annotation. Soberon et al. [35] show that a single task can have multiple good answers and propose multiple metrics to capture spammers based on worker annotations. Aroyo et al. [3] show how a combination of these metrics can be used to identify low-quality workers and how the Sentence Clarity metric can be an indication of how much of a given class is expressed in a given training example.

2.3 Event Classification

In event classification the task is to classify a given word, span, sentence or document as being an event or not an event. The definition of an event tends to change depending on the task, as the term has some inherent ambiguity. This makes event classification a good candidate to test our approach.

2.3.1 Features

In the TempEval-2 challenge the goal is to find and classify events in an annotated dataset. The best performing teams used a combination of syntactic features (POS features, BoW, parse features, Named Entity Recognition) and linguistic features (stemming, WordNet Synsets, WordNet Lemmas), together with stop-word removal. Syntactic features describe the syntax of a target word or phrase: the way in which words are put together to form phrases or sentences, and the part of grammar dealing with this. Linguistic features help identify the intended sense of a target word. In the following subsections we discuss the features we have chosen and how we have extracted them.

PoS

Each word in a sentence may be classified into different syntactic classes such as nouns, verbs and adjectives. These classifications are known as the parts of speech (POS). Assigning these classifications to the words in a sentence is known as part of speech tagging. For example the following sentence:

Henry will host the party (2.8)

We can PoS tag this sentence as shown in (2.9),where NNP stands for proper noun, MD stands for modal, VB stands for verb, DT stands for determiner and NN for noun.

Henry/NNP will/MD host/VB the/DT party/NN (2.9)

The PoS tag of a sentence is dependent on the context it is used in. Once all the words in a text have been marked with their appropriate classifications we can feed this text as input to a classifier which will take into account the PoS tags.

These PoS tags can be assigned automatically by using a PoS tagger. PoS taggers usually use a directed graphical model such as a Hidden Markov Model or a Conditional Markov Model to predict the tag of a word based on the tags in front of this word, as defined in (2.10), where t_i references the i-th tag:

P(t, w) = \prod_{i} P(t_i \mid t_{i-1}, w_i) \qquad (2.10)

We are using the Stanford Log-linear Part-Of-Speech Tagger [37].
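To illustrate what a tagger produces, the sketch below runs the example sentence (2.8) through NLTK's off-the-shelf tagger, used here only as a stand-in for the Stanford tagger; it yields the same Penn Treebank tags shown in (2.9).

```python
import nltk

# One-time model downloads for the tokenizer and the default tagger.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

tokens = nltk.word_tokenize("Henry will host the party")
print(nltk.pos_tag(tokens))
# [('Henry', 'NNP'), ('will', 'MD'), ('host', 'VB'), ('the', 'DT'), ('party', 'NN')]
```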

Bag of Words

The Bag of Words model is a simplified representation of the words found in a text. Instead of using the actual words in a text, a vector is created representing these words and their counts in the text. This technique is often used in Information Retrieval and Natural Language Processing, but can also be applied in image recognition. It has to be noted that Bag of Words does not take word ordering into account [34].
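A minimal bag-of-words sketch using scikit-learn's CountVectorizer; this is our own illustration, not the exact feature pipeline used in this thesis.

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The quarter was terrible",
             "The future looks anything but encouraging"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(sentences)  # sparse document-term count matrix
print(vectorizer.get_feature_names_out())
print(counts.toarray())  # word order is discarded, only counts remain
```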

Parse Features

Parse features are features that come from parsing the text as a tree. These features can be created when using a tree-structured model. Tree-structured models compose a phrase and sentence representation from its sub-phrases according to a syntactic structure. These can be formed by Long Short-Term Memory networks (LSTM) [36]. A tree-structured model does not lose word order information.

Named Entity Recognition

Named Entity Recognition is part of Information Extraction. It consists of two distinct problems: detection of named entities and classification of the type of entity (e.g. Person, Location, etc.). There are many automatic Named Entity Recognizers that can reach human-like performance levels [27].

WordNet

WordNet is a large lexical database of English where POS-tagged instantiations of words are grouped together into sets of similar meaning (Synsets).

Synsets are defined as sets of synonyms that share a common meaning; for example, the sets {board, plank} and {board, committee} can serve as unambiguous designators of these two meanings of board [26]. Lemmas are defined as the words that make up a Synset.

These Synsets and Lemmas can be used to accelerate the learning process of a classifier.
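A short sketch of how Synsets and their Lemmas can be retrieved with NLTK's WordNet interface (assuming the WordNet corpus has been downloaded):

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

# Each synset is one sense of "board"; its lemmas are the words in that set.
for synset in wn.synsets("board")[:3]:
    print(synset.name(), "-", synset.definition())
    print("  lemmas:", [lemma.name() for lemma in synset.lemmas()])
```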

2.3.2 Algorithms

Frequently used algorithms for event text recognition include Naive Bayes classifier [24], SVM [7] and CRF [22].

McCallum et al. show that the Multinomial Naive Bayes Model is one of the better algorithms for NLP event classification. A Multinomial Naive Bayes Model captures word frequency information in sentences as counts [24]. Variations on Multinomial Naive Bayes include Transform-Weight Complement Naive Bayes and Transform-Weight Normalized Complement Naive Bayes [30], which are weighted models. Furthermore, using TF-IDF scores rather than raw frequency counts is shown to improve Multinomial Naive Bayes performance and can reach performance comparable to SVM [20]. We will show how Multinomial Naive Bayes can be weighted using crowd-annotated data.

2.3.3 Classifier evaluation

Classifiers that are trained on a training set can be evaluated using a supplied test set. This evaluation is based on how many data points in the test set have been correctly classified. A data point is considered a True Positive (TP) when it has been classified as positive and the supplied test set label of that data point is also positive. If the data point is classified as negative while the test label is also negative, it is considered a True Negative (TN). If the data point is classified as positive but the test set label of that data point is negative, it is considered a False Positive (FP). If the data point is classified as negative but the test set label of that data point is positive, it is considered a False Negative (FN). The accuracy of a classifier is computed by dividing the total number of correctly classified data points by the total number of data points:

\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (2.11)


More commonly used metrics for evaluating classifiers are Precision, Recall and F1-score [28].

Precision is the percentage of data points classified as positive that are actually positive:

\text{Precision} = \frac{TP}{TP + FP} \qquad (2.12)

Recall is the percentage of positive data points that are correctly classified:

\text{Recall} = \frac{TP}{TP + FN} \qquad (2.13)

The F1-score is a measure that combines precision and recall, weighting them equally:

\text{F1-score} = \frac{2 \cdot P \cdot R}{P + R} \qquad (2.14)
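For concreteness, here is a small sketch computing these four measures from predicted and true labels (1 = event, 0 = non-event); the helper function is our own illustration.

```python
def evaluate(y_true, y_pred):
    # Confusion counts follow the definitions above; the positive class is "event".
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1

print(evaluate([1, 0, 1, 1, 0], [1, 0, 0, 1, 1]))  # approx. (0.6, 0.67, 0.67, 0.67)
```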

2.3.4 Feature Selection

One of the problems when using text-based features is the sparsity of the data, as text follows Zipf's law: the r-th most frequent word has a frequency f(r) that scales according to

f(r) \propto \frac{1}{r^a} \qquad (2.15)

with a close to 1 [42]. The most frequent word (r = 1) has a frequency \propto 1, the second most frequent word (r = 2) has a frequency \propto \frac{1}{2^a}. This leads to the creation of noisy features based on

words that only occur once in a text or corpus. Using BoW features can result in very efficient text classification; the limitations of such an approach, such as the number of zero values and multiple words having the same meaning, can be mitigated by feature reduction algorithms that create a more abstract representation of the data [1]. Rogati et al. [31] discuss the two common feature selection methods: Information Gain (IG) and Chi Squared (CHI).

The Information Gain of a feature t_k over the class c_i is the reduction in uncertainty about the value of c_i when the value of t_k is known. The Chi Squared test for feature t_k over the class c_i is the statistical likelihood that feature t_k and class c_i are dependent. In both methods features are given a score; if the score is above a certain threshold, the feature is relevant for the class. Rogati et al. [31] show how CHI2 can outperform IG.

LDA is another common feature reduction method. Its strength lies in projecting the data onto new axes which maximize the variance, which allows it to identify redundancy based on class membership [6]. However, LDA and other statistical learning methods suffer from the small sample size (SSS) problem: when the number of training samples available is smaller than the dimensionality of the sample space, the sample scatter matrix may become singular, making the computation of the eigenvalues and eigenvectors of S_W^{-1} S_B impossible [29].


Chapter 3

Method and Approach

Our goal is to create a classifier that uses a crowd-annotated distribution of labels. Before we are able to create this classifier we need to gather and preprocess the data, dealing with spammers, reducing noise and giving a weight to each individual annotation. Next we have to select features for our algorithm to use and prepare the algorithm to use the distribution over labels.

3.1 Data Acquisition

3.1.1 Crowd Task

Following the CrowdTruth methodology we allowed workers far more freedom than the experts had. We split our corpus into batches of 30 sentences and created an annotation task for each batch. In our annotation task workers were asked to select all events in the sentences, but we allowed them to select spans (phrases) rather than single words. This should allow for more inter-annotator disagreement. As seen in figures 3.1.1 and 3.1.2, the crowd only had limited restrictions when annotating the data: an event consists of one or more words in the sentence, the event has to be selected as one continuous span, and up to 30 events can be selected in a given sentence.

To reduce spam we required workers to take at least 30 seconds on each sentence. If no events were found after 30 seconds we asked the workers why there are no events in this sentence, with a minimum of 15 characters required as explanation. It is important to note that for the rest of our experiments this explanation was irrelevant and was merely in place as an extra barrier against spammers. To ensure that the CrowdTruth metrics [5] could be used we let every sentence be annotated by 15 different workers, leading to 450 judgments per task. Based on previous experiments within the CrowdTruth team the price was set at $0.05 per labeled sentence. Furthermore, we only allowed native English speakers to complete the task as a way to reduce possible spammers; this can be done via the CrowdFlower platform.

We ran an initial pilot to find the optimal number of sentences that can be given in a single task while still finishing in a reasonable amount of time and divided the training set in batches accordingly. Due to time constraints we only ran 230 sentences.

[Figures 3.1.1, 3.1.2 and 3.1.3: screenshots of the Crowd Annotation Task, the event selection step and the event attributes selection step.]

3.1.2 Segmentation

Segmentation is a problem: because the experts had a much more restricted set of instructions, the segmentation of the crowd annotations and that of the expert annotations are vastly different. In our case we are dealing with two different kinds of segmentation. First we have the expert segmentation, where the experts decided on the word level whether or not a word is an event, as instructed. Second we have the crowd, who had the option to create their own segments. This creates a big problem for our classifier, as the required inputs and outputs are vastly different, meaning any comparison between the two becomes troublesome. As a solution we have segmented our sentences in our preprocessing step as the expert would, on a word basis. However, during preprocessing we made sure that these word segments contain the features, weights and sentence-event-clarity of the crowd segments, which could be much larger.

3.1.3 Spam detection

We use the CrowdTruth worker-worker agreement, worker-sentence score and number of annotations per sentence as the parameters for spam detection. If a worker's number of annotations per sentence is higher than a given threshold, and the worker-worker agreement and worker-sentence score are below their thresholds, the worker is flagged as a spammer.

We ran initial experiments on the pilot set to determine the thresholds for our spam detection and to find the minimal number of checks that had to be passed. We labeled spammers in our initial pilot set by hand and estimated the thresholds for each parameter based on these results.
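A minimal sketch of the flagging rule described above; the threshold values are placeholders to be estimated from the hand-labeled pilot, not the actual values used in our experiments.

```python
def is_spammer(wwa, wss, annotations_per_sentence, thresholds):
    # Flag a worker whose agreement metrics are below their thresholds while
    # the number of annotations per sentence exceeds its threshold.
    return (annotations_per_sentence > thresholds["a/s"]
            and wwa < thresholds["wwa"]
            and wss < thresholds["wss"])

thresholds = {"wwa": 0.2, "wss": 0.3, "a/s": 5.0}  # placeholder values
print(is_spammer(wwa=0.1, wss=0.15, annotations_per_sentence=9.0, thresholds=thresholds))
```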

3.1.4 Weighting

We want to incorporate the uncertainty of a given label while training the classifier. We propose to estimate this uncertainty by adapting the CrowdTruth metrics and incorporating these estimates as weights for a given label. As we are not looking for relations between words or events, we had to adapt the Sentence Relation Score and the Sentence Clarity. Instead of using the unit vector for a given relation to calculate the sentence relation score, we use the unit vector for a given word; we call this adaptation the Sentence Event Score. We then use the Sentence Event Score to calculate the Sentence Clarity.

Sentence Event Score

The Sentence Event Score is the probability of a given sentence expressing a given event (3.1), where E_e is the unit vector for a given event e:

\mathrm{ses}(s, e) = \cos(V_s, E_e) \qquad (3.1)

Sentence Clarity

Sentence Clarity is the maximum event score expressed in a sentence (3.2):

\mathrm{sc}(s) = \max_{e}\bigl(\mathrm{ses}(s, e)\bigr) \qquad (3.2)
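A minimal sketch of (3.1) and (3.2) on a word-indexed sentence vector; since E_e is a unit vector, the cosine reduces to the event's count divided by the norm of the sentence vector.

```python
import numpy as np

def sentence_event_score(sentence_vector, event_index):
    # Equation (3.1): cosine of the sentence vector with the event's unit vector.
    e = np.zeros_like(sentence_vector, dtype=float)
    e[event_index] = 1.0
    return float(np.dot(sentence_vector, e) / np.linalg.norm(sentence_vector))

def sentence_clarity(sentence_vector):
    # Equation (3.2): the maximum event score over all positions.
    return max(sentence_event_score(sentence_vector, i)
               for i in range(len(sentence_vector)))

print(round(sentence_clarity(np.array([0, 2, 2, 5, 4, 4, 1, 8, 10, 11])), 2))
```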


Sentence Clarity

The sentence clarity is defined as the maximum of the sentence-relation scores, i.e. the cosine of the relation's unit vector with the sentence vector [5].

We found that this metric does not account for the possibility that annotators select multiple spans for the same event. This creates a bias towards shorter, well-defined sentences, penalizing longer, more complex sentences for having more possible ways to select the same event, even though these options refer to the same event.

Sentence: "But in the past three months , stocks have plunged , interest rates have soared and the downturn all across Asia means that people are not spending here"

Worker vector 1:   0 0 0 0 0 0 0 1 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1
Worker vector 2:   0 0 0 0 0 0 0 1 1 1 0 1 1 1 1 0 0 1 1 1 1 0 0 1 1 1 1 1
Worker vector 3:   0 0 0 0 0 0 0 1 1 1 0 1 1 1 1 0 1 1 1 1 1 0 0 1 1 1 1 1
Worker vector 4:   0 0 0 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 1 0
Worker vector 5:   0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0
Worker vector 6:   0 0 0 0 0 0 0 1 1 1 0 1 1 1 1 0 0 0 0 0 0 0 0 1 1 1 1 1
Worker vector 7:   0 0 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0
Worker vector 8:   0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0
Worker vector 9:   0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Worker vector 10:  0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0
Worker vector 11:  0 0 0 0 0 0 0 1 1 1 0 1 1 1 1 0 0 1 1 1 1 0 0 0 0 0 0 0
Worker vector 12:  0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0
Worker vector 13:  0 1 1 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0
Worker vector 14:  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
Sentence vector:   0 2 2 5 4 4 1 8 10 11 1 9 9 10 12 2 3 6 6 6 6 0 0 4 4 5 7 4

Table 3.1.1: Vectorized annotations of sentence

These annotations can be clustered into the following crowd-annotated events in this sentence:

1. interest rates have soared
2. stocks have plunged
3. the downturn all across Asia
4. people are not spending here
5. in the past three months

These clusters reflect phrases that the crowd has indicated as an event.

Furthermore while a sentence can be clear an individual event in this sentence might not be.

We propose two new algorithms to calculate the clarity of an event in a sentence, taking into account that clusters of annotations can refer to the same segment.

Sentence-Event-Clarity-Score

The Sentence-Event-Clarity-Score is a score representing how clearly an individual event has been expressed in a sentence, by looking at the agreement of the annotators for that event. The algorithm:

1. Vectorize the worker-annotations, use the word-index as key for each word

2. Cluster the worker-annotations using MeanShift, an algorithm that shifts each data-point to the average data-points in its neighborhood [10]


4. For every worker-annotation calculate which cluster this annotation belongs to by taking the max; a worker-annotation can belong to multiple clusters.

5. Calculate the Sentence-Event-Vector as the sum of all worker-annotation clusters.

6. Calculate the Sentence-Event-Clarity per cluster by taking the cosine between the unit vector and the Sentence-Event-Vector.
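A minimal sketch of steps 2-6, assuming each worker annotation (span) is represented by the midpoint of the word indices it covers; the exact feature used for clustering in our implementation may differ.

```python
import numpy as np
from sklearn.cluster import MeanShift

# Hypothetical worker-annotated spans (start, end) of word indices in one sentence.
spans = [(7, 9), (7, 9), (8, 9), (11, 14), (11, 14), (12, 14), (11, 13)]

# Step 2: cluster the annotations with MeanShift on the span midpoints.
midpoints = np.array([[(start + end) / 2.0] for start, end in spans])
labels = MeanShift(bandwidth=2.0).fit_predict(midpoints)

# Step 5: the Sentence-Event-Vector counts annotations per cluster.
counts = np.bincount(labels).astype(float)

# Step 6: per-cluster clarity = cosine of the cluster's unit vector with that count vector.
clarity = counts / np.linalg.norm(counts)
print(labels, counts, clarity)
```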

Crowd TF-IDF

Our second proposed metric is Crowd TF-IDF, an extension of TF-IDF:

\mathrm{TFIDF}(w) = \log(f + 1) \cdot \log\left(\frac{D}{df}\right) \qquad (3.3)

This metric should reflect the importance of a word for event classification given a crowd-annotated corpus. We calculate the Crowd Term Frequency of a word by dividing the number of times the word has been annotated as an event by the crowd in the given sentence by the total number of times it could have been annotated as an event in that sentence, which in our task is the total number of workers for the given task. The Crowd Inverse Document Frequency is defined as one divided by the log of the number of event annotations for all words in all sentences in this task divided by the total number of annotations for this word in all sentences in this task.
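The sketch below implements one possible reading of these definitions; the exact normalisation used in our implementation may differ, and the counts passed in are illustrative.

```python
import math

def crowd_tf(word, sentence_event_counts, n_workers):
    # Fraction of workers who marked `word` as (part of) an event in this sentence.
    return sentence_event_counts.get(word, 0) / n_workers

def crowd_idf(word, task_event_counts):
    # Total event annotations in the task relative to the annotations this word
    # received, on a log scale (one reading of the definition above).
    total = sum(task_event_counts.values())
    count = task_event_counts.get(word, 0)
    if count == 0 or count == total:
        return 0.0
    return 1.0 / math.log(total / count)

def crowd_tfidf(word, sentence_event_counts, task_event_counts, n_workers):
    return crowd_tf(word, sentence_event_counts, n_workers) * crowd_idf(word, task_event_counts)

sentence_counts = {"plunged": 11, "soared": 12, "months": 4}
task_counts = {"plunged": 30, "soared": 35, "months": 6, "spending": 20}
print(round(crowd_tfidf("plunged", sentence_counts, task_counts, n_workers=15), 3))
```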

3.1.5 Feature Selection

We selected the following features:

• Part of Speech tags
• Modalities
• WordNet Synsets
• WordNet Lemmas

Part of Speech Tags

For our Part of Speech tags we used the Stanford PCFG Parser (Accurate Unlexicalized Parsing), giving us a total of 34 possible tags per word. The POS tags were extracted by parsing an individual sentence and extracting the POS tags on the word level. Furthermore, to preserve the context of a word in a sentence we used a sliding window approach with a range of two, meaning we added the POS-tag features of the two words before and after the word.


Modalities

Modalities are defined as a set of words like can, might, maybe, etc. The modality features are a representation of the modalities we encountered in a given window.

WordNet Synsets

On our training set we extracted all the WordNet Synsets. Synsets are defined as sets of synonyms that share a common meaning; for example, the sets {board, plank} and {board, committee} can serve as unambiguous designators of these two meanings of board [26]. We only extracted the Synsets on our training set [15].

WordNet Lemmas

The words that make up a Synset are called lemmas. For each Synset in our training set we extract all the Lemmas and vectorize them. The WordNet Synsets and Lemmas help us deal with the sparsity of the data, as we are also training on all the synonyms and lemmas of the words even though we have not encountered them [15].

3.2 Feature Selection

For feature selection we use the Chi Square algorithm. In statistics the chi square test can be used to test the independence of two events. In feature selection we can use chi square to calculate whether the occurrence of a feature and a class are independent. We estimate this as follows:

\tilde{\chi}^2 = \frac{1}{d} \sum_{k=1}^{n} \frac{(O_k - E_k)^2}{E_k}

A high chi square score indicates that the hypothesis of independence (the null hypothesis) should be rejected and that the class and feature are dependent on each other. If they are dependent we want to use these features for our classifier; if not, we want to remove them.

Using this method we only select the features that have a test score larger than 10.83, indicating statistical significance at the 0.001 level. As Manning et al. [23] have shown, we do not need to use Yates' correction, as the number of independent features that get selected by the chi square method does not have a significant impact on our classifier.
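A minimal sketch of chi-square feature selection with scikit-learn, keeping only features whose score exceeds the 10.83 threshold used above (toy data, not our actual feature matrix):

```python
import numpy as np
from sklearn.feature_selection import chi2

# Rows are words, columns are features (e.g. POS window, modality, synset flags).
X = np.array([[3, 0, 1],
              [0, 2, 1],
              [4, 0, 0],
              [0, 3, 1]])
y = np.array([1, 0, 1, 0])  # event / non-event labels

scores, p_values = chi2(X, y)
selected = np.where(scores > 10.83)[0]  # indices of statistically dependent features
print(scores, selected)
```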

3.2.1 Algorithm

Consider one of the best-known current text-classification problems as an example: spam detection. For spam detection we can consider a document d whose class is given by c; there are two possible classes, spam and not spam. We then classify d as the class that has the highest posterior probability P(c|d). In our research we are not trying to classify documents but rather individual words, so we can rewrite P(c|d) as P(c|w). Following Bayes' theorem:

P(c \mid w) = \frac{P(c)\,P(w \mid c)}{P(w)} \qquad (3.4)

The class prior P(c) can be estimated by dividing the number of occurrences of that class by the total number of words. The probability P(w \mid c) is defined as the frequency of word w labeled as class c divided by the total frequency of word w, where N_{wc} is the number of occurrences of the word with the class and N_{wt} is the total number of occurrences of the word:

P(w \mid c) = \frac{N_{wc}}{N_{wt}} \qquad (3.5)

As we represent our words as a set of features, N_{wc} is actually the number of times we have seen each feature of this word labeled as this class, and N_{wt} is the number of times we have seen each feature per word in total.

To incorporate the crowd disagreement we introduce a weighting term for the likelihood, denoted by \omega_c; these likelihoods are used by the Multinomial Naive Bayes algorithm to estimate the class of a given feature set:

P(w \mid c) = \frac{N_{wc}}{N_{wt}}\,\omega_c \qquad (3.6)

Substituting (3.6) into (3.4), we can rewrite the posterior as

P(c \mid w) = \frac{P(c)\,\frac{N_{wc}}{N_{wt}}\,\omega_c}{P(w)} \qquad (3.7)

The sum of P(c \mid w) over all classes c is equal to 1.

Data: a set of classes C and a set of documents D
Result: V, prior, condprob

V ← ExtractVocabulary(D)
N ← CountWords(D)
for c ∈ C do
    N_c ← CountWordsInClass(D, c)
    prior[c] ← N_c / N
    for w ∈ V do
        condprob[w][c] ← P(c) · (N_wc / N_wt) · ω_c / P(w)
    end
end
return V, prior, condprob

Algorithm 1: Train Multinomial

3.2.2 Multinomial Naive Bayes

Multinomial Naive Bayes (MNB) models the distribution of words in a document as a multinomial. Our modified MNB models the distribution of labels for a reduced feature set as a multinomial based on the weighting function. We train the modified MNB on the extracted features (Section 3.1.5). Thus our classifier learns to evaluate individual words based on a distribution of labels. Furthermore, we use our proposed weighting schemes to try to improve the performance of our classifier.

Our modified MNB uses the same prediction algorithm as a normal MNB (Algorithm 2). However, we have made changes to the training algorithm: our conditional probability for a given label and token pair depends not only on the number of occurrences of a token in a document but also on the weight of this label (3.6), instead of depending on just the relative frequency [23].

Data: a set of classes C and a vocabulary V, the class priors, the conditional probabilities, and a document d to be classified
Result: predicted class c

W ← ExtractWords(d)
for c ∈ C do
    score[c] ← log prior[c]
    for w ∈ W do
        score[c] ← score[c] + log condprob[w][c]
    end
end
return arg max_{c ∈ C} score[c]

Algorithm 2: Test Multinomial
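The following is a compact Python sketch of the weighted training and prediction described by Algorithms 1 and 2. It keeps the weighting term ω of equation (3.6), but adds Laplace smoothing and drops the constant P(w) normalisation, so it is an illustration of the idea rather than a line-by-line transcription of our implementation.

```python
import math
from collections import defaultdict

def train_weighted_mnb(examples, classes):
    """examples: list of (features, label, weight) where `weight` is the
    crowd-based weight of that label (sentence clarity, event clarity or
    crowd TF-IDF)."""
    class_counts = defaultdict(float)
    feat_class = defaultdict(lambda: defaultdict(float))  # N_wc scaled by weight
    feat_total = defaultdict(float)                        # N_wt
    for features, label, weight in examples:
        class_counts[label] += 1.0
        for f in features:
            feat_class[f][label] += weight
            feat_total[f] += 1.0
    n = sum(class_counts.values())
    prior = {c: class_counts[c] / n for c in classes}
    condprob = {f: {c: (feat_class[f][c] + 1.0) / (feat_total[f] + len(classes))
                    for c in classes}
                for f in feat_total}
    return prior, condprob

def predict(features, prior, condprob, classes):
    # Algorithm 2: the class with the highest log posterior wins.
    scores = {c: math.log(prior[c]) for c in classes}
    for f in features:
        if f in condprob:
            for c in classes:
                scores[c] += math.log(condprob[f][c])
    return max(scores, key=scores.get)

examples = [({"VB", "MD-left"}, "event", 0.8),
            ({"NN", "DT-left"}, "non-event", 0.6),
            ({"VB", "NN-right"}, "event", 0.5)]
prior, condprob = train_weighted_mnb(examples, ["event", "non-event"])
print(predict({"VB"}, prior, condprob, ["event", "non-event"]))
```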


Chapter 4

Dataset

We now apply our approach. As input for our crowd task we use the following dataset, which was specifically chosen because it targets event recognition in text. The experts used a very restrictive set of rules for annotation; when these restrictions are removed there is a lot of possible disagreement in this dataset, and this disagreement will be the basis of the distribution in our model. The output of the crowd task is the input for the metrics and spam detection. Based on the remaining labels, after the spam removal has removed spam labels, we build models based on the weighting schemes.

4.1 TimeML

We used the TempEval-2 dataset from http://timeml.org/tempeval2/. This dataset covers six languages, of which we only used English. For English the dataset has 53,000 tokens consisting of words in documents from newspapers such as the Wall Street Journal and ABC News.

The corpus contains the following data:

1. sentence boundaries
2. the document creation time
3. all temporal expressions according to the TimeML TIMEX3 tag
4. all events according to the TimeML EVENT tag
5. main event markers per sentence
6. all temporal relations between events

The labels given to the words were generated by a majority voting approach of four (natural language) experts.

In our experiments we used a subset of 2175 English training sentences. The test set consisted of the full supplied English test set. For our experiments we only used the English parts of the dataset. As we want to use the crowd to extend these annotations with different points of view, we have created a Crowd Annotated Dataset.

4.1.1 Crowd Annotated Dataset

We gathered this dataset through the CrowdFlower platform. Workers were asked to annotate up to 30 sentences per task; previous work and our pilot showed that when more sentences were added to a task it would not finish. We ran eight tasks over a period of two months, which gave us our Crowd Annotated dataset:

• 230 annotated sentences
• 2175 sentences total
• 17.3 avg annotations per sentence

If we compare the crowd-based dataset with the expert-based dataset we get some interesting statistics. The crowd-annotated dataset has an average of 17.3 annotations per sentence, while the expert dataset has an average of 1.7; this is due to the majority voting that was used for the expert labels. The expert labels are single words with no distribution over the sentence, while our crowd-annotated data has a clear distribution of events per sentence. Furthermore, we have ended up with a relatively small crowd-annotated dataset consisting of only 230 annotated sentences. However, each sentence has on average 17.3 annotations, meaning our classifier can exploit the distribution of these annotations to still learn effectively. Lastly, the crowd-annotated dataset is clearly annotated in phrases, not single words. This created the need for word-level segmentation of our crowd-annotated dataset.


Chapter 5

Experimental Setup and Results

In this chapter we show the results of our approach. As we have taken a layered approach, where each step depends on the previous one, we also present the results in this manner. We start with the results of our crowd task and spam detection. Next we look at the proposed weighting schemes and what effect they have on the distribution of the data. We then reintroduce our performance measures (accuracy and rate of learning) and apply them to the results of our classifier. For each weighting approach and Majority Voting approach we show the results using our classifier.

5.1 CrowdTruth

As previously described, our goal is to create a task where workers can annotate sentences from our dataset with maximal disagreement. To test if we succeeded we first ran a pilot. Our initial pilot consisted of 48 sentences in a single task. The goal was to find how many sentences a worker could annotate before either getting bored, deciding the task took too long, or starting to spam.

Our results show that 30 sentences is the maximum number of sentences we can ask a worker to annotate and still expect the task to be finished in less than a month; any more and the worker stops before finishing the task. However, there is no such limit on the number of tasks a worker is willing to do, meaning workers prefer shorter tasks over longer repetitive tasks.

After this pilot we ran the same task, incorporating the lessons learned, on the full dataset. Running our crowd task took two months to get 230 sentences annotated by 15 workers each. We ended up with 23,471 event labels containing 853 unique events.

5.1.1 Spam detection

An example of a spammer can be found in Table 5.1.1 row seven. This worker selected a single noun as annotation. While a spammer is not flagged on a single annotation for a single sentence this example does give an idea of the kind of annotations spammers tend to make.


WorkerID Annotations

31130954 [The quarter was terrible] [the future looks anything but encouraging]

32506823 [quarter] [future]

30901412 [quarter was terrible] [future looks anything but encouraging ]

11521789 [The quarter was terrible] [future looks anything but encouraging]

32352104 [looks]

29216801 [quarter] [future]

2037537 [quarter]

31994588 [The quarter was terrible]

31065274 [the future looks anything but encouraging] [The quarter was terrible]

33274628 [was] [looks]

7111544 [was terrible] [ future looks]

Table 5.1.1: Ten crowd annotations for ”The quarter was terrible , and the future looks anything but encouraging . ”

Figure 5.1.1: Worker Metrics Distributions, showing the Worker Worker Agreement (WA) , the Worker Sentence Score (WSS) and the Average Annotations per Sentence per Worker (A/S)


We labeled spammers in our initial pilot set by hand and estimated the thresholds of the parameters by taking, for each metric, its average minus one standard deviation. If a worker's metric score is higher than the threshold, the worker is not a spammer according to that metric. Furthermore, based on our pilot we calculated on how many of the metrics a worker has to be flagged before being considered a true spammer. This approach is in line with the CrowdTruth methodology.

Our early results showed that a spammer would always have worker-worker agreement scores and worker-sentence scores below the thresholds. To confirm our estimations we ran a second pilot and labeled the spammers by hand again. We then used our previously estimated thresholds to find the spammers. In our second pilot we were able to find four out of four spammers, while not flagging any non-spammers as spammers.

Spammers in task

Workers 38

Spammers 4

Table 5.1.2: Spammers in pilot

Figure 5.1.2: Metrics Distribution with spammers. (a) Worker-Sentence Score with spammers; (b) Worker-Worker Agreement with spammers.

We found that our method was successful in removing all spammers in our pilot while not removing any non spammers. We then applied this spam removal to the full dataset. As this dataset and task are similar to the pilot we assume that our spam detection works as well on our full dataset.

5.1.2 Labeling

As our approach created a new set of labels, we wanted to see if the outcome was comparable to the supplied annotated set. To do this we calculated the intersection between the crowd majority labels and the expert labels. Our crowd selected 776 tokens as events; the experts agree on 692 of those instances, meaning we have a shared set of events of 89.17%.


Figure 5.1.3: Metrics Distribution without spammers. (a) Worker-Sentence Score without spammers; (b) Worker-Worker Agreement without spammers.

To further see the effects of this labeling method we looked at the POS tags of words and context words that were tagged as events.

Table 5.1.3: Context Words POS Tag Frequencies

         Word+2  Word+1  Word  Word-1  Word-2  Total
Crowd        18      16    11      16       5     66
Expert       19      13     4      17       0     53

Table 5.1.4: Context word Weights

                              Average  Standard Deviation  Variance
Expert context word weights      1.0                 0.0       0.0
Crowd context word weights       0.51                0.20      0.048

5.2 Weighting

5.2.1 Sentence Clarity

We calculated the Sentence Clarity for each crowd-annotated sentence in the dataset after removing spammers. Looking at our example sentence "But in the past three months, stocks have plunged, interest rates have soared and the downturn across all Asia means that people are not spending here", we get a sentence clarity of 0.39. As sentence clarity is on a scale of 0 to 1, where 1 is the clearest sentence, this is a low score.


Figure 5.2.1: Histogram of Sentence Clarity Score on train set

Mean                 0.494
Median               0.4734
Standard Deviation   0.129

Table 5.2.1: Sentence Clarity Statistics


5.2.2 Event-Sentence Clarity

We calculated the Event-Sentence Clarity for each crowd-annotated sentence in the data set after removing spammers. Every word in the sentence gets the same weight.

Looking at our example sentence ” But in the past three months, stocks have plunged, interest rates have soared and the downturn across all Asia means that people are not spending here ”

MeanShift Clusters

Cluster 0    stocks have plunged
Cluster 1    interest rates have soared
Cluster 2    the downturn all across Asia
Cluster 3    people are not spending here
Cluster 4    in the past three months

Table 5.2.2: MeanShift Clusters and words


See table 5.2.2 for the extracted segments. These clusters lead to the Sentence-Event-Vector [7, 4, 6, 6, 3], which corresponds to the Sentence-Event-Clarity scores [0.58, 0.33, 0.49, 0.49, 0.25]. These were computed by taking the cosine of the unit vector with the frequency of each individual cluster. When compared to the CrowdTruth Sentence Clarity score of 0.39, we see that the second and last clusters have a lower score, while the first, third and fourth clusters have a much higher score. The events are now weighted as individual events according to their importance to the crowd. When using this metric in learning we can use the specific Sentence-Event-Clarity that corresponds to an event instead of generalizing over all the events in the sentence.

Figure 5.2.2: Histogram of Sentence-Event-Clarity Score on train set

Mean 0.217

Median 0.212

Standard Deviation 0.139

Table 5.2.3: Sentence-Event-Clarity Statistics

5.2.3 Crowd-TFIDF

We calculated the Crowd-TFIDF over the full crowd-annotated dataset after removing spammers. As Crowd-TFIDF is calculated on the word level, there are many more scores than for the sentence clarity.

Figure 5.2.3: Histogram of Crowd TF-IDF Score on train set

Mean 0.262

Median 0.200

Standard Deviation 0.204

Table 5.2.4: Crowd TF-IDF Statistics

5.2.4 Majority Voting

We calculated the Crowd Majority Votes in two different ways. First, we only took the word with the most annotations as the only event in the sentence. Our second method was to evaluate per word whether the majority voted this word as an event.

5.3 Naive Bayes

5.3.1 Performance

We measure the performance in all following experiments by looking at the accuracy of the classifier and the rate of learning. For the rate of learning we created the features based on the data we have seen so far, increasing the amount of data each iteration. In all experiments we test against the expert-annotated test set.

Method                          Majority voted Events
One Event per Sentence                            226
Multiple Events per Sentence                      776
Expert Events                                     692

Table 5.2.5: Events annotated

5.3.2 Majority Voting Naive Bayes

As the expert labels were selected by Majority Voting, we can compare them to a crowd Majority Voting scheme. We are interested in the accuracy and the learning curves. As an experiment we ran a Multinomial Naive Bayes (MNB) classifier that used the crowd Majority voted labels as input. This input was generated by selecting, for the event class, the words that over half of the annotators annotated as an event. We compared this with an expert-trained MNB classifier, which was trained on the Expert Majority voted labels. The features remained the same.

To show the influence of using different weighting methods, we set a baseline by training on crowd majority voted labels and comparing against the classifier that was trained on expert labels. These labels only have spammers removed.

To show the effect of removing spammers we have also included the results without spam removal.

Label-set F1 Accuracy

Crowd Majority Voting(Multiple events per sentence) 0.61 0.51

Crowd Majority Voting (Single events per sentence) 0.59 0.48

Expert Majority Voting 0.72 0.66

Table 5.3.1: Majority Voting Accuracy comparison

As the results of Crowd Majority Voting (single event per sentence) are so poor, we did not include them in further experiments.

5.3.3 Crowd Weighted Multinomial Naive Bayes

We trained our MNB using different weighting schemes: Crowd Majority Votes (CM), Expert Majority Votes (EM), Crowd labels with Sentence Clarity (Clarity), Crowd labels with Sentence-Event-Clarity (EClarity) and Crowd labels with Crowd TF-IDF (TF-IDF).

Figure 5.3.1: Learning Curves. (a) F1 learning curves; (b) Accuracy learning curves.

Label-set                                             F1    Accuracy
MNB Crowd Majority Votes                            0.61        0.51
MNB Expert Majority Votes                           0.72        0.66
MNB Crowd labels Clarity weighted                   0.38        0.31
MNB Crowd labels Sentence-Event-Clarity weighted    0.44        0.32
MNB Crowd labels TF-IDF weighted                    0.79        0.70

Table 5.3.2: Accuracy of MNB trained with CrowdLabels and weights

Figure 5.3.2: Learning Curves with weights. (a) F1 learning curves; (b) Accuracy learning curves.

Figure 5.3.3: Spammer Learning Curves with weights. (a) F1 spammer learning curves; (b) Accuracy spammer learning curves.


Chapter 6

Discussion

In this chapter we discuss the results presented in chapter 5. We first discuss the main findings of our study, before going deeper into the reasons behind these findings. We use the term multilabel to describe the large set of crowd-labeled data with noise, and the term single label to describe the relatively small set of expert-labeled data with little to no noise.

The results in table 5.3.2 confirm our initial hypothesis that we can use crowd-annotated labels as input for a distribution-based approach. The table shows how well our Crowd TF-IDF, Sentence Clarity and Sentence-Event Clarity weighting approaches compare against majority voting and expert labeling. The Crowd TF-IDF weighting approach leads to an event classifier that outperforms all other approaches on both F1 and accuracy. Furthermore, figure 5.3.2 shows the learning curves for each approach, and demonstrates that the Crowd TF-IDF approach learns the most effectively given the training samples.

These results show that a classifier trained on a relatively large set of noisy crowd-labeled data can perform as well as, or outperform, a classifier trained on a relatively small set of expert-labeled data with little to no noise, even when tested against a test set consisting of expert-labeled data with little to no noise. This confirms our hypothesis that such data can be annotated by either experts or the crowd.

Our results also show that a single-label-based approach needs at least 5,500 samples (figure 5.3.2) before it starts to reach an F1 score of 0.65, while our Crowd TF-IDF approach already reaches this performance at 1,000 samples. However, the slopes of the learning curves are vastly different, which suggests that the expert-based approach will eventually reach similar or higher performance on both the F1 score and the accuracy. This is further supported by the research of Bethard and Martin, who report an F1 score of 0.820 [7] using 90% (63,000 tokens) of the supplied documents as training set.

These results suggest that, depending on the size of the training set, one can choose either a single-label or a multilabel distribution approach.

We can further see that any clarity-based approach or a crowd-based Majority Voting approach has very little success. The Sentence Clarity weighting scheme was expected to under-perform as it gives the same weight to all words in a sentence, leading to noisy weights. Another possible explanation might be found in table 5.2.1 and figure 5.2.1: the data follows a normal distribution with a standard deviation of 0.129, which means that the bulk of the feature weights (within one standard deviation of the mean) lie between 0.344 and 0.602. There is perhaps not enough distinction between these weights for them to help in training.

The Sentence-Event Clarity metric was proposed as a solution for the weaknesses of Sentence Clarity, and while it does outperform Sentence Clarity, it does not outperform the single-label (expert) approach or even crowd Majority Voting. We suspect that this, combined with the performance of Sentence Clarity, indicates that how well a sentence or event is agreed upon between annotators is not a good indication of the sample weight. It could also be that while Sentence Clarity is too general, Sentence-Event Clarity has trouble forming fitting clusters using the mean-shift algorithm. If this is the case, a better clustering algorithm could improve performance.

However, the most likely explanation is the difference in segmentation. The test set has been segmented on word level: each word either has or lacks an event label. Sentence Clarity is a sentence-segmented approach, so learning to classify individual words with it becomes difficult. Sentence-Event Clarity is a phrase-segmented approach, again making it difficult to learn to classify individual words. Crowd TF-IDF is a word-segmented approach, which matches the segmentation of the test set. The results reflect this explanation.

Furthermore, the difference between the histograms of the weights has to be noted. If we compare figure 5.2.1 to figure 5.2.2, we can see that Sentence Clarity yields a normal distribution centered around 0.45. This is due to weighting all words in a sentence equally and was expected. The Sentence-Event Clarity histogram, however, shows an entirely different picture. Comparing the two graphs is made difficult by the different approaches, but we can see that the segmentation approach yields more segments while also introducing more low scores.

If we compare figure 5.2.3 to figure 5.2.2, we can once again see a shift to the left. Crowd TF-IDF segments on word level whereas Sentence-Event Clarity segments on phrase level, so we once more introduce more tokens, but fewer of them have high clarity scores.

It is also visible in our results that some of our classifiers have not yet reached their full potential with the amount of training data we have given them. It could very well be that Crowd TF-IDF is able to reach a higher F1 and accuracy, but we cannot say this for sure. We suggest further research with a bigger training set.

Our main results also show that majority voted crowd labels are outperformed by the majority voted expert labels. We suspect this is due to the vast differences in the task: while the experts were allowed to annotate only a single word, the crowd was given the option to select spans. Furthermore, no definition of an event was given to the crowd, while the experts had very strict guidelines on what counts as an event. We can see the effects of this in tables 5.1.3 and 5.1.4. Stricter guidelines most likely would have led to a much better F1 and accuracy for the crowd Majority voted classifier. However, the freedom given to the crowd in annotation has led to a more diverse instantiation of the context words (table 5.1.4), which could be one of the reasons our Crowd TF-IDF algorithm is able to learn much faster. A task closer to the original annotation scheme might have let us skip the segmentation step in our preprocessing, but it could also have cost us the context weights.

Spam removal is clearly not an optional step when using a distribution-based method, as can be seen in figures 5.3.3a and 5.3.3b. The performance of all distribution-based methods suffers severely when spammers are not removed, while the majority voted approach does not suffer at all. This is because the spammers are a minority, and when using the majority voted approach they are automatically filtered out.

The use of the metrics for spam detection seems to be validated by figure 5.1.1: both the worker-worker agreement score and the worker-sentence score have a bell-shaped form and resemble a normal distribution. As both metrics measure agreement, anyone found to highly disagree with the rest is likely a spammer. However, it is important not to force the agreement too far, as this would remove the weighted distribution used in training.
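To make this thresholding idea concrete, the sketch below flags workers whose worker-worker agreement lies far below the mean of the (roughly normal) agreement distribution. This is a simplified illustration under our own assumptions, not the exact spam filter used in our pipeline, and the cut-off of two standard deviations is an arbitrary example value.

    import numpy as np

    def flag_spammers(agreement_scores, n_std=2.0):
        # agreement_scores maps worker id -> worker-worker agreement score.
        scores = np.array(list(agreement_scores.values()))
        threshold = scores.mean() - n_std * scores.std()
        return {worker for worker, score in agreement_scores.items()
                if score < threshold}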

Lastly, the implementations of the weighting schemes have been set up to only distinguish between the appearance or absence of one single class; they cannot be applied to multiple classes without extending these schemes.

We've shown a number of steps that can be taken to prepare a distribution of labels for learning, and how, using the metrics and spam removal, we do not lose the important information. Furthermore, using this distribution, our results show an improvement in both learning speed and accuracy compared to more traditional methods.

6.1 Future work

As future work we would like to gather a larger crowd-annotated dataset in multiple domains to test the effectiveness of our proposed methods. We would also like to explore different weighting schemes, such as log idf and log tf, which have proven to be effective in similar approaches.
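As an indication of what such schemes look like, the sketch below shows common sublinear tf and logarithmic idf variants. These are standard formulations from the information retrieval literature, not weighting schemes we have evaluated.

    import math

    def sublinear_tf(tf):
        # Sublinear term-frequency scaling: 1 + log(tf) for tf > 0, else 0.
        return (1.0 + math.log(tf)) if tf > 0 else 0.0

    def log_idf(n_docs, doc_freq):
        # Logarithmic inverse document frequency: log(N / df).
        return math.log(n_docs / doc_freq)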

Furthermore, we took an open annotation approach in our annotation gathering step; an annotation task that is stricter and follows the same guidelines as the expert annotations would allow for an easier comparison between expert Majority Voting and crowd-based approaches. In that case the segmentation step could also be skipped.

Lastly, as our Sentence-Event Clarity algorithm does not perform as well as hoped, we suggest that a different clustering algorithm, such as k-NN, might form the annotation clusters used in the Sentence-Event Clarity algorithm better.

6.2 Conclusion

Our results show that using less data with more annotations, gathered from a set of different annotators, can be a viable alternative to collecting a large amount of data. To use the multiple annotations, a novel preprocessing step is required; this step allows us to capture the agreement between annotators and use it as a sample weight. Furthermore, using these sample weights as a distribution, our results show an improvement in both learning speed and accuracy compared to more traditional methods.


Preparing a viable crowd-sourcing task is key when using this method, as the crowd has on average very little expertise. Creating clear, easy-to-understand tasks and having various spam detection methods in place can only help.

However, comparing a crowd-based approach to an expert-based approach proves difficult due to the fuzzier nature of a crowd-based approach. We've tried to address this by introducing the segmentation step. Making sure that the segmentation is similar in both train and test set will lead to the best results; this needs to be taken into account when designing a crowd annotation task.


Bibliography

[1] Charu C Aggarwal and ChengXiang Zhai. Mining text data. Springer Science & Business Media, 2012.

[2] Jeremy Ang, Rajdip Dhillon, Ashley Krupski, Elizabeth Shriberg, and Andreas Stolcke. Prosody-based automatic detection of annoyance and frustration in human-computer dialog. 2002.

[3] Lora Aroyo and Chris Welty. Measuring crowd truth for medical relation extraction. In AAAI 2013 Fall Symposium on Semantics for Big Data, 2013.

[4] Lora Aroyo and Chris Welty. The Three Sides of CrowdTruth. Journal of Human Computation, 1:31–34, 2014.

[5] Lora Aroyo and Chris Welty. Truth is a lie: Crowd truth and the seven myths of human annotation. AI Magazine, 36(1):15–24, 2015.

[6] Ricardo Baeza-Yates, Berthier Ribeiro-Neto, et al. Modern information retrieval, volume 463. ACM press New York, 1999.

[7] Steven Bethard and James H Martin. Identification of event mentions and their semantic class. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 146–154. Association for Computational Linguistics, 2006.

[8] Lukas Biewald. Massive multiplayer human computation for fun, money, and survival. In Andreas Harth and Nora Koch, editors, Current Trends in Web Engineering, volume 7059 of Lecture Notes in Computer Science, pages 171–176. Springer Berlin Heidelberg, 2012.

[9] Daren C. Brabham. Crowdsourcing as a model for problem solving – an introduction and cases, 2008.

[10] Yizong Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, Aug 1995.

[11] Nancy Chinchor, Erica Brown, Lisa Ferro, and Patty Robinson. 1999 named entity recognition task definition. MITRE and SAIC, 1999.

[12] A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics, pages 20–28, 1979.


[13] Doug Bond, J. Craig Jenkins, Charles L. Taylor, and Kurt Schock. Mapping mass political conflict and civil society: Issues and prospects for the automated development of event data. The Journal of Conflict Resolution, 41(4):553–, 1997.

[14] Dayne Freitag. Machine learning for information extraction in informal domains. Mach. Learn., 39(2-3):169–202, May 2000.

[15] Julio Gonzalo, Felisa Verdejo, Irina Chugur, and Juan Cigarran. Indexing with wordnet synsets can improve text retrieval. arXiv preprint cmp-lg/9808002, 1998.

[16] M. Hirth, T. Hossfeld, and P. Tran-Gia. Cost-optimal validation mechanisms and cheat-detection for crowdsourcing platforms. In Innovative Mobile and Internet Services in Ubiquitous Computing (IMIS), 2011 Fifth International Conference on, pages 316–321, June 2011.

[17] Jeff Howe. Crowdsourcing: Why the Power of the Crowd Is Driving the Future of Business. Crown Publishing Group, New York, NY, USA, 1 edition, 2008.

[18] Oana Inel, Khalid Khamkham, Tatiana Cristea, Anca Dumitrache, Arne Rutjes, Jelle van der Ploeg, Lukasz Romaszko, Lora Aroyo, and Robert-Jan Sips. CrowdTruth: Machine-Human Computation Framework for Harnessing Disagreement in Gathering Annotated Data. In The Semantic Web–ISWC 2014, pages 486–504. Springer, 2014.

[19] Panagiotis G. Ipeirotis. Analyzing the amazon mechanical turk marketplace. XRDS, 17(2):16–21, December 2010.

[20] Ashraf M. Kibriya, Eibe Frank, Bernhard Pfahringer, and Geoffrey Holmes. Multinomial naive bayes for text categorization revisited. In Geoffrey I. Webb and Xinghuo Yu, editors, AI 2004: Advances in Artificial Intelligence, volume 3339 of Lecture Notes in Computer Science, pages 488–499. Springer Berlin Heidelberg, 2005.

[21] Diane J. Litman. Annotating student emotional states in spoken tutoring dialogues. In In Proc. 5th SIGdial Workshop on Discourse and Dialogue, pages 144–153, 2004.

[22] Hector Llorens, Estela Saquete, and Borja Navarro-Colorado. TimeML events recognition and classification: Learning CRF models with semantic roles. In Proceedings of the 23rd International Conference on Computational Linguistics, COLING ’10, pages 725–733, Stroudsburg, PA, USA, 2010. Association for Computational Linguistics.

[23] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, Cambridge, July 2008.

[24] Andrew McCallum, Kamal Nigam, et al. A comparison of event models for naive bayes text classification. In AAAI-98 workshop on learning for text categorization, volume 752, pages 41–48. Citeseer, 1998.

[25] Danielle S. McNamara, Eileen Kintsch, Nancy Butler Songer, and Walter Kintsch. Are good texts always better? Interactions of text coherence, background knowledge, and levels of understanding in learning from text, 1993.
