MSc Artificial Intelligence
Master Thesis

Modelling task and worker correlation for crowdsourcing label aggregation

by
Ioanna Sanida
10876812

June 19, 2020
36 ECTS, October 2019 - June 2020

Supervisor: Ms Dan Li
Assessor: Dr Evangelos Kanoulas

University of Amsterdam


Abstract

by

Ioanna SANIDA

Labelled datasets are a popular field of machine learning research, both in academia and in industry. Obtaining high-quality annotated datasets is a process that has been accelerated since the introduction of crowdsourcing services such as Amazon Mechanical Turk and CrowdFlower. These services have made it possible to label large-scale datasets in an efficient, low-cost and time-saving way. However, the quality of the labelled items is often inadequate and we often obtain noisy or low-quality labels. Workers may lack knowledge of a particular topic and thus annotate items incorrectly, or, given the monetary reward, purposely focus on the quantity of labelled items rather than their quality. Most existing studies that focus on quality control of crowdsourced data and denoising crowdsourced labels use probabilistic graphical models to infer the true label from noisy annotations. In this work we leverage additional information, namely the correlation between task and worker, and integrate the abundant information associated with it, by proposing a model which extends the GLAD model [35] and jointly solves the three problems of inferring the expertise of each worker, the difficulty of each task, and the most probable label for each task.


Acknowledgements

I would like to thank Dr. Evangelos Kanoulas for giving me the opportunity to work with his team on this interesting research topic.

Moreover, I would like to give special thanks to my daily supervisor and PhD candidate Dan Li for her insightful guidance throughout the whole implementation of this thesis, as well as her valuable help with the formulation of the experiments, and putting me on track.


Contents

Abstract
Acknowledgements
1 Introduction
  1.1 Motivation
  1.2 Research Questions
  1.3 Contributions
  1.4 Outline of Thesis
2 Related Work
  2.1 Probabilistic models for label aggregation
  2.2 Correlation approaches
  2.3 Inference method
3 Methodology
  3.1 Preliminary: GLAD Probabilistic Model
  3.2 Probabilistic Model
  3.3 Inference
4 Experimental Setup
  4.1 Datasets
    4.1.1 Synthetic Datasets
    4.1.2 Real-World Datasets - Manual labels
    4.1.3 Real-World Datasets - Mixture of manual and automatic labels
  4.2 Baselines
    4.2.1 Majority Voting (MV)
    4.2.2 Dawid & Skene (EM)
    4.2.3 Generative model of Labels, Abilities, and Difficulties (GLAD)
  4.3 Evaluation Metrics
5 Results and Discussion
  5.1 True Label Inference
  5.2 Qualification test
  5.3 Impact of annotation quality and annotation budget
  5.4 Model's ability to learn its parameters
6 Conclusion and Future Work
  6.1 Conclusion
  6.2 Future Work


List of Figures

3.1 Graphical Representation of GLAD model
3.2 Graphical Representation of GAMMA model
4.1 Synthetic Dataset Generation
5.1 Experiment 1 - Accuracy on Synthetic Datasets
5.2 Experiment 1 - F1 Score on Synthetic Datasets


List of Tables

4.1 An overview of the Pseudo dataset configurations
4.2 An overview of the cs2010, cs2011 dataset configurations
4.3 An overview of the TREC5-11 dataset configurations
4.4 All method overview
5.1 Accuracy and F1-Score of a selected synthetic dataset
5.2 Accuracy and F1-Score of a selected real-world dataset
5.3 Accuracy and F1-Score of a selected synthetic dataset using qualification test
5.4 Accuracy and F1-Score of a selected real-world dataset using qualification test
5.5 Experiment 1 - Results for different annotation budget on synthetic and real-world data
5.6 Experiment 1 - Results of different annotation quality on selected synthetic dataset (M=5, N=100)
5.7 Results for different annotation budget on synthetic and real-world data
5.8 Results of different annotation quality on selected synthetic dataset (M=5, N=100)
5.9 MAE of $\alpha$ values on selected synthetic dataset


Chapter 1

Introduction

1.1 Motivation

In the age of data abundance and machine learning prevalence in multiple domains, it is crucial to properly use the enormous amount of available data, which is needed to train machine learning models. Humans perceive the real world by first observing environmental variables and then classifying them into categories according to certain properties or characteristics. This is not far from how machine learning models are currently trained on available datasets in order to solve a plethora of problems. Among the various challenges of machine learning is the lack of labelled data, i.e. datasets which are tagged with one or more labels in order to identify and further classify the properties of each item of interest in the dataset. The importance of labelled data lies within the training process of a machine learning model: a model is trained on a dataset whose labelled values are used as a ground truth, and then tested on unlabelled data with the same characteristics, using this ground truth, in order to produce a final output of high accuracy.

Crowdsourcing has revolutionized the gathering of labelled data by letting crowds of workers (humans or algorithms) annotate items at a very low cost [30]. Crowdsourcing platforms such as Amazon Mechanical Turk [4], [6] or CrowdFlower [36] are distinctive examples of acquiring massive amounts of labels from crowds. Despite the increased efficiency and high speed, a common issue that emerges from this technique is the compromised quality of the labels across different domains. That is due to the fact that various workers can label the same items, whether they are domain experts or not [27]. This is an important issue for specialized domains, where item classification is more difficult and requires expert knowledge. Moreover, due to the anonymous nature of crowdsourced labelling and misaligned incentives, we observe cases of adversarial or spam workers [10]. Consequently, the obtained labels for items that require a level of domain expertise might be very noisy and of low quality. Thus, acquiring accurate labels from crowdsourcing platforms has become a bottleneck for progress in machine learning.

To overcome this obstacle, the labels given to each item by multiple workers are aggregated collectively and the true label for each instance is then inferred. The most simplistic method for this is Majority Voting [13]. Inferring the true label has been the subject of many studies that model label aggregation techniques and improve their accuracy from various standpoints. These modelling assumptions are then used to infer the true label of each item, the worker's expertise and the item's difficulty.

There are several approaches that model worker expertise and item difficulty as parameters of importance [21], [5], [33], [26], [19], [20], [24], [23]. The first advanced work on label aggregation is presented in [8], where they assume a global item difficulty for all workers and a global worker expertise for all items. However, these methods assume that all workers have the same level of expertise when they label an item that belongs to a specific domain. Moreover, it is implied that all items of a domain have the same level of difficulty, which is not the case in most real-life tasks. To address this issue, [35] proposes that labels are generated by a probability distribution over labels, workers and items. However, this also assumes that an item's difficulty is globally identical for all workers, and that a worker's expertise is globally identical for all items, which fails to capture the correlation among items and workers. In practice, it is quite common that workers of high expertise tend to label items more accurately, i.e. giving labels related to the true label of the item. Similarly, items of high difficulty receive a more diverse range of labels compared with easy items.

In this thesis, we extend the work of Whitehill et al. [35] by encoding the correlation of workers and items. We model worker-wise item difficulty and task-wise worker expertise, and by incorporating this information we aim to yield superior performance in terms of inferring the true label, as well as in terms of learning the parameters of interest. More specifically, we formulate a probabilistic model of the labelling process in order to infer the true label of the items more precisely. Our model aims to correctly infer the most accurate label for each item, as well as to infer each worker's expertise parameter $\alpha$, each item's difficulty parameter $\beta$, and finally the correlation between the worker and the item, named $\gamma$. Our newly proposed model will be referred to as the GAMMA model.

For the inference part, we use the Expectation-Maximization (EM) approach [14] in order to obtain the maximum likelihood estimates of the aforementioned parameters. We then observe how the proposed model performs in several dataset configurations: synthetic data, real-world data with manual annotations, and finally a mixture of manual and automatic annotations. We explore the performance in terms of accurately inferring the latent true label using variations of the annotation budget and annotation quality. We compare the results of our GAMMA model with the GLAD model [35] baseline, as well as the Majority Vote (MV) method, which is the most commonly used, and finally the Dawid & Skene model (EM), which uses confusion matrices to model each worker. The proposed GAMMA model outperforms all the baselines for different annotation budgets and annotation quality on several small-scale synthetic and real-world dataset configurations.

Next, we perform qualification tests in order to see how well the GLAD and GAMMA models learn their parameters of interest (worker expertise and item difficulty), by computing the Mean Absolute Error of the obtained values and performing an extensive analysis of each model's quality and stability. It is observed that GAMMA outperforms the GLAD baseline on some sets of synthetic data. In these cases, the GAMMA model learns its parameters of interest more accurately and with great stability, as we see from the low variance of the MAE.

Modelling the correlation between workers and items greatly improves the inference performance of the latent true label. We see that our novel method beats the baselines and increases the overall accuracy remarkably, maintaining superior quality on small-scale datasets. The inference complexity of our model does not allow for scaling up to very large datasets, as it becomes computationally expensive.

1.2 Research Questions

The main research question of this master thesis project is whether we can infer the latent true label from noisy crowdsourced labels more accurately than the existing methods. In order to do so, we aim at answering the following four sub-questions:

First, we examine whether modelling worker-wise item difficulty or task-wise worker expertise brings better results than the current methods, which consider item difficulty to be globally identical for all workers and worker expertise to be globally identical for all items. We want to compare the proposed method with the baselines in terms of truth inference accuracy.

Research question 1. Can the true label of an item be inferred accurately by proposing a new model that captures the correlation between worker expertise and item difficulty?

Furthermore, we examine whether the performance of the model can be improved when we use a pre-estimation of worker expertise and item difficulty, i.e. when we know the ground truth worker expertise and ground truth item difficulty.

Research question 2. Can a pre-estimation of worker expertise and item difficulty help the model to infer latent true label more accurately?

To this end, we further investigate whether inferring the true label can be affected by other factors, such as a different number of annotations or adding some noise to the crowd label. The motivation behind this is that a larger amount of annotations could provide more information regarding the item's difficulty or the worker's expertise. Moreover, by adding some noise to the crowd label when the correlation gets too low or too high, we aim to observe improved performance.

Research question 3. Can the performance of the proposed model be influenced by different annotation budget and annotation quality?

Finally, we evaluate how correctly the proposed model learns its parameters of interest: worker expertise, item difficulty and their correlation. We compare our method's performance with the GLAD baseline in terms of Mean Absolute Error and examine the stability of the results by measuring the variance of the error.

Research question 4. How well does the proposed model learn its parameters of interest (worker expertise, item difficulty and their correlation)?

1.3 Contributions

The main contributions of this thesis are the following:

∙ A novel probabilistic model that boosts true label inference from noisy crowdsourced annotations by modelling the correlation between items and workers, together with the corresponding inference algorithm.

∙ Empirical validation of the method on both synthetic datasets and real-world datasets.

∙ A thorough analysis of the influence of the annotation budget and the annotation quality on model performance.

∙ An extensive study on the working mechanism of the model with respect to its parameters of interest.

1.4 Outline of Thesis

The rest of this thesis is organized as follows:

Chapter 2 gives an extensive overview of all the related work that has been done in the field of denoising crowdsourcing label aggregation, from the most simplistic approaches to the current state-of-the-art methods.

In Chapter 3, we introduce the novel probabilistic model that captures correlation information between worker expertise and item difficulty, and the algorithmic approach used for inference.

Chapter 4 introduces the datasets, the baselines and the metrics for this project.

In Chapter 5 we report and further analyse our results.

Finally, in Chapter 6 we summarize the findings of this work and provide ideas for future improvements or extensions of this model.


Chapter 2

Related Work

In this chapter we focus on the work that has been done in the field of label aggregation and the truth inference problem. We will first introduce the approaches that have been followed so far, and then present the areas of interest related to the research questions defined previously. We will explain the models that have been used as baselines, and more specifically the GLAD model [35], which we extend in order to incorporate correlation information.

2.1 Probabilistic models for label aggregation

The majority of research work in label aggregation has been focusing on inferring the true label and estimating the competence of a worker [8], [35], [22], [9], [24], [33], [5], [21], [37], [23], [12], [18].

The simplest method of label aggregation for inferring the true label is Majority Voting (MV). MV takes the answers given by the workers as the truth and assumes that all workers are equally competent, i.e. that they make the same contribution. This is the main drawback of the method, together with the fact that it assumes only a small number of responses per task, making aggregation on such data unreliable [17].
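As a point of reference, the sketch below is a minimal Majority Voting implementation (a hypothetical helper written for illustration, not code from this thesis): it counts the labels given to each item and returns the most frequent one, breaking ties arbitrarily.

```python
from collections import Counter, defaultdict

def majority_vote(labels):
    """labels: iterable of (worker_id, item_id, label) triples.
    Returns a dict item_id -> most frequent label (ties broken arbitrarily)."""
    votes = defaultdict(Counter)
    for worker, item, label in labels:
        votes[item][label] += 1
    return {item: counts.most_common(1)[0][0] for item, counts in votes.items()}

# Example: three workers label item "q1"; the majority label 1 wins.
print(majority_vote([("w1", "q1", 1), ("w2", "q1", 1), ("w3", "q1", 0)]))  # {'q1': 1}
```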

In addition to truth inference, worker reliability is a parameter of high impact. The intuition behind it is that a more competent worker will have higher probability of correctly labelling an item. Our research focuses not only on these two, but also takes into account another parameter, which is the difficulty of the item. That is because an item of higher difficulty may be labelled incorrectly more frequently, so finding the correlation of these parameters could assist in improving the inference of the true label.

Notation. Let us consider $M$ workers that label $N$ items into 2 classes (0 or 1). We want to infer the latent true label $z_j$ of item $j$. Let $y_{ij}$ be the label given by worker $i$ to item $j$.

The first approach to label aggregation was proposed by Dawid and Skene in 1979 [8]. They used medical data that contained information about patients' history, labelled by various workers (clinicians). Since not all workers give the same label for the same item, they assumed that the labels are independent given the true responses. They used a $K \times K$ confusion matrix, where $K$ is the number of classes, to parametrise the worker labels given the item's true annotation $P(y_{ij} \mid z_j)$:

$$v_{ikl} = P(y_{ij} = l \mid z_j = k)$$

They also used a categorical distribution to parametrise $P(z_j)$, introducing $\tau_k = P(z_j = k)$.

Then they used maximum likelihood estimation in order to aggregate the labels from the workers and infer the true label of the items. Their assumption was that a more competent worker should have a higher probability of correctly labelling an item, and thus a higher impact on the true label.
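To make this parametrisation concrete, the following sketch (an illustrative reconstruction, not the original DS implementation) runs the EM iteration implied by the description: the E-step weighs each candidate true label by the class prior $\tau$ and the workers' confusion-matrix entries, and the M-step re-estimates $\tau$ and the per-worker confusion matrices from those weights.

```python
import numpy as np

def dawid_skene_em(labels, n_workers, n_items, K=2, n_iter=50):
    """labels: list of (worker i, item j, label y) with y in {0, ..., K-1}.
    Returns an (n_items, K) array of posteriors P(z_j = k)."""
    # Initialise the posteriors with per-item vote fractions (a soft majority vote).
    counts = np.zeros((n_items, K))
    for i, j, y in labels:
        counts[j, y] += 1
    post = (counts + 1e-9) / (counts + 1e-9).sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # M-step: class prior tau_k and confusion matrices v[i, k, l] = P(y = l | z = k).
        tau = post.mean(axis=0)
        v = np.full((n_workers, K, K), 1e-9)
        for i, j, y in labels:
            v[i, :, y] += post[j]
        v /= v.sum(axis=2, keepdims=True)

        # E-step: P(z_j = k) proportional to tau_k * prod_i v[i, k, y_ij], in log-space.
        log_post = np.tile(np.log(tau), (n_items, 1))
        for i, j, y in labels:
            log_post[j] += np.log(v[i, :, y])
        log_post -= log_post.max(axis=1, keepdims=True)
        post = np.exp(log_post)
        post /= post.sum(axis=1, keepdims=True)
    return post
```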

An extension of the DS model, named LFC, was proposed in [25], where a Dirichlet prior was added to the confusion matrix and the categorical distribution. The estimation of the latent true label is performed similarly as in the DS model. Additionally, in [7], each worker has one confusion matrix per query, and it is possible to have different reliability scores for different tasks. A drawback of this method is that it cannot fully capture a worker's answering capability, due to the fact that each worker's probability is modelled as a single value.

A Bayesian extension of the LFC model was introduced in [18], which explores the idea of combining the predictions of many different classifiers that are not required to be probabilistic. This method needs priors on the parameters and assumes that the classifier outputs are independent, given the true label $z_j$. They used Gibbs sampling to sample $Z$, $\tau$ and $V$ from $P(Z, \tau, V \mid Y, \alpha, \beta)$, and then find the maximum estimate.

Variational Bayesian inference [29] outperformed the aforementioned model, being computationally more efficient and having higher prediction accuracy. Nevertheless, the downside of this method is that for many classification methods it is not possible to compute the marginal likelihood, and that not all classifiers assume the same priors or observe the same data, making it challenging for the model to deal with such cases. Another challenging aspect is the assumption that the classifiers are independent, which is rarely the case and results in low performance. The EBCC model, which is an extension of iBCC, uses different confusion matrices according to the difficulty of each data point for classification. However, it does not show a significant performance improvement.

Whitehill et al. [35] proposed the GLAD model (Generative model of Labels, Abilities and Difficulties) in 2009. GLAD is the first probabilistic model that jointly infers the true label of the item, the expertise of each worker $\alpha_i \in (-\infty, +\infty)$, and the difficulty of each item $1/\beta_j \in [0, \infty)$.

The probability that the label given to an item by a worker is the same as the latent true label, given the worker's expertise and the item's difficulty, is generated as follows:

$$P(y_{ij} = z_j \mid \alpha_i, \beta_j) = \frac{1}{1 + e^{-\alpha_i \beta_j}} \quad (2.1)$$


$$P(y_{ij}) = \begin{cases} \rho_{ij}, & \text{if } y_{ij} = z_j \\[1mm] \dfrac{1 - \rho_{ij}}{K - 1}, & \text{if } y_{ij} \neq z_j \end{cases} \quad (2.2)$$

According to the model, the log-odds of a retrieved label being the same as the true label depend on the worker's expertise and the item's difficulty. The more experienced a worker is, the higher the probability of their labelling an item correctly. Similarly, the more difficult an item is, the higher the probability that it will be labelled incorrectly, i.e.:

$$\rho_{ij} = \log \frac{P(y_{ij} = z_j)}{1 - P(y_{ij} = z_j)} = \alpha_i \beta_j \quad (2.3)$$

The GLAD model constitutes a strong baseline, since it correctly infers the true label $z_j$ even when the $z_j$ value is the minority option. Moreover, it presumes that when a worker has high expertise on a task, their vote should have more weight on this particular task compared to less skilled workers. In combination with measuring task difficulty, it improves the ratio of correctly labelled items compared to other methods. However, this process has some limitations, as it does not scale well to large datasets. That is due to its ability to handle only one difficulty parameter per task and one expertise parameter per worker.

In our approach, we extend the GLAD model by integrating the correlation between workers and items into the inference of the true label, to improve label quality control.

2.2 Correlation approaches

Ruvolo et al. in [28] propose a probabilistic model that automatically exploits both the commonality among different workers and the interaction effects between workers and data, using a latent product-of-factors approach. This method is an extension of the GLAD and DS models, and it uses worker and instance features of very low dimension (one to three dimensions). Although there is an improvement over the pre-existing models, the baseline selection seems arbitrary and there is no variance estimation.

A new minimax lower bound analysis theory is proposed in [15], where a minimax error rate is derived under a more practical setting for a broader class of crowdsourcing models, including the DS model as a special case. They also propose a worker clustering model, which is a slightly improved extension of the DS model. In [33], each worker is modelled using several kinds of side information such as bias, expertise and competence. This allows for the integration of worker information valuable for defining loss functions and for assigning workers into different groups depending on their expertise and on how they perceive each item.

2.3 Inference method

Inferring the true label is a process that varies among the aforementioned models. Three main inference methods are used: Expectation-Maximization [14], Gibbs sampling [31] and expectation propagation. Generally, most of the existing works adopt one of these inference methods. In DS [8], the inference method is Expectation-Maximization. Here, inference has an analytic solution, since the likelihood function is an exponential function. The confusion matrix consists of $N \times M \times K$ values, given that there are $M$ workers that label $N$ items into $K$ categories.

Expectation-Maximization is also used in the LFC model [25], where an analytic solution is used for $\alpha$ and $\beta$, but gradient descent is used for $W$. For the proposed models iBCC, dBCC and eBCC [18], the inference is a combination of EM and Gibbs sampling. Some parameters in the form of an exponential function have an analytical posterior and can thus be Gibbs sampled easily.

The GLAD model uses the EM algorithm as well, with a likelihood containing a logit function rather than an exponential function in the E-step. In the M-step it uses gradient descent. There are $N \times K$ possible values when we calculate the posterior of $Z$.

Jin et al. in [17] use EM and Gibbs sampling to infer the true label. The process alternates between collapsed Gibbs sampling for the latent variables, given the current estimates of the model parameters, and a one-step gradient update of the model parameters given the sampled latent variables. [28] uses EM and conjugate gradient descent for $\alpha$, $\beta$ and $w$. Finally, Jin et al. in [16] use Gibbs sampling and gradient descent to incorporate prior knowledge and capture the relatedness of response categories alongside variables for $\alpha_i$, $\beta_j$ and $w$. The analytic solution alternates between: (1) collapsed Gibbs sampling for inferring the true labels of the items $L$, given the current estimates of the other model parameters, which include $\alpha_i$, $\beta_j$ and the relatedness matrix $S$; and (2) limited-memory BFGS, run until convergence, for updating these parameters given the current assignment of $L$.


Chapter 3

Methodology

In this Chapter we present the model that we propose in order to solve the task of inferring the true label from crowdsourcing data, by leveraging correlation information.

The GLAD model [35] assumes that the expertise of each worker is the same for every item and that the difficulty of each item is the same for every worker. However, this is not pragmatic, because different workers have diverse background knowledge across different items, meaning that they are more familiar with some items than with others. Conversely, items that require a certain level of domain knowledge and worker expertise for correct annotation are less likely to be given the right label than easier items. Our hypothesis is based upon the intuition that we need a more fine-grained modelling approach for worker expertise and item difficulty. For that reason, we extend the GLAD model, which models worker expertise $\alpha$ and item difficulty $\beta$, by considering a correlation matrix $\gamma$ between workers and items. Section 3.1 introduces the GLAD probabilistic model. In Sections 3.2 and 3.3 we introduce the probabilistic model and the inference part of our method, respectively.

3.1 Preliminary: GLAD Probabilistic Model

The baseline of our approach is modelled by Whitehill et al. in [35]. Considering $N$ items $j \in \{1, 2, ..., N\}$ labelled by $M$ workers $i \in \{1, 2, ..., M\}$ into $K$ categories, the goal is to determine the true class label $z_j$ of item $j$. As mentioned in Section 2.1, the generative process of the labels given by worker $i$ to item $j$ is given by:

$$P(y_{ij} = z_j \mid \alpha_i, \beta_j) = \frac{1}{1 + e^{-\alpha_i \beta_j}} \quad (3.1)$$

The graphical representation of GLAD is shown in Figure 3.1, where we see how it models the true labels $z_j$, worker expertise $\alpha_i$, item difficulty $\beta_j$ and the labels $y_{ij}$ given by worker $i$ to item $j$. The goal of this model is to accurately infer $Z$, $\alpha$ and $\beta$ given the observed data. For that purpose they use Expectation-Maximization [14]: first, in the E-step, they calculate the posterior probabilities of $z_j$; then, in the M-step, they find the values of $\alpha$ and $\beta$ that maximize the auxiliary function $Q$, using gradient ascent. Function $Q$ is the expectation of the joint log-likelihood of the true label and the observed label, given the worker expertise and item difficulty. It is clear from Figure 3.1 that this method considers the parameters of interest independent from each other.

FIGURE 3.1: Graphical Representation of the GLAD model (item difficulties $\beta_j$, true labels $z_j$, observed labels $y_{ij}$, worker expertise $\alpha_i$).

3.2 Probabilistic Model

As stated above, our approach aims to capture the correlation between worker expertise and item difficulty, in order to infer more accurately the true label of an item. For that purpose, we extend the GLAD model which already considers the two aforementioned variables, and thus can be used as a strong baseline.

Let us consider again $N$ items $j \in \{1, 2, ..., N\}$ labelled by $M$ workers $i \in \{1, 2, ..., M\}$ into $K$ categories. The goal is to determine the true class label $z_j$ of item $j$. The probability of the label given by worker $i$ to item $j$ is:

$$P(y_{ij}) = \begin{cases} \dfrac{1}{1 + e^{-\alpha_i \beta_j \gamma_{ij}}}, & \text{if } y_{ij} = z_j \\[2mm] \dfrac{1}{K-1}\left(1 - \dfrac{1}{1 + e^{-\alpha_i \beta_j \gamma_{ij}}}\right), & \text{if } y_{ij} \neq z_j \end{cases} \quad (3.2)$$

As explained in Section 2.1, worker expertise is modelled by the parameter $\alpha_i \in (-\infty, +\infty)$, with $\alpha \to +\infty$ denoting a highly expert worker who always labels the items correctly, and $\alpha \to -\infty$ a worker who always labels the items incorrectly. $\alpha < 0$ corresponds to an adversarial worker, and $\alpha = 0$ to a worker who cannot distinguish which class is the correct one.

The difficulty of each item $j$ is modelled by the parameter $1/\beta_j \in [0, \infty)$, and $\beta_j$ is always constrained to be positive. If $1/\beta \to 0$, the item is very easy, and it is likely that most workers will label it correctly. If $1/\beta \to \infty$, then the item is very difficult, and it is less likely that many workers will label it correctly.

The correlation matrix $\gamma_{ij} \in [0, \infty)$ is also constrained to be positive. If $\gamma_{ij} \to \infty$, then worker $i$ and item $j$ are highly correlated. If $\gamma_{ij} = 0$, they are not correlated.

The joint distribution of $Y, Z$ is the following:

$$P(Y, Z) = \prod_{j=1}^{N} P(z_j) \prod_{i=1}^{M} P(y_{ij} \mid z_j) \quad (3.3)$$

Using the above equation, we can compute the likelihood function of the observed data $Y$ given the parameters $\alpha$, $\beta$ and $\gamma$, as shown below:

$$P(Y \mid \alpha, \beta, \gamma) = \prod_{j=1}^{N} \sum_{k=1}^{K} P(z_j = k) \cdot \prod_{i=1}^{M} \left(\frac{1}{1 + e^{-\alpha_i \beta_j \gamma_{ij}}}\right)^{I(y_{ij} = z_j)} \left(\frac{1}{K-1}\left(1 - \frac{1}{1 + e^{-\alpha_i \beta_j \gamma_{ij}}}\right)\right)^{I(y_{ij} \neq z_j)} \quad (3.4)$$

The next step is to take the log of Equation (3.4), so that we can calculate the log-likelihood of the observed data $Y$, which we later want to maximize. This is given by:

$$\ln P(Y \mid \alpha, \beta, \gamma) = \sum_{j=1}^{N} \ln\left[\sum_{k=1}^{K} P(z_j = k) \prod_{i=1}^{M} \left(\frac{1}{1 + e^{-\alpha_i \beta_j \gamma_{ij}}}\right)^{I(y_{ij} = z_j)} \cdot \left(\frac{1}{K-1}\left(1 - \frac{1}{1 + e^{-\alpha_i \beta_j \gamma_{ij}}}\right)\right)^{I(y_{ij} \neq z_j)}\right] \quad (3.5)$$

The likelihood function of the observed and hidden variables is calculated as shown below:

$$P(Y, Z \mid \alpha, \beta, \gamma) = \prod_{j=1}^{N} P(z_j) \cdot \prod_{i=1}^{M} \left(\frac{1}{1 + e^{-\alpha_i \beta_j \gamma_{ij}}}\right)^{I(y_{ij} = z_j)} \left(\frac{1}{K-1}\left(1 - \frac{1}{1 + e^{-\alpha_i \beta_j \gamma_{ij}}}\right)\right)^{I(y_{ij} \neq z_j)} \quad (3.6)$$

We want to use maximum likelihood estimation to learn the parameters $\alpha$, $\beta$ and $\gamma$, but this is not directly tractable: the derivative needed for MLE is not feasible because of the summation inside the log function. Instead, we use the Expectation-Maximization algorithm, which we explain in the next Section (3.3). Equation 3.7 is used in order to calculate the auxiliary $Q$ function in the M-step; it is the log of Equation 3.6 shown above. Consequently, we now have:

FIGURE 3.2: Graphical Representation of the GAMMA model (item difficulties $\beta_j$, true labels $z_j$, observed labels $y_{ij}$, worker expertise $\alpha_i$, worker-item correlation $\gamma_{ij}$).

$$\ln P(Y, Z \mid \alpha, \beta, \gamma) = \sum_{j=1}^{N}\left[\ln P(z_j) + \sum_{i=1}^{M} \ln\left(\frac{1}{1 + e^{-\alpha_i \beta_j \gamma_{ij}}}\right)^{I(y_{ij} = z_j)} + \sum_{i=1}^{M} \ln\left(\frac{1}{K-1}\left(1 - \frac{1}{1 + e^{-\alpha_i \beta_j \gamma_{ij}}}\right)\right)^{I(y_{ij} \neq z_j)}\right] \quad (3.7)$$

3.3 Inference

For the inference part, our goal is to accurately search for the most probable values of the unobserved variables, using the Expectation-Maximization algorithm, which obtains the estimates of the aforementioned parameters.

As explained above, we assume that the priors of $\alpha$ and $\beta$ are drawn from a normal Gaussian distribution with $\mu = 1$ and $\sigma = 1$. In that case, the EM algorithm is used to find MAP (maximum a posteriori) solutions for our model. In the E-step we calculate the posterior of $Z$, and in the M-step we update with gradient ascent the auxiliary function $Q$, which is the expectation of the joint log-likelihood (Equation 3.7) over the posterior distribution obtained from Equation 3.8.

E-Step: Here we calculate the posterior probability of all the latent true variables $Z$, given the values of $\alpha$, $\beta$ and $\gamma$ obtained in the last M-step, as well as the given label values $Y$.

In our case, we have only two distinct labels, 0 or 1, thus $K = 2$ and $z_j \in \{0, 1\}$. It is not assumed that every worker labels each and every item, so we define the set of labels given to item $j$ as $\mathbf{y}_j = \{y_{ij} \mid i \in I_j\}$, where $I_j$ denotes the set of indices of the workers who labelled item $j$.

The posterior probability is then given by:

$$P(z_j \mid \mathbf{y}, \alpha, \beta, \gamma) = P(z_j \mid \mathbf{y}_j, \alpha, \beta_j, \gamma_j) \propto P(z_j \mid \alpha, \beta_j, \gamma_j)\, P(\mathbf{y}_j \mid z_j, \alpha, \beta_j, \gamma_j) \propto P(z_j) \prod_{i} P(y_{ij} \mid z_j, \alpha_i, \beta_j, \gamma_{ij}) \quad (3.8)$$

In Figure 3.2 we see the graphical representation of the proposed GAMMA model; the conditional independence allows us to write $P(z_j \mid \alpha, \beta_j, \gamma_{ij}) = P(z_j)$, and thus replace $P(z_j \mid \alpha, \beta_j, \gamma_{ij})$ with $P(z_j)$ in the last line of Equation 3.8.

We now denote the priors for the two classes:

$$P(z_j = 0) = \pi_{i0}, \qquad P(z_j = 1) = \pi_{i1} \quad (3.9)$$
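A minimal sketch of this E-step for the binary case is shown below (an illustrative reconstruction with assumed data layout, reusing the label_probability helper from the earlier sketch): for each item, the unnormalised posterior multiplies the class prior by the likelihood of every observed label, as in Equation 3.8, and is then normalised.

```python
def e_step(item_labels, alpha, beta, gamma, priors=(0.5, 0.5)):
    """item_labels: dict j -> list of (i, y_ij) observed labels for item j.
    alpha[i], beta[j] are parameter arrays; gamma is a dict keyed by (i, j).
    Returns a dict j -> (P(z_j = 0 | Y), P(z_j = 1 | Y))."""
    posteriors = {}
    for j, obs in item_labels.items():
        unnorm = []
        for z in (0, 1):
            p = priors[z]
            for i, y in obs:
                p *= label_probability(y, z, alpha[i], beta[j], gamma[(i, j)], K=2)
            unnorm.append(p)
        total = sum(unnorm)
        posteriors[j] = tuple(u / total for u in unnorm)
    return posteriors
```

When an item has many labels, the products should be accumulated in log-space to avoid numerical underflow.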

M-Step: In this step we obtain the values of $\alpha$, $\beta$, $\gamma$ given the posteriors from the E-step, by maximizing the auxiliary function $Q$. As described earlier, $Q$ is the expectation of the joint log-likelihood of observed and hidden variables (Equation 3.7) over the posterior distribution obtained from Equation 3.8. The $Q$ function is a lower bound of Equation 3.5, and since it is not possible to maximize Equation 3.5 directly, we maximize its lower bound. For this purpose we use gradient ascent and find at each M-step the values of $\alpha$, $\beta$, $\gamma$ that maximize $Q$.

$$Q(\alpha, \beta, \gamma) \triangleq E\big[\ln P(\mathbf{y}, \mathbf{z} \mid \alpha, \beta, \gamma)\big] = E\left[\ln \prod_j P(z_j) \prod_i P(y_{ij} \mid z_j, \alpha_i, \beta_j, \gamma_{ij})\right] = \sum_j E\big[\ln P(z_j)\big] + \sum_{ij} E\big[\ln P(y_{ij} \mid z_j, \alpha_i, \beta_j, \gamma_{ij})\big] \quad (3.10)$$

where the expectation is taken over the posterior of $z$ (Equation 3.8), and the last step holds because the $y_{ij}$ are conditionally independent given $z$, $\alpha$, $\beta$, $\gamma$.

$$E_{P(Z|Y)}\big[\ln P(y_{ij} \mid z_j)\big] = E_{P(z_1|Y)} E_{P(z_2|Y)} \cdots E_{P(z_N|Y)}\big[\ln P(y_{ij} \mid z_j)\big] = E_{P(z_j|Y)}\big[\ln P(y_{ij} \mid z_j)\big] = P(z_j = 1 \mid Y)\,\ln P(y_{ij} \mid z_j = 1) + P(z_j = 0 \mid Y)\,\ln P(y_{ij} \mid z_j = 0) \quad (3.11)$$

where the posterior probabilities $P(z_j \mid Y)$ come from Equation 3.8.

We use Kronecker’s delta function which returns 1 if the given label is equal to the true label, and 0 otherwise:

$$\delta \triangleq I\{y_{ij} = z_j\}$$

And define the sigmoid function:

$$\sigma \triangleq \frac{1}{1 + e^{-\alpha_i \beta_j \gamma_{ij}}} \quad (3.12)$$

We know that the partial derivative of the sigmoid function is given by:

$$\frac{\partial}{\partial x}\sigma(x) = \frac{\partial}{\partial x}\left(\frac{1}{1 + e^{-x}}\right) = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \frac{1}{1 + e^{-x}}\left(1 - \frac{1}{1 + e^{-x}}\right) = \sigma(x)\,(1 - \sigma(x)) \quad (3.13)$$

Using Equations 3.12 and 3.6, $P(y_{ij} \mid z_j, \alpha, \beta, \gamma)$ obtained from the E-step can be written as:

$$P(y_{ij} \mid z_j, \alpha, \beta, \gamma) = I\{y_{ij} = z_j\}\,\sigma + I\{y_{ij} \neq z_j\}\left(\frac{1 - \sigma}{K - 1}\right) \quad (3.14)$$

Moreover, $\ln P(y_{ij} \mid z_j, \alpha, \beta, \gamma)$, which is part of Equation 3.7, is defined as:

$$\ln P(y_{ij} \mid z_j, \alpha, \beta, \gamma) = \delta \ln \sigma + (1 - \delta)\big(\ln(1 - \sigma) - \ln(K - 1)\big) \quad (3.15)$$

The next step is to compute the gradients in order to find the values of $\alpha$, $\beta$, $\gamma$ that locally maximize the auxiliary function $Q$.

$$\begin{aligned} \frac{\partial Q}{\partial \alpha_i} &= \sum_{ij}\Big[\underbrace{P(z_j = 1 \mid Y)}_{\text{term (1)}} \cdot \frac{\partial}{\partial \alpha_i}\ln P(y_{ij} \mid z_j = 1, \alpha_i, \beta_j, \gamma_{ij}) + \underbrace{P(z_j = 0 \mid Y)}_{\text{term (2)}} \cdot \frac{\partial}{\partial \alpha_i}\ln P(y_{ij} \mid z_j = 0, \alpha_i, \beta_j, \gamma_{ij})\Big] \\ &= \sum_{j \in W_i} \frac{\partial}{\partial \alpha_i}\Big[\delta \ln \sigma + (1 - \delta)\big(\ln(1 - \sigma) - \ln(K - 1)\big)\Big] \\ &= \sum_{j \in W_i} \delta(\sigma - 1)\, e^{\beta_j} e^{\gamma_{ij}} + (1 - \delta)\,\sigma\, e^{\beta_j} e^{\gamma_{ij}} \\ &= \sum_{j \in W_i} e^{\beta_j} e^{\gamma_{ij}}\,(\sigma - \delta) \end{aligned} \quad (3.16)$$

We obtain the same expression for terms (1) and (2), and since these two posterior weights sum to 1, we merge them. The same applies to the next two gradient calculations in Equations 3.17 and 3.18.

Moreover, $\sum_{j \in W_i}$ denotes the summation over the items $j$ annotated by worker $i$.

$$\begin{aligned} \frac{\partial Q}{\partial \beta_j} &= \sum_{ij}\Big[P(z_j = 1 \mid Y) \cdot \frac{\partial}{\partial \beta_j}\ln P(y_{ij} \mid z_j = 1, \alpha_i, \beta_j, \gamma_{ij}) + P(z_j = 0 \mid Y) \cdot \frac{\partial}{\partial \beta_j}\ln P(y_{ij} \mid z_j = 0, \alpha_i, \beta_j, \gamma_{ij})\Big] \\ &= \sum_{i \in E_j} \frac{\partial}{\partial \beta_j}\Big[\delta \ln \sigma + (1 - \delta)\big(\ln(1 - \sigma) - \ln(K - 1)\big)\Big] \\ &= \sum_{i \in E_j} \delta(\sigma - 1)\,\alpha_i e^{\beta_j} e^{\gamma_{ij}} + (1 - \delta)\,\sigma\,\alpha_i e^{\beta_j} e^{\gamma_{ij}} \\ &= \sum_{i \in E_j} \alpha_i e^{\beta_j} e^{\gamma_{ij}}\,(\sigma - \delta) \end{aligned} \quad (3.17)$$


where $\sum_{i \in E_j}$ denotes the summation over the workers $i$ that annotated example $j$.

$$\begin{aligned} \frac{\partial Q}{\partial \gamma_{ij}} &= \sum_{ij}\Big[P(z_j = 1 \mid Y) \cdot \frac{\partial}{\partial \gamma_{ij}}\ln P(y_{ij} \mid z_j = 1, \alpha_i, \beta_j, \gamma_{ij}) + P(z_j = 0 \mid Y) \cdot \frac{\partial}{\partial \gamma_{ij}}\ln P(y_{ij} \mid z_j = 0, \alpha_i, \beta_j, \gamma_{ij})\Big] \\ &= \frac{\partial}{\partial \gamma_{ij}}\Big[\delta \ln \sigma + (1 - \delta)\big(\ln(1 - \sigma) - \ln(K - 1)\big)\Big] \\ &= \delta(\sigma - 1)\,\alpha_i e^{\beta_j} e^{\gamma_{ij}} + (1 - \delta)\,\sigma\,\alpha_i e^{\beta_j} e^{\gamma_{ij}} \\ &= \alpha_i e^{\beta_j} e^{\gamma_{ij}}\,(\sigma - \delta) \end{aligned} \quad (3.18)$$
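The sketch below is a literal transcription of the final expressions in Equations 3.16-3.18 into code, one gradient contribution per observed label. It rests on two assumptions that are not stated explicitly in the text: $\beta$ and $\gamma$ are stored in log-space so that the $e^{\beta_j} e^{\gamma_{ij}}$ factors are well-defined positive quantities, and $\delta$ is replaced by the posterior probability that the observed label is correct (the expectation taken inside $Q$). Signs follow the derivation as printed and should be checked against a numerical gradient of $Q$ before use.

```python
import numpy as np

def m_step_gradients(labels, alpha, log_beta, log_gamma, post_correct):
    """Gradient contributions transcribed from Eqs. 3.16-3.18 (see assumptions above).

    labels       : list of (i, j, y_ij) crowd labels
    alpha        : (M,) array of worker expertise values
    log_beta     : (N,) array; exp(log_beta[j]) plays the role of e^{beta_j}
    log_gamma    : dict (i, j) -> value; exp(.) plays the role of e^{gamma_ij}
    post_correct : dict (i, j) -> P(z_j = y_ij | Y), standing in for delta
    """
    g_alpha = np.zeros_like(alpha)
    g_beta = np.zeros_like(log_beta)
    g_gamma = {}
    for i, j, _y in labels:
        eb, eg = np.exp(log_beta[j]), np.exp(log_gamma[(i, j)])
        sigma = 1.0 / (1.0 + np.exp(-alpha[i] * eb * eg))
        delta = post_correct[(i, j)]
        g_alpha[i] += eb * eg * (sigma - delta)               # Eq. 3.16
        g_beta[j] += alpha[i] * eb * eg * (sigma - delta)     # Eq. 3.17
        g_gamma[(i, j)] = alpha[i] * eb * eg * (sigma - delta)  # Eq. 3.18
    return g_alpha, g_beta, g_gamma
```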

In the E-step, the computational complexity is linear in the total number of labels and the number of items. For the M-step, we compute the value of $Q$ and its gradient at each step until convergence. Convergence is reached when the likelihood difference between two consecutive steps falls below a certain threshold, for example $10^{-4}$, which we empirically found to be optimal.

Finally, if we want to add a prior for the parameters $\alpha$, $\beta$ and $\gamma$, we add an extra log-prior term for the corresponding parameters to the $Q$ function; this is the only difference in the optimization process when a prior is used. The prior is drawn from a normal Gaussian distribution with mean 1 and variance 1.


Chapter 4

Experimental Setup

In this Chapter we first analyse the datasets that were used. In Subsection 4.1.1 we explain how the synthetic data were generated, and in Subsection 4.1.2 which real-world datasets we selected and how we modified them for the needs of this thesis project.

Next, in Section 4.2 we go through the methods we have defined as our baselines.

Finally, in Section 4.3 we present the metrics that we have used in order to evaluate the results of our method.

4.1 Datasets

A great variety of crowdsourcing datasets is publicly available, containing information such as the crowd labels given to the items and the ground truth labels, along with information about the worker, the document text and the query text. We describe the selected datasets that satisfied our needs, as well as the method we used in order to simulate workers and annotated items for model verification.

4.1.1 Synthetic Datasets

In this section we generate synthetic datasets in order to explore the model's ability to recover the ground truth and learn the correlations needed to infer the true label. Overall, the data is generated in three steps:

1. We generate the 𝑁 items.

2. We generate the 𝑀 workers.

3. We sample a number of workers for each item and then we generate the crowd labels.

In each step we make corresponding assumptions. More specifically, during the first step the generation of items is based upon the assumption of generating balanced data using a Gaussian distribution with mean 1 and variance 1. Then we generate the item difficulty parameter $\beta_j \in [0, +\infty)$. During the second step, which generates the workers, we use two assumptions: the first is that most of the items are labelled by only a few workers (to emphasize the variation in different workers' expertise); the other is that the workers follow an exponential distribution. Worker expertise $\alpha_i \in (-\infty, +\infty)$ is sampled from a Gaussian distribution with mean 1 and variance 1.


Dataset | Explanation
Pseudo_300 | Pseudo data with 300 labels (M=5, N=100), created according to Algorithm 4.1.
Pseudo_300_noise | When $\gamma$ is very low ($\gamma$ < 0.5), add large noise to the crowd label. When $\gamma$ is very high ($\gamma$ > 100), add small noise to the crowd label.
Pseudo_1050 | Pseudo data with 1050 labels (M=5, N=350), created according to Algorithm 4.1.
Pseudo_1050_noise | When $\gamma$ is very low ($\gamma$ < 0.5), add large noise to the crowd label. When $\gamma$ is very high ($\gamma$ > 100), add small noise to the crowd label.
Pseudo_1500 | Pseudo data with 1500 labels (M=5, N=500), created according to Algorithm 4.1.
Pseudo_1500_noise | When $\gamma$ is very low ($\gamma$ < 0.5), add large noise to the crowd label. When $\gamma$ is very high ($\gamma$ > 100), add small noise to the crowd label.

TABLE 4.1: An overview of the Pseudo dataset configurations

The reason we use a mean equal to 1 is based upon the presumption that most of the workers usually give incorrect labels, and only a few of them are very competent. In order to generate the pseudo crowd labels we use the sigmoid function (Equation 3.12).

Moreover, we assume $K$ classes, which are generated by a Gaussian mixture model of $K$ multi-dimensional Gaussian probability distributions, since it is an unsupervised way to normally distribute clusters of labels within the overall dataset using soft classification. In our case we only use two label classes, 0 or 1, focusing on decision-making tasks.

After sampling the true label, which allows us to know the value of the parameters $Z$, $\alpha$ and $\beta$, we sample the parameters of worker competence and item difficulty as well as the crowd label. To indicate the sparsity of the crowd annotation, we use a parameter that defines the number of workers per annotated item.

In Algorithm 4.1 we present this process in a more detailed manner.

We use three different configurations of the synthetic pseudo data. In each one, the number of workers is M=5 and the number of items varies among N=100, N=350 and N=500. The sparsity of the crowd annotation is equal to 3 workers labelling each item. That leads to three synthetic dataset configurations with 300, 1050 and 1500 labels respectively.

The annotation quality of each dataset configuration is controlled by adding noise to the crowdsourced labels when the value of the sigmoid function (Equation 3.12) is close to 0 (i.e. adding large noise). Conversely, we add small noise for any factor that makes this sigmoid function close to 1. For example, if we sample the crowd label from a Gaussian distribution with a very high mean (e.g. a mean of 100), then the probability that the crowd label is the same as the true label is close to 1, and we then add small noise for $\gamma$.

We want to compare the crowd labels with the corresponding true labels using the EM algorithm and then infer the true label. With these synthetic dataset configurations, we run the baselines as well as the GAMMA model and further explore the performance of the different models.
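The generation procedure described above can be sketched as follows (an illustrative, simplified version under the stated assumptions, not the exact Algorithm 4.1; the $\gamma$-dependent noise variants of the _noise configurations are omitted): worker expertise and item difficulty are sampled from the stated Gaussians, a fixed number of workers is assigned to each item, and each crowd label is drawn from the sigmoid probability of Equation 3.12.

```python
import numpy as np

def generate_pseudo_data(n_workers=5, n_items=100, workers_per_item=3, seed=0):
    """Returns (labels, z, alpha, beta, gamma); labels is a list of (i, j, y_ij)."""
    rng = np.random.default_rng(seed)
    z = rng.integers(0, 2, size=n_items)                  # roughly balanced binary true labels
    alpha = rng.normal(1.0, 1.0, size=n_workers)           # worker expertise ~ N(1, 1)
    beta = np.abs(rng.normal(1.0, 1.0, size=n_items))      # item difficulty parameter, kept positive
    gamma, labels = {}, []
    for j in range(n_items):
        for i in rng.choice(n_workers, size=workers_per_item, replace=False):
            gamma[(i, j)] = np.abs(rng.normal(1.0, 1.0))    # worker-item correlation, kept positive
            p_correct = 1.0 / (1.0 + np.exp(-alpha[i] * beta[j] * gamma[(i, j)]))
            y = z[j] if rng.random() < p_correct else 1 - z[j]
            labels.append((int(i), j, int(y)))
    return labels, z, alpha, beta, gamma
```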


FIGURE 4.1: Synthetic Dataset Generation.

4.1.2 Real-World Datasets - Manual labels

As an additional step, we used real-world datasets that contain crowdsourcing labels as well as the ground truth labels. There are two types of real-world datasets that are being evaluated: (a) manual crowdsourcing labels and (b) a mixture of manual and automatic crowdsourcing labels [32].

The first dataset type contains crowdsourcing labels as well as the ground truth labels. We use two datasets named cs2010 [1] and cs2011 [2]. All data folders contain the values of topic_ID, worker_ID, item_ID, true_label and crowd_label.

We further create various subsets of the two aforementioned datasets; we use random samples of smaller size, proportional to the sample size of the pseudo data configurations, for better comparison. That is, a random sample of 360 labels and another random sample of 650 labels for each dataset. The reasoning behind this is to compare the performance with regard to different annotation budgets (i.e. annotation and data size) of the cs2010 and cs2011 datasets. We do not study the impact of annotation quality on real-world data, only in the case of synthetic data. Table 4.2 shows the different configurations of the two datasets cs2010 and cs2011.

4.1.3 Real-World Datasets - Mixture of manual and automatic labels

For the second dataset type, we generated 7 new datasets based on the TREC (Text REtrieval Conference) ad-hoc tracks and web search tracks (TREC-5, TREC-6, TREC-7, TREC-8, TREC-9, TREC-10, TREC-11). In this task, all workers that participate in TREC are given a large collection of unlabelled documents along with a set of user queries and assess the relevance of the retrieved documents. In our project, the idea is to take the official query relevances (qrel) as the ground truth labels, and take the runs submitted by different teams as crowdsourcing labels.


Dataset | Explanation | #workers (M), #items (N)
cs2010 | Full cs2010 dataset with 20K labels | M=722, N=3267
cs2010_random_sample_360 | Random sample of cs2010 with 360 labels | M=56, N=77
cs2010_random_sample_650 | Random sample of cs2010 with 650 labels | M=197, N=593
cs2011 | Full cs2011 dataset with 2K labels | M=181, N=710
cs2011_random_sample_360 | Random sample of cs2011 with 360 labels | M=71, N=301
cs2011_random_sample_650 | Random sample of cs2011 with 650 labels | M=99, N=217

TABLE 4.2: An overview of the cs2010, cs2011 dataset configurations

Dataset | Explanation | #workers (M), #items (N)
trec5_10topics | TREC-5 dataset subset of 10 topics | M=61, N=610
trec6_10topics | TREC-6 dataset subset of 10 topics | M=74, N=740
trec7_10topics | TREC-7 dataset subset of 10 topics | M=103, N=1030
trec8_10topics | TREC-8 dataset subset of 10 topics | M=129, N=1290
trec9_10topics | TREC-9 dataset subset of 10 topics | M=104, N=1040
trec10_10topics | TREC-10 dataset subset of 10 topics | M=97, N=970
trec11_10topics | TREC-11 dataset subset of 10 topics | M=69, N=690
trec5_random | TREC-5 subset dataset random sample | M=61, N=798
trec6_random | TREC-6 subset dataset random sample | M=74, N=793
trec7_random | TREC-7 subset dataset random sample | M=103, N=747
trec8_random | TREC-8 subset dataset random sample | M=129, N=996
trec9_random | TREC-9 subset dataset random sample | M=104, N=989
trec10_random | TREC-10 subset dataset random sample | M=97, N=600
trec11_random | TREC-11 subset dataset random sample | M=69, N=597

TABLE 4.3: An overview of the TREC5-11 dataset configurations

Due to the large size of these datasets (approximately 1 million labels), we used subsets of 10 topics within each TREC dataset, as well as random samples of approximately 1000 labelled items. In Table 4.3 we can see the setup of the TREC datasets in more detail.

4.2 Baselines

Here we discuss the methods that we used as baselines for the task of inferring the true label. The baseline methods were briefly introduced in Section 2.1. A more thorough view of each specific method is presented here, along with the results of their comparison. Table 4.4 gives an overview of all the methods tested on the aforementioned datasets.

The four methods will be denoted as EM (Dawid & Skene [8]), MV (Majority Voting [13]), GLAD [35], and our newly proposed GAMMA model.

4.2.1 Majority Voting (MV)

This naive method is based on the majority vote per item. It is used as a proof of concept, as it takes the answer given by the majority of workers as the truth. As explained in Section 2.1, its main drawback is the assumption that all worker contributions are equal, making it unreliable for real cases where worker levels may vary from an adversarial low-level worker to a very competent high-level worker who carefully labels each item.


Method | Merit | Drawback | Reference
Majority Voting (MV) | Proof of concept | Assumes each worker contributes the same | [13], [34], [3]
Dawid & Skene (EM) | First work on aggregating worker labels and inferring the true label | Naive method | [8]
Generative model of Labels, Abilities, and Difficulties (GLAD) | Correctly infers the true label; exploits worker expertise; exploits item difficulty | Handles one parameter per item/worker; slowest among baselines | [35]
GAMMA model | Leverages correlation information | Not scalable to large datasets | This thesis

TABLE 4.4: An overview of the methods used in this report.


4.2.2 Dawid & Skene (EM)

This method, introduced in [8], uses Expectation-Maximization as its inference method. In the E-step it uses a uniform probability distribution to update the weights, and in the M-step it calculates the maximum likelihood estimates for the items of interest until convergence.

4.2.3 Generative model of Labels, Abilities, and Difficulties (GLAD)

The GLAD model [35] also uses the EM algorithm to infer the true label, taking into account the values of worker expertise $\alpha_i$ and item difficulty $\beta_j$. The likelihood function contains logit (sigmoid) functions, and in the M-step gradient descent is used to update the values of $\alpha$ and $\beta$.

4.3 Evaluation Metrics

The evaluation of label aggregation in terms of improved classification performance is measured with various metrics. In this work, we use Accuracy, F1-Score and Mean Absolute Error.

Accuracy measures the model's ability to infer the true label. It is defined as the fraction of items whose true label is inferred correctly. Given a method, we denote the inferred truth of item $j$ as $\hat{y}_j^*$. The accuracy is then given by:

$$\text{Accuracy} = \frac{\sum_{j=1}^{N} I\{\hat{y}_j^* = y_j^*\}}{N}$$

where $N$ is the number of items.

We also use the F1-Score to measure the performance of the model in cases where the number of 0 labels is much larger than the number of labels equal to 1 (e.g. the TREC 5-11 datasets). In such cases, even a very simplistic approach that always labels the items as 0 obtains a very high Accuracy. However, we are interested in the positive labels (i.e. label 1), and for that reason the F1-Score is more indicative. It is defined as the harmonic mean of Precision and Recall, and given by the following formula:

$$\text{F1-Score} = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}} \quad (4.1)$$

where precision is the fraction of relevant instances among the retrieved instances, given by:

$$\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \quad (4.2)$$

and recall is the fraction of the relevant documents that are successfully retrieved, given by:

$$\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \quad (4.3)$$

For the second experiment we compare the obtained values of $\alpha$, $\beta$ and $\gamma$ with the corresponding ground truth values, and evaluate the results using the Mean Absolute Error (MAE):

$$\text{MAE} = \frac{\sum_{j=1}^{N} |y_j^* - \hat{y}_j^*|}{N}$$

Moreover, we examine how stable the MAE results are by measuring the variance of the MAE for each model.
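For completeness, the three metrics can be computed with a few lines of code (a hedged sketch with assumed array inputs; f1_score assumes binary labels with 1 as the positive class):

```python
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def f1_score(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def mae(values_true, values_est):
    return float(np.mean(np.abs(np.asarray(values_true) - np.asarray(values_est))))
```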


Chapter 5

Results and Discussion

In this chapter we evaluate our newly proposed method, and answer RQ1 by comparing its performance with the performance of three existing methods [13], [34], [3], [35], [8]. We investigate the impact of a qualification test on the performance of each method, to answer RQ2. Then, we study the impact of different annotation budgets on the performance of our method on both synthetic and real-world datasets, and we examine how different annotation quality affects the performance of our model in the case of synthetic data, to answer RQ3. Finally, we study how well our model learns its parameters of interest compared with the GLAD baseline, and answer RQ4.

We perform a number of experiments on various synthetic and real-world datasets, and further analyse the findings.

This chapter is split into four sections, each one dedicated to the results of one experiment and their analysis with respect to the research questions asked in Chapter 1.

5.1 True Label Inference

In this first experimental setup, we measure the performance of the proposed model with regards to inferring the latent true label. This will help us answer RQ1.

For each dataset, there are $k$ collected answers per item, and the total number of items is $N$. In each step, we randomly select a number $r \in [1, k]$, which means that we select $r$ out of the $k$ collected answers for each item. Then, we create a dataset that contains the randomly selected number of answers $r$ per item, for the total number of items $N$. Finally, each created dataset has $r \cdot N$ answers.
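As an illustration of this subsampling step (a sketch with an assumed data layout, not the thesis code), the helper below keeps $r$ of the collected answers for every item, so that the resulting dataset has $r \cdot N$ answers.

```python
import random

def subsample_answers(item_answers, r, seed=None):
    """item_answers: dict item_id -> list of (worker_id, label) with k answers each.
    Keeps r of the collected answers for every item (at most the number available),
    so the subsampled dataset contains roughly r * N answers in total."""
    rng = random.Random(seed)
    return {item: rng.sample(answers, min(r, len(answers)))
            for item, answers in item_answers.items()}

# One repetition of the experiment: draw r uniformly in [1, k] and build the dataset.
# k = max(len(a) for a in item_answers.values()); r = random.randint(1, k)
```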

This experiment is repeated R = 20 times. We measure overall Accuracy and F1-Score after comparing the inferred truth of each method with the ground truth label. Their values range between [0,1]. The higher each value is, the better performance we achieve.

Tables 5.1 and 5.2 show the mean Accuracy and mean F1-Score of each method.

For brevity, we consider a selected number of the synthetic dataset configurations introduced in Subsection 4.1.1.

In Table 5.1, we notice that the proposed GAMMA model outperforms all the other methods. The same holds for the full cs2010 and cs2011 real-world datasets, as seen in Table 5.2.


Pseudo_1050 dataset | Accuracy | F1-Score
EM | 0.82257 | 0.80596
MV | 0.82366 | 0.82301
GLAD | 0.80419 | 0.78955
GAMMA | 0.83201 | 0.82344

TABLE 5.1: Accuracy and F1-Score of a selected synthetic dataset

cs2011 dataset | Accuracy | F1-Score
EM | 0.68492 | 0.79905
MV | 0.70001 | 0.80985
GLAD | 0.73999 | 0.82685
GAMMA | 0.76574 | 0.85908

TABLE 5.2: Accuracy and F1-Score of a selected real-world dataset

This verifies our original assumption, as introduced in RQ1, that capturing the correlation $\gamma$ among workers and items improves both Accuracy and F1-Score overall, regardless of the number of selected items or the number of given answers per item. In terms of efficiency, the GAMMA model is computationally more expensive than the other models. The reason for that is the model's complex inference implementation during the M-step, when computing the gradient of the auxiliary function $Q$. That makes it challenging to scale up to larger datasets compared to the other baseline methods.

As mentioned above, for every item we record the number of labels given for this specific item. Then, we randomly select a number within this range, as explained in Section 5.1, and observe the values of Accuracy and F1-Score for every different number of given answers per item; the F1-Score provides a better overview. The purpose of this experiment is to evaluate not only the efficiency, but also the quality of each method.

This leads to the conclusion, and therefore the answer to Research Question 1, that in both synthetic and real-world datasets our proposed method outperforms the existing methods in terms of Accuracy and F1-Score for the task of inferring the true label.

5.2 Qualification test

In this section we examine the performance of each method when we make use of a qualification test, to answer RQ2. The purpose of the qualification test is to evaluate how accurately each model infers the true label, as well as how correctly it learns the parameters of interest. On real-life crowdsourcing platforms such as Amazon Mechanical Turk [4], each worker is requested to first answer a number of items which have a known ground truth value. This way we are able to estimate her answering performance and her quality for every item. However, this is not the case with the datasets used in this project. We exclude the naive approach of Majority Voting from this experiment due to its assumption that each worker has the same quality, which does not reflect the purpose of this experiment.

We want to estimate the quality of the models on new data. To address this problem we split the complete dataset into K subfolders. Each subfolder has a random sample of workers' answers for one item. There are cases where some items might not have answers from all workers; to solve this we sample with replacement using bootstrap sampling [11], which is able to approximate the sampling distribution of the $\alpha$ values. In the end, we have K subfolders, each containing a random sample of given labels for different items. The reasoning behind this is the following: if we assume a worker has performed some golden tasks before answering real tasks, and we initialize the worker's quality based on the worker's answering performance on those golden tasks, will this increase the quality of each method?

The same holds for the qualification test on the $\beta$ values, where we use a random sample of items annotated by one worker. We chose K=20 folders, because it gives the best results without running into issues due to the computational expensiveness of our model.
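A sketch of how such qualification subfolders could be built is shown below (an assumed construction consistent with the description above, not the exact thesis procedure): each of the K folders is a bootstrap sample, drawn with replacement, of the observed labels.

```python
import random

def make_qualification_folders(labels, n_folders=20, seed=0):
    """labels: list of (worker_id, item_id, label) triples.
    Returns n_folders bootstrap samples (drawn with replacement), each the same
    size as the original label set, to be used as qualification subfolders."""
    rng = random.Random(seed)
    return [rng.choices(labels, k=len(labels)) for _ in range(n_folders)]
```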

Finally, we repeat the experiment R = 20 times. Each time, we run GLAD and GAMMA on every subfolder of the dataset, and in the end we average the Accuracy and F1-Score. We present the overall Accuracy and F1-Score values of each method, which are calculated after comparing the inferred label with the ground truth label on all complete datasets. The results on synthetic datasets are shown in Table 5.3. Real-world dataset results can be found in Table 5.4.

In Table 5.3, on synthetic datasets, it can be seen that all three methods (EM, GLAD, GAMMA) benefit from the qualification test compared to the results of the previous experiment (Table 5.1). That is due to the sparsity of the annotations, which is set to 3 workers answering each task. Such low sparsity requires a qualification test in order to better initialize the quality of the workers. The GAMMA model presents higher Accuracy and F1-Score in terms of inferring the true label, compared with EM and GLAD, in all variations of annotation budget and annotation quality. Moreover, we notice that adding noise to the crowd label, and thus changing the annotation quality of the data, benefits all methods for N=100, but not for N=350.

The results for the real-world datasets can be found in Table 5.4. We observe that the assumption that a qualification test improves inference of the true label does not always bring significant improvement. The sparsity of the annotations, which measures the number of workers per annotated item, often takes large values as it is not equally distributed across workers. This could mean that the dataset allows the quality of the worker (or the item) to be detected more efficiently in an unsupervised way, without the need for a qualification test. For the complete cs2010 and cs2011 datasets, the GAMMA model is once again inferring the true label more accurately than the baselines.

Taking these results into consideration, we observe once again that the newly introduced GAMMA model infers the true label more accurately overall than the existing methods, which also answers RQ1.

As introduced in subsection 4.1.3, we used variations of the TREC datasets, which are real-world datasets with a mixture of manual and automatic annotations. Regarding the experimental results, we observe that in both Experiment 1 (Section 5.1) and Experiment 2 (Section 5.2) we always obtain almost perfect accuracy, but the F1-Score is close to zero. We assume that this is because when we sample some topics, we happen to sample documents with only zero values, i.e. we sample only the "irrelevant" documents.


Pseudo_1050 dataset    Accuracy    F1-Score
EM                     0.84965     0.85601
GLAD                   0.82489     0.82738
GAMMA                  0.86857     0.87073

TABLE 5.3: Accuracy and F1-Score of a selected synthetic dataset using qualification test.

cs2011 dataset    Accuracy    F1-Score
EM                0.70499     0.81103
GLAD              0.76440     0.86573
GAMMA             0.76627     0.86601

TABLE 5.4: Accuracy and F1-Score of a selected real-world dataset using qualification test.

Thus, the model predicts the label of every item as zero, resulting in an accuracy of 1 and a zero F1-Score, since there are no positive labels and recall is 0. This case study does not reflect the properties of the GAMMA model, since it is infeasible to evaluate how well the model learns the parameters of the dataset.
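The effect can be reproduced with a few lines (using scikit-learn metrics; the numbers are illustrative, not taken from the experiments):

from sklearn.metrics import accuracy_score, f1_score

# Every sampled document is "irrelevant" (label 0) and the model also predicts 0 everywhere.
y_true = [0, 0, 0, 0, 0]
y_pred = [0, 0, 0, 0, 0]

print(accuracy_score(y_true, y_pred))             # 1.0
print(f1_score(y_true, y_pred, zero_division=0))  # 0.0, since no positive class is present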

5.3 Impact of annotation quality and annotation budget

In this section, we examine the impact on each method's performance when we change the number of workers' answers per task. We want to examine the impact of different annotation budgets by testing different numbers of annotations, and we visualize the results for all four methods. We run the experiment on synthetic (subsection 4.1.1) and real-world (subsection 4.1.2) datasets, using a noise-free and a with-noise version of each dataset configuration. In this way we observe whether annotation quality helps increase the accuracy of inferring the true label. This will allow us to answer RQ3.

We generated several subsets of both synthetic and real-world datasets, with varying annotation quality and annotation budget. We used samples of different (𝑀, 𝑁) size, with noise-free and with-noise versions of each sample, in order to evaluate the impact of annotation quality on each model. We used small-scale data of at most 20 thousand labels, since the complexity of our model's inference method makes it computationally expensive to scale up to very large datasets. The annotation quality is based on the assumption that when the 𝛾 value is very low, we add large noise to the crowd labels, and vice versa. The results are shown in Table 5.6. We see that on this particular dataset the GAMMA model, as well as the other methods, benefits from the addition of noise.
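As a hedged illustration of how such a with-noise variant of a dataset can be generated, the sketch below flips binary crowd labels with a probability derived from an assumed 𝛾 value; the exact mapping from 𝛾 to a flip probability is an assumption for illustration, not the procedure used to build the datasets.

import numpy as np

rng = np.random.default_rng(seed=42)

def add_label_noise(labels, gamma):
    """labels: array-like of 0/1 crowd labels; gamma in (0, 1], lower gamma means more noise."""
    labels = np.asarray(labels)
    flip_prob = 1.0 - gamma                      # e.g. gamma = 0.8 flips roughly 20% of labels
    flips = rng.random(labels.shape) < flip_prob
    return np.where(flips, 1 - labels, labels)

# Example: a noise-free sample and its with-noise counterpart.
clean = rng.integers(0, 2, size=300)             # e.g. M=5 workers x N=60 items
noisy = add_label_noise(clean, gamma=0.8)
print((clean != noisy).mean())                   # empirical flip rate, close to 0.2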

The synthetic data include a noise-free and a with-noise version for every set size (N=100, N=350, N=500), with the number of workers fixed at M=5. That leads to datasets of 300, 1050 and 1500 labels respectively. Figures 5.1 and 5.2 show the results on the synthetic datasets.

As observed in Table 5.8, when we use a qualification test, none of these methods benefits from the addition of noise, which contradicts the results of Table 5.6, which shows the results of the first experimental setup.


FIGURE 5.1: Experiment 1 - Accuracy on Synthetic Datasets. Panels: (a) pseudo data (M=5, N=100); (b) pseudo data (M=5, N=100) with noise; (c) pseudo data (M=5, N=350); (d) pseudo data (M=5, N=350) with noise; (e) pseudo data (M=5, N=500); (f) pseudo data (M=5, N=500) with noise.

Moreover, it turns out that the efficiency of the proposed model is influenced by small-scale datasets of different annotation budget and annotation quality, an issue that was raised in RQ3.

Table 5.7 shows that when we use a qualification test, the impact of an increased number of annotations on performance is not the same in all cases, probably due to the sparsity of the annotations or the different dataset configurations.

Figures 5.2 and 5.3 provide a visual representation of Accuracy and F1-Score on a selected number of noise-free and with-noise real-world datasets of different annotation budgets. As seen in Figures 5.1 and 5.2, the proposed GAMMA model performs significantly better than the existing methods on all the variations of annotation quality and annotation budget. The left column of the figures contains the noise-free version of each synthetic configuration, and the right column contains the with-noise version of the same configuration.


[Figure panels: (a) cs2010 full data (M=722, N=3267); (b) cs2010 data, random sample of 650 labels (M=56, N=77); (c) cs2011 full data (M=181, N=710); (d) cs2011 data, random sample of 650 labels (M=99, N=217)]

For the case of the real-world data, we see from Figures ?? and 5.3 that there is once again an overall superiority of the novel GAMMA model compared to the baselines, for every different number of given answers per task (RQ3).

To summarize, by using the GAMMA model on small-scale datasets, we examine the impact of the annotation budget on the different data configurations and observe the effect of annotation quality on label aggregation, in order to answer RQ3. With respect to the annotation budget, the proposed GAMMA model gives the best results in both Accuracy and F1-Score for all different numbers of answers per task, when compared to the other methods. It is also visible that increasing the number of annotations is not always beneficial for the model, on both synthetic and real-world datasets. Table 5.5 shows the scores obtained for all the different configurations of synthetic and real-world datasets. This provides more insight into the performance and the quality of each method when we use less balanced datasets of larger scale.


FIGURE 5.2: Experiment 1 - F1 Score on Synthetic Datasets. Panels: (e) pseudo data (M=5, N=100); (f) pseudo data (M=5, N=100) with noise; (g) pseudo data (M=5, N=350) with noise; (h) pseudo data (M=5, N=350) with noise; (i) pseudo data (M=5, N=350); (j) pseudo data (M=5, N=500) with noise.

Dataset                    EM                    MV                    GLAD                  GAMMA
                           Accuracy  F1-Score    Accuracy  F1-Score    Accuracy  F1-Score    Accuracy  F1-Score
Pseudo_300                 0.76516   0.77497     0.78433   0.80662     0.74516   0.76831     0.78816   0.77786
Pseudo_1050                0.82257   0.80596     0.82366   0.82301     0.80419   0.78955     0.83201   0.82344
Pseudo_1500                0.74057   0.74331     0.71955   0.72601     0.7308    0.75383     0.75071   0.75664
cs2010                     0.64289   0.70470     0.62255   0.70411     0.58717   0.70184     0.65976   0.74165
cs2010_rand_samp_360       0.75038   0.84956     0.87571   0.93407     0.89922   0.94517     0.88909   0.93840
cs2010_rand_samp_650       0.56235   0.68678     0.59789   0.67841     0.59877   0.67792     0.61011   0.70465
cs2011                     0.68492   0.79905     0.70001   0.80985     0.73999   0.82685     0.76574   0.85908
cs2011_rand_samp_360       0.73014   0.84341     0.66162   0.78548     0.65514   0.78072     0.66171   0.78454
cs2010_rand_samp_650       0.73393   0.83303     0.72849   0.83768     0.71888   0.83215     0.78447   0.85761

TABLE 5.5: Experiment 1 - Results for different annotation budget on synthetic and real-world datasets.


FIGURE 5.3: Experiment 1 - F1 Score on Real-World Datasets. Panels: (a) cs2010 full data (M=722, N=3267); (b) cs2010 data, random sample of 650 labels (M=56, N=77); (c) cs2011 full data (M=181, N=710); (d) cs2011 data, random sample of 650 labels (M=99, N=217).

Dataset             EM                    MV                    GLAD                  GAMMA
                    Accuracy  F1-Score    Accuracy  F1-Score    Accuracy  F1-Score    Accuracy  F1-Score
Pseudo_300          0.76516   0.77497     0.78433   0.80662     0.74516   0.76831     0.78816   0.77786
Pseudo_300_noise    0.79331   0.72934     0.78680   0.73383     0.78250   0.70191     0.79379   0.74069

TABLE 5.6: Experiment 1 - Results of different annotation quality on selected synthetic dataset (M=5, N=100).

Dataset                    EM                    GLAD                  GAMMA
                           Accuracy  F1-Score    Accuracy  F1-Score    Accuracy  F1-Score
Pseudo_300                 0.78714   0.81186     0.74238   0.76372     0.82381   0.84160
Pseudo_1050                0.84965   0.85601     0.82489   0.82738     0.86857   0.87073
Pseudo_1500                0.97846   0.78903     0.74437   0.75009     0.77475   0.78173
cs2010                     0.70222   0.72404     0.56098   0.70724     0.70405   0.73041
cs2010_rand_samp_360       0.67903   0.80881     0.98453   0.99219     0.93135   0.96441
cs2010_rand_samp_650       0.59375   0.70519     0.59696   0.67849     0.61792   0.70165
cs2011                     0.70499   0.81103     0.76440   0.86573     0.76627   0.86601
cs2011_rand_samp_360       0.84601   0.74261     0.77465   0.65052     0.64878   0.77231
cs2010_rand_samp_650       0.73557   0.83761     0.69629   0.81399     0.76081   0.86150

TABLE 5.7: Results for different annotation budget on synthetic and real-world datasets using qualification test.
