Combining people's rankings : introducing three extensions to the Thurstonian model

Academic year: 2021


Ravi Selker¹, Michael D. Lee²

¹ University of Amsterdam
² University of California, Irvine

Correspondence concerning this article should be addressed to: Ravi Selker, University of Amsterdam, Department of Psychological Methods, Weesperplein 4, 1018 XA Amsterdam, The Netherlands. E-mail: selker.ravi@gmail.com

Abstract

In this paper we introduce three extensions to the Thurstonian model. The Thurstonian model is a cognitive model that is used to combine people’s rankings of items by taking into account the process with which the individual rankings are generated (Lee, Steyvers, & Miller, 2014). Although this model performs well when people follow this process, it does not take into account common deviations from it. We developed three extensions to the Thurstonian model that deal with three sources of individual differences that the current model does not account for: 1) people differ in their expertise about every single item in the list, 2) people differ in their ability to retrieve items from memory, and 3) people have different opinions. Using both simulated and behavioral data, we put the performance of each extension to the test. By incorporating these extensions, the Thurstonian model can account for more sources of individual differences, thereby combining rankings more effectively.

Keywords: Thurstonian Model, Rank-Order, Wisdom of Crowds

People often collect information from different sources before making a decision. Different sources can provide information about different dimensions of a problem. Before making the decision to buy a house you might ask several friends for advice. But while one friend is a realtor and can provide you with information about the investment potential of the house, another friend lives in the neighborhood where the house is located and thus can provide you with information about the general living qualities in that neighborhood. Because both sources provide information about different dimensions of the problem, there is no evident way to combine the information into one sensible answer to your problem. Therefore, to combine information from different sources into one sensible answer it is important that the information focuses on one dimension of the problem. However, the information can still differ in its reliability. Before making the decision to organize a big neighborhood barbecue next week you might want to know whether it is going to rain that day. While both the next-door neighbor and the local weather channel provide you with information that is on the same dimension – is it going to rain next week, yes or no – it is likely that both sources of information differ in their reliability. Because the reliability of estimates of the true answer differs between people, each individual estimate is prone to response bias. By combining information from a diverse group of people, the individual response biases will cancel each other out and thus lead to the true answer; an idea that is known as “the wisdom of crowds” (Surowiecki, 2005).

Galton (1907) was one of the first to use this principle when he visited a cattle fair where a weight-judging competition was organized in which every contestant had to estimate the weight of an ox. At the end of the day, Galton collected all the tickets, wondering what would happen if he aggregated (i.e., combined) all the information on them. What he found was that the median of all the estimates was only 4 kg off from the real weight of the ox, closer than the estimates on any of the 787 single tickets. An interesting observation was that the tickets were bought by both experts (e.g. ranchers, butchers) and non-experts, and both groups contributed to the final aggregated answer. This introduced the idea that the true answer to a question lies in the aggregated estimates of a diverse group of people instead of the estimates of a small group of experts.

People can express their knowledge and beliefs in numerous ways. A common and natural way of expressing your knowledge is by ranking a list of items. When you are looking to buy a house you might want to rank the houses you visited before you make an offer on the house that is at the top of your list. An advantage of ranking a list of items is that you do not need explicit knowledge about the scale you are ranking them on. You do not need to know the exact land mass of both China and the Netherlands to be able to tell that China is a bigger country than the Netherlands. Because you only need implicit knowledge of the scale it is also easier to rank items on a scale that is somewhat vague. It is often easy to tell who is the better teacher, but it is more difficult to rate both teachers on a “teaching quality” scale. Another advantage is that rankings are less prone to certain response biases than explicit ratings like the Likert scale (Cheung & Chan, 2002; Van Herk, Poortinga, & Verhallen, 2004; Murphy, Jako, & Anhalt, 1993). We will focus on the aggregation of ranking data in this paper.

Most models that are used to aggregate data can be divided into two classes: statistical models and cognitive models. While statistical models purely describe the mathematical relation between the individual estimates and the aggregate, cognitive models describe the actual process with which individuals generate their estimates. A cognitive model has the advantage that it can easily incorporate deviations from the assumed process with which individuals generate their data, for example by modeling individual differences in memory-retrieval ability. A cognitive model also allows inferences about parameters other than the aggregated result that are still essential in the decision-making process. However, a key requirement of any model that is used to aggregate data is that it makes correct inferences about the aggregated result. Therefore, a cognitive model needs to perform just as well as or better than a statistical model to be more useful. A cognitive model that performs well for the aggregation of ranking data is the Thurstonian model (e.g. Steyvers, Miller, Hemmer, & Lee, 2009; Lee, Steyvers, De Young, & Miller, 2012; Lee et al., 2014).

Thurstonian Model

The Thurstonian model assumes that all ranked items can be placed on a latent (non-observable) continuous scale that describes the relative true position of each item (Thurstone, 1927). Individuals estimate the position of the items on this scale and subsequently base their rankings on these estimates. For instance, the landmass of every country can be placed on one continuous scale. When someone has to rank a number of countries, the model assumes that they will first make an internal estimate of the landmass of all countries and subsequently rank the countries accordingly. The closeness of the individual estimates to the true position depends on the individual’s expertise; a geography teacher is expected to be better at ranking countries based on their landmass than a random layman.

Figure 1 provides a clear illustration of the Thurstonian model. Panel A shows the true position of each item on the latent continuous scale – µ1, µ2, µ3. Every individual makes a latent estimate of this latent ground truth; this estimate depends on the latent ground truth, µi, and the standard deviation of the estimate of the individual, σj. As the standard deviation gets lower, the probability that an individual’s estimate will be close to the ground truth gets higher. Therefore, the standard deviation of an individual is an indication of the expertise of that person; Lee et al. (2012) found that a person’s standard deviation as inferred by the Thurstonian model is highly correlated with actual task performance. Panel B shows the process that leads to the latent estimates of individual 1 and panel C shows the process that leads to the latent estimates of individual 2. Because individual 1 has a lower standard deviation than individual 2, his latent estimates – x11, x21, x31 – are closer to the ground truth than the latent estimates – x12, x22, x32 – of individual 2. In the end, the differences in standard deviation lead to different observed rankings, yj. The Thurstonian model describes the underlying process that generates the data and is at the same time able to aggregate all individual rankings into one ranking: the ranking of the latent ground truths. We will use the Thurstonian model as the base model throughout this paper.

Lee et al. (2014) implemented this model using a Bayesian graphical modeling approach. A graphical model represents the probabilistic process of how latent variables generate the observed data. Using Bayesian inference we can invert this model and use the observed data to infer the most likely values for the model parameters. Figure 2 shows the Thurstonian model using graphical model notation (see Lee & Wagenmakers, 2013, for an in-depth explanation of the graphical modeling approach). The graphical model notation describes the dependencies between variables (i.e. the nodes). The shaded nodes represent the observed data while the non-shaded nodes represent the model parameters; the squares represent discrete variables while the circles represent continuous variables; the double-bordered nodes represent deterministic variables (completely dependent on the value of another variable) while the single-bordered nodes represent stochastic variables; and the plates represent the hierarchical structure of the variables. Figure 2 illustrates that the latent estimates, xij, originate from a Gaussian distribution with a mean that is equal to the latent ground truth of the item, and a standard deviation that is equal to the expertise of the individual. The individual observed rankings, yij, are subsequently acquired by ranking the latent estimates.

[Figure 1 appears here. Panel A shows the latent ground truth (µ1, µ2, µ3); panel B shows individual 1 (σ1, estimates x11, x21, x31) with observed ordering y1 = (1, 2, 3); panel C shows individual 2 (σ2, estimates x12, x22, x32) with observed ordering y2 = (1, 3, 2).]

Figure 1. Illustration of the Thurstonian model. Panel A shows the latent ground truths (µ1, µ2, µ3). Individual estimates of this truth depend on the standard deviation of an individual (σj). If an individual has a low standard deviation, like individual 1, the estimate of the latent truth (xij) is close to the latent truth, while if an individual has a higher standard deviation, like individual 2, the estimate of the latent truth is less close to the latent truth. By varying the standard deviation, the observed ordering (yj) can differ between individuals.

[Figure 2 appears here, showing the graphical model with nodes yij, xij, σj, µi and plates over the i items and j people, with the following specification:]

µi ∼ Gaussian(0, 0.001)
σj ∼ Uniform(0, 10)
xij ∼ Gaussian(µi, 1/σj²)
yij ← Rank(xij)
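The generative process that Figure 2 specifies can be sketched directly. The following Python snippet is our own illustrative translation of the model specification (the implementation discussed later in the paper uses JAGS); note that Gaussian(0, 0.001) is parameterized by precision, so the prior variance on µi is 1/0.001 = 1000.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items, n_people = 5, 3

# mu_i ~ Gaussian(0, 0.001): precision 0.001, i.e. variance 1000.
mu = rng.normal(0.0, np.sqrt(1000.0), size=n_items)

# sigma_j ~ Uniform(0, 10): one expertise value per person
# (a lower standard deviation means more expertise).
sigma = rng.uniform(0.0, 10.0, size=n_people)

# x_ij ~ Gaussian(mu_i, 1/sigma_j^2): latent estimates, where the
# precision 1/sigma_j^2 corresponds to standard deviation sigma_j.
x = mu[:, None] + sigma[None, :] * rng.standard_normal((n_items, n_people))

# y_ij <- Rank(x_ij): each person's observed ranking of the items
# (0 marks the item with the smallest latent estimate).
y = np.argsort(np.argsort(x, axis=0), axis=0)
```

Each column of `y` is one person's observed ranking; inverting the model with Bayesian inference recovers `mu` and `sigma` from `y` alone.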

Although the Thurstonian base model has been shown to perform well for certain rank aggregation problems (Lee et al., 2014), the set of rank aggregation problems for which the model is suitable is still very limited. First of all, the Thurstonian base model only works when every individual ranks all items in the whole set of possible items. It becomes increasingly difficult to rank all items as the number of items increases. For example, it might be possible to give a sensible ranking of the ten biggest countries, but after that it becomes more difficult to differentiate between countries. If you nonetheless forced people to rank all countries, this would introduce a lot of unnecessary noise into the individual latent estimates. A solution would be to let people rank only the top N highest ranked items on a list, where N can differ between people. To accommodate this type of data, we developed a first extension of the Thurstonian base model: the Thurstonian top N model.

Secondly, the Thurstonian base model only works when all individuals get to see the predefined list containing all possible items before ranking them. In a lot of prediction tasks – where the true answer is not (yet) known – it is impossible to come up with such a predefined list. For example, if you want people to predict the top ten largest companies by revenue for the next year, the list of possible companies is almost endless. If you were to make a predefined list of companies, you would already exclude companies that could be in the true ranking. By not including a predefined list of possible items, another step is added to the decision-making process: remembering an item. Someone might fail to rank an item simply because they did not remember it. To account for this extra step in the decision-making process we developed a second extension of the Thurstonian base model: the Thurstonian memory model.

Lastly, the Thurstonian base model only works for problems where you can assume that all people base their ranking on a single latent truth. This is a stringent assumption, even for questions where it is feasible to assume that there is a single latent truth. Let’s say you want people to rank the ten countries with the biggest population, a question that has a single latent truth. Most people would base their ranking on the actual population size of every country, but there might also be a small group that bases their ranking on a faulty heuristic strategy like the land mass of every country. If you were to assume that all rankings followed the same decision-making process, the rankings of the people who used the faulty heuristic strategy would be assumed to derive from the same latent truth, which would cloud the latent truth that is inferred by the model. Another case in which it would not be feasible to assume that all people base their ranking on a single truth is when there are actually multiple latent truths. Let’s say you are interested in who the best teacher is at your university. While some people might prefer teachers who are very enthusiastic about their curriculum, others might prefer teachers who are very lenient graders. In this case, you would genuinely have two latent truths. To make the Thurstonian model suitable for these types of data, we developed a third extension of the Thurstonian base model: the Thurstonian shared truth model.

The outline of the paper is as follows. In the first section, we will introduce the Thurstonian top N model and discuss how the model accommodates top N data. Using both simulated data and real data, we will also compare the performance of the Thurstonian top N model with a statistical model that is often used to aggregate rank data: the Borda Count (Marden, 1996). In the second section, we will introduce the Thurstonian memory model and discuss how the model accommodates data for which the memory process had to be used. We will compare this model to the Thurstonian top N model using both simulated data and real data. In the third section, we will introduce the Thurstonian shared truth model and discuss how the model accommodates data for which the assumption of a single latent truth is too stringent. We will illustrate the performance of the model using both simulated data and real data. In the last section of the paper we will discuss the limitations of the three models, and will provide some possible extensions to deal with these limitations.

Extension 1: Thurstonian Top N model

The Thurstonian base model only works with data where every individual has ranked all the items in the whole set of possible items. However, as the set of possible items gets bigger, it becomes increasingly difficult to give a sensible ranking to all items. Therefore, we developed an extension to this model that works with data for which individuals only ranked the top N highest ranked items: the Thurstonian top N model. Before we can discuss the changes that were made to the Thurstonian base model to arrive at the Thurstonian top N model, it is important to understand in a little more detail how the Thurstonian base model uses the ranking data to infer individual latent estimates.

The Thurstonian base model is able to infer values for the individual latent estimates because these estimates are censored by the observed rankings (Johnson & Kuhn, 2013). The way somebody ranked a list of items tells you something about its relative position compared to all the other items. Let’s assume someone ranked five items from highest ranked to lowest ranked – y11, y21, y31, y41, y51 – that correspond to five latent appraisals – x11, x21, x31, x41, x51. Because item y21 is ranked lower than item y11 but higher than item y31, the model assumes that the latent estimate x21 has to lie in the interval [x11, x31]; the value of the parameter is censored because of its place in the ranking. Equation 1 shows how these restrictions are applied to all items in this particular set. Because there is no item that is ranked higher than the highest ranked item, y11, the corresponding latent estimate x11 is not censored on the left, and because there is no item that is ranked lower than the lowest ranked item, y51, the corresponding latent estimate x51 is not censored on the right.

−∞ < x11 < x21
x11 < x21 < x31
x21 < x31 < x41
x31 < x41 < x51
x41 < x51 < ∞        (1)


When people are allowed to rank only the top N highest ranked items, all the items that are left unranked by an individual can be viewed as ranked lower than the lowest ranked item. You can apply the same censoring principle to this type of data as to fully ranked lists. Let’s assume someone ranked three items from highest ranked to lowest ranked and left two items unranked – y11, y21, y31, NA, NA. Because items y41 and y51 are left unranked, you can assume that the corresponding latent estimates, x41 and x51, have to be ranked lower than the lowest ranked item, y31. Equation 2 shows how these restrictions are applied to all items in this particular set.

−∞ < x11 < x21
x11 < x21 < x31
x21 < x31 < ∞
x31 < x41 < ∞
x31 < x51 < ∞        (2)
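These restrictions can be read off mechanically from an observed ranking. The sketch below is ours (the function name and data layout are not from the paper); recall that on this latent scale a higher-ranked item corresponds to a smaller latent value, so items "ranked lower" are censored from below.

```python
def censoring_bounds(ranked, unranked):
    """Censoring intervals implied by a (possibly partial) ranking.

    ranked   : item labels from highest ranked to lowest ranked.
    unranked : item labels left unranked via the top N process.
    Returns {item: (lower, upper)}, where each bound is the item whose
    latent estimate censors this one, or None for -inf / +inf.
    """
    bounds = {}
    for pos, item in enumerate(ranked):
        lower = ranked[pos - 1] if pos > 0 else None
        upper = ranked[pos + 1] if pos + 1 < len(ranked) else None
        bounds[item] = (lower, upper)
    lowest = ranked[-1] if ranked else None
    for item in unranked:
        # Unranked items lie beyond the lowest ranked item (Equation 2).
        bounds[item] = (lowest, None)
    return bounds

# Equation 2's example: items 1-3 ranked, items 4 and 5 left unranked.
print(censoring_bounds([1, 2, 3], [4, 5]))
# → {1: (None, 2), 2: (1, 3), 3: (2, None), 4: (3, None), 5: (3, None)}
```

With an empty `unranked` list the same function reproduces the fully ranked case of Equation 1.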

To accommodate top N data, the Thurstonian top N model uses the explicit censoring restrictions implied by the individual rankings to infer the latent ground truth. To test the performance of this new extension to the base model, we compared it with the Borda Count method, a statistical model for aggregating rankings. The Borda Count method aggregates rankings by assigning points to all items according to their ranked position. Each item receives a number of points equal to the total number of items minus the ranking position of that item. Thus, the highest ranked item receives the most points, while the lowest ranked item receives the fewest points. An item that is left unranked receives zero points. In the end, the total number of points for every item is calculated, after which the list of items is ranked based on these points. In order to be a useful model, the Thurstonian top N model should at least perform on par with the Borda Count method. Therefore, we compared the performance of the Thurstonian top N model with the Borda Count method using both simulated data and behavioral data.
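As a concrete reference point, the Borda Count scheme just described is only a few lines of code (a minimal sketch; the function name and example data are ours):

```python
def borda_count(rankings, items):
    """Borda Count over (possibly partial) rankings: an item at
    1-based position p earns len(items) - p points, and an unranked
    item earns 0. Returns the items from most to fewest points."""
    points = {item: 0 for item in items}
    for ranking in rankings:
        for pos, item in enumerate(ranking, start=1):
            points[item] += len(items) - pos
    return sorted(items, key=lambda item: points[item], reverse=True)

# Two top-N rankings over four possible items.
print(borda_count([["a", "b"], ["b", "c", "a"]], ["a", "b", "c", "d"]))
# → ['b', 'a', 'c', 'd']
```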

Simulation Study

We generated 100 datasets from the Thurstonian base model using the model specifications displayed in Figure 2. For every dataset, we first randomly drew 20 latent ground truth values from a normal distribution – µi ∼ Gaussian(0, 0.001) – and 10 expertise values from a uniform distribution – σj ∼ Uniform(0, 10). Subsequently, we used these values to randomly draw the individual latent estimates from a normal distribution – xij ∼ Gaussian(µi, 1/σj²) – and ranked these latent estimates to get the observed rankings, yij. To get the top N data, we then deleted the n lowest ranked items by randomly drawing a number for every individual ranking from a discrete distribution – nj ∼ Discrete(0 : 10). This means that for every individual ranking in the dataset, a maximum of 10 and a minimum of 0 ranked items were deleted. After generating the datasets, we drew posterior samples for every dataset from the Thurstonian top N model using the Bayesian sampling software JAGS (Plummer, 2003). We also derived the aggregated ranking for all 100 datasets using the Borda Count method.
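The deletion step that turns a full ranking into top N data can be sketched as follows (an illustration of the procedure, not the original simulation code):

```python
import random

random.seed(0)

def truncate_to_top_n(ranking, max_deleted=10):
    """Delete the n lowest ranked items, with n ~ Discrete(0:max_deleted),
    as in the top N process used in the simulation study."""
    n = random.randint(0, max_deleted)
    return ranking[:len(ranking) - n]

full_ranking = list(range(1, 21))          # a full ranking of 20 items
observed = truncate_to_top_n(full_ranking) # keeps only the highest ranked items
```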

(8)

Box 1: Kendall’s tau

Kendall’s tau is a statistic that measures the similarity of two rankings. The method compares the relative position of all possible item pairs in the ranking. If two items are in the same order in two rankings (e.g. in both rankings item 1 is ranked higher than item 2) the pair gets a penalty of 0, and if two items are in the reversed order (e.g. in one ranking item 1 is ranked higher than item 2, while in the other ranking item 2 is ranked higher than item 1) the pair gets a penalty of 1. The tau distance is equal to the sum of the penalties of all item pairs. A tau distance of 0 implies that two rankings are exactly the same, and a higher value implies that two rankings are more dissimilar. However, the regular Kendall’s tau only works for fully ranked lists. For partially ranked data (e.g. top N data) we need to use the partial Kendall’s tau (Fagin, Kumar, & Sivakumar, 2003; Fagin, Kumar, Mahdian, Sivakumar, & Vee, 2006). The partial Kendall’s tau is able to compare data with tied items. This method also compares the relative position of all possible item pairs. If two items are in the same order in both rankings, or tied in both rankings, the pair gets a penalty of 0; if two items are in reversed order the pair gets a penalty of 1; if two items are tied in one ranking but not in the other, the pair gets a penalty of 0.5. The partial tau distance is equal to the sum of the penalties of all item pairs.
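The penalty scheme in Box 1 translates directly into code. In this sketch (our own implementation, not from the cited papers), each ranking maps items to positions, and tied items – such as all unranked items in a top N list – simply share a position:

```python
from itertools import combinations

def partial_kendall_tau(rank_a, rank_b):
    """Partial Kendall's tau distance following Box 1: a pair in the
    same order (or tied) in both rankings costs 0, a reversed pair
    costs 1, and a pair tied in exactly one ranking costs 0.5.
    rank_a and rank_b map each item to its rank position."""
    distance = 0.0
    for i, j in combinations(rank_a, 2):
        tied_a = rank_a[i] == rank_a[j]
        tied_b = rank_b[i] == rank_b[j]
        if tied_a and tied_b:
            continue                  # tied in both: no penalty
        if tied_a != tied_b:
            distance += 0.5           # tied in exactly one ranking
        elif (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j]) < 0:
            distance += 1.0           # pair in reversed order
    return distance

# Identical rankings have distance 0; one swapped pair costs 1.
assert partial_kendall_tau({"a": 1, "b": 2}, {"a": 1, "b": 2}) == 0.0
assert partial_kendall_tau({"a": 1, "b": 2}, {"a": 2, "b": 1}) == 1.0
```

With no ties this reduces to the regular Kendall’s tau distance.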

We compared the Thurstonian top N model with the Borda Count method by comparing the inferred aggregated ranking of both models. To get the aggregated ranking from the Thurstonian top N model, we ranked the posterior means of the latent ground truths µi. We tested the performance of both models by comparing the inferred rankings with the true rankings – the rankings acquired by ranking the latent ground truths that were used to generate the 100 datasets – using the Kendall’s tau statistic. Kendall’s tau is a frequently used method to compare the similarity of two rankings (Kendall, 1938); the lower the tau distance, the more similar the rankings are (for a more elaborate explanation of this statistic, see Box 1). For each dataset, we compared both the Thurstonian top N ranking and the Borda Count ranking with the true ranking using Kendall’s tau. The lower the tau distance, the better the performance of a model.

Figure 3 shows the tau distance between the Thurstonian top N ranking and the true ranking plotted against the tau distance between the Borda Count ranking and the true ranking for all 100 datasets; this can be interpreted as a comparison between the performance of both models, where the line represents equal performance of the two models. Because the dots are all very close to the line, we can conclude that the Thurstonian top N model performs on par with the Borda Count method for these datasets. Next, we will compare the performance of both models using data that was gathered by letting people rank the most anticipated movies of 2013.

[Figure 3 appears here: a scatter plot of Kendall’s τ for the top N model against Kendall’s τ for the Borda Count, with both axes ranging from roughly 60 to 120.]

Figure 3. A comparison between the performance of the Thurstonian top N model and the Borda Count method. A hundred datasets were generated from the Thurstonian top N model. Subsequently, the Kendall’s tau statistic was calculated to check how similar both the predicted ranking of the top N model and the predicted ranking of the Borda Count method were to the true ranking that was used to generate the data.

Predicting Movie Grosses Using the Top N Model

To test the Thurstonian top N model on actual behavioral data, we acquired a dataset from the website www.ranker.com. Ranker.com is a website that specializes in gathering online rankings on a diverse number of questions, both prediction-based and opinion-based. The website gathers rankings by posing a question and letting users make their top N ranking. To give their personal top N ranking, the users can choose from a predetermined list of items. Because most users only rank a certain number of items, the data from this website are very suitable to be analyzed with the Thurstonian top N model. We chose a dataset in which 28 people ranked the top N “most anticipated movies of 2013”. After removing the movies that were not released in 2013, we ended up with 51 ranked movies.

We drew posterior samples for the latent ground truth parameters using the Thurstonian top N model. Figure 4 shows the marginal posterior distributions of the latent ground truths, µi, of the 20 highest ranked movies. The posterior distributions provide information about the relative position of all 20 movies and the uncertainty of every estimate. Because the rankings were gathered in 2012, the inferred true ranking can be interpreted as a prediction of the list of most popular movies in the next year. The performance of both models can be assessed by comparing the predicted ranking of both models with the true ranking: the popularity of the movies as indicated by their US gross income. To acquire the true ranking we looked up the gross income of all movies on the website www.imdb.com.

The Thurstonian top N model’s predicted ranking was slightly closer to the ranking of the movies’ US gross income (tau distance = 329¹) than the Borda Count method’s predicted ranking (tau distance = 339). Thus, for these behavioral data, the Thurstonian top N model performed slightly better than the Borda Count method.

¹ The Thurstonian model produces a distribution of tau distances. This number is the mean of that distribution.

[Figure 4 appears here, showing the model inferences for 20 movies: Kick-Ass 2, The Hangover Part III, Furious 6, Ender’s Game, Despicable Me 2, A Good Day to Die Hard, 47 Ronin, G.I. Joe: Retaliation, World War Z, Monsters University, The Lone Ranger, Oz: The Great and Powerful, The Hunger Games: Catching Fire, Pacific Rim, The Wolverine, Thor: The Dark World, Star Trek Into Darkness, The Hobbit: The Desolation of Smaug, Iron Man 3, and Man of Steel.]

Figure 4. The marginal posterior distributions of the 20 highest ranked latent ground truth parameters µi inferred by the Thurstonian top N model, using data acquired by gathering rankings for the question “what is the most anticipated movie of 2013?”.

Discussion

The comparison of the Thurstonian top N model and the Borda Count method using both simulated and behavioral data showed that the top N model performs on par with, or even slightly better than, the Borda Count method when confronted with top N data. Of course, we only looked at one behavioral dataset, and to be more confident in the performance of the Thurstonian top N model it is important to test the model on other behavioral datasets in the future. While the Borda Count is a statistical model, the Thurstonian top N model is a cognitive model. We prefer the Thurstonian top N model because it is a description of the actual decision-making process. Therefore, it can tell us something about parameters other than the predicted ranking (e.g. the expertise of every individual) and can also easily take into account deviations from the assumed decision-making process, something we will see in the next section where we introduce the Thurstonian memory model.

Extension 2: Thurstonian Memory Model

In the previous section, we extended the Thurstonian base model so it could work with top N data. However, this model still assumes that there is a predefined list of items to rank. It is often impossible to come up with such a predefined list in prediction tasks as the true answer is not (yet) known. The problem of ranking items without having a predefined list is that you first have to remember all items before you can try to rank them. To account for this extra step in the decision-making process we developed an extension to the Thurstonian top N model: the Thurstonian memory model.

Combining both the memory process and the top N process leads to an apparent problem concerning the items that were not ranked by an individual: was an item not ranked because the individual did not remember it (the memory process), or because they ranked it lower than the lowest ranked item (the top N process)? Both processes lead to different censoring restrictions. While the censoring restrictions of the top N process are described in the previous section (see Equation 2), we have not yet described how the memory process censors the latent estimates.

Let’s assume someone would have ranked five items from highest ranked to lowest ranked if he had remembered all of them. However, he forgot two items – y11 and y31 – which led to the ranking: NA, y21, NA, y41, y51. While y11 and y31 would have been ranked first and third if they had been remembered, they are now left unranked. Because he unintentionally left out these two items, you cannot infer any censoring restrictions about them. Equation 3 shows how these restrictions are applied to all items in this particular set. The difference between items that are left unranked because of the memory process and items that are left unranked because of the top N process is that the top N process provides information about the unranked items, while the memory process does not. Treating all unranked items as unranked due to the memory process could lead to a potentially big loss of information, while treating all unranked items as unranked due to the top N process could lead to biased information. A good model would be able to differentiate between these two processes. Therefore, we attempted to develop a model that is able to differentiate between the memory process and the top N process.

−∞ < x11 < ∞
−∞ < x21 < x41
−∞ < x31 < ∞
x21 < x41 < x51
x41 < x51 < ∞        (3)

Figure 5 shows a graphical representation of the Thurstonian memory model. Just as in the Thurstonian base model, it illustrates the Thurstonian process by which the latent ground truth leads to the individual rankings. However, this model also includes a parameter that infers whether an unranked item goes through the Thurstonian process and is left unranked because it was ranked lower than the lowest ranked item, mij = 1, or is unranked because the item was not remembered, mij = 0; all ranked items are assumed to be generated through the Thurstonian process. This parameter depends on the combination of a person-dependent parameter qj and an item-dependent parameter pi; some people are better at remembering items than others, and some items are more easily remembered than others. Through the use of a Rasch model equation (Bond & Fox, 2001), these two parameters are combined into a parameter that describes the total probability of remembering an item: θij.

The Thurstonian memory model describes a two-step process that infers 1) whether an item is remembered, and 2) how the remembered items lead to the observed ranking. The idea is that differentiating between the two different processes that lead to unranked items leads to a less biased latent ground truth. Therefore, when remembering the items in the list is a genuine issue, the Thurstonian memory model should perform better than the Thurstonian top N model. We will test this proposition using both simulated data and behavioral data.
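The Rasch-style combination of the item parameter pi and the person parameter qj into a memory probability θij, using the logistic form shown in Figure 5, amounts to (a sketch with our own function name):

```python
import math

def remember_probability(p_item, q_person):
    """theta_ij = e^(p_i - q_j) / (1 + e^(p_i - q_j)): the Rasch-style
    probability that person j remembers item i (Figure 5). Larger
    p_i - q_j means the item is more likely to be remembered; whether
    the item then enters the ranking is m_ij ~ Bernoulli(theta_ij)."""
    return 1.0 / (1.0 + math.exp(-(p_item - q_person)))

# A balanced pair gives a coin flip; a large gap saturates toward 0 or 1.
assert abs(remember_probability(0.0, 0.0) - 0.5) < 1e-12
assert remember_probability(5.0, -5.0) > 0.999
```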


The model specification shown in Figure 5 (with i indexing items and j indexing people) is:

pi ∼ Gaussian(−2, 1)
qj ∼ Gaussian(2, 1)
θij ← e^(pi − qj) / (1 + e^(pi − qj))
mij ∼ Bernoulli(θij)
µi ∼ Gaussian(0, 0.001)
σj ∼ Uniform(0, 10)
xij ∼ Gaussian(µi, 1/σj²) if mij = 1
yij ← Rank(xij) up to nj

Figure 5. Graphical representation of the Thurstonian memory model.

Simulation Study

We generated 300 datasets from the Thurstonian base model in the same way as in the simulation study of the Thurstonian top N model section. Subsequently, we divided the datasets into three groups of 100 datasets each, in which we varied the ratio of items that were not ranked because of the top N process to items that were not ranked because of the memory process. For the datasets in the first group we deleted the n lowest-ranked items, drawing a random number for every individual ranking from a discrete distribution, nj ∼ Discrete(0 : 10); these datasets only contained items that were not ranked because of the top N process. For the datasets in the second group we deleted the n lowest-ranked items, drawing nj ∼ Discrete(0 : 5), and we also deleted k random items, representing the process of forgetting random items, drawing kj ∼ Discrete(0 : 5); for these datasets, 50% of the unranked items were unranked because of the top N process and 50% because of the memory process. For the datasets in the third group we deleted k random items, drawing kj ∼ Discrete(0 : 10) for every individual ranking; these datasets only contained items that were not ranked because of the memory process. Although the datasets differed in the ratio of top N to memory unranked items, the total number of unranked items was the same. Table 1 gives an example of what the data look like before and after introducing the two types of unranked items.
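The two deletion schemes can be sketched as follows (a simplified illustration with hypothetical function names, not the authors' simulation code): the top N process truncates the bottom of a ranking, while the memory process removes random items.

```python
import random

def apply_missingness(ranking, n_drop, k_forget, rng):
    """ranking: items ordered best-to-worst.
    First drop the n_drop lowest-ranked items (top N process),
    then forget k_forget random remaining items (memory process)."""
    kept = ranking[:len(ranking) - n_drop] if n_drop > 0 else list(ranking)
    forgotten = set(rng.sample(kept, min(k_forget, len(kept))))
    return [item for item in kept if item not in forgotten]

rng = random.Random(0)
full = list(range(1, 11))                       # a full ranking of 10 items
top_n_only = apply_missingness(full, 5, 0, rng)  # first-group scheme
mixed = apply_missingness(full, 3, 3, rng)       # second-group scheme
```

In the top-N-only case the result is the intact top of the list; in the mixed case the surviving items are an ordered subset with random gaps, like the memory ranking in Table 1.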

Ranking          y1,1  y2,1  y3,1  y4,1  y5,1  y6,1  y7,1  y8,1  y9,1  y10,1
Full Ranking      1     2     3     4     5     6     7     8     9     10
Top N Ranking     1     2     3     4     5     NA    NA    NA    NA    NA
Memory Ranking    NA    1     2     NA    3     NA    NA    NA    4     5

Table 1: An example of the ranking of 10 items. The full ranking shows the ranking of the items before missing data was introduced, the top N ranking shows the ranking after top N missing data was introduced, and the memory ranking shows the ranking after memory missing data was introduced.

We drew posterior samples for the latent ground truth parameter using both the Thurstonian top N model and the Thurstonian memory model. We tested the performance of both models on all three groups of datasets by comparing the rankings inferred by each model with the true rankings using the tau distance. The results of this comparison are presented in Figure 6. The left figure shows how well both models performed for the different types of datasets. While the Thurstonian memory model performs equally well for all types of datasets, the Thurstonian top N model performs increasingly worse as more unranked items are unranked because of the memory process. The difference in performance is also apparent in the right figure. Here, the tau distance between the Thurstonian memory model and the true ranking is plotted against the tau distance between the Thurstonian top N model and the true ranking for all 300 datasets. The dots in the white area represent the datasets for which the top N model performed better, and the dots in the gray area represent the datasets for which the memory model performed better. The abundance of dots in the gray area shows that, across all 300 datasets together, the Thurstonian memory model performs better in general than the Thurstonian top N model.

[Figure 6: left panel, mean Kendall's τ for both models in the three missing-data conditions (50/0, 25/25, and 0/50 top N/memory missing data); right panel, Kendall's τ of the memory model plotted against Kendall's τ of the top N model for all 300 datasets.]

Figure 6. The performance of the top N model and the memory model fitted on 300 simulated datasets with different types of missing data: one type with only top N missing data, one type with half top N missing data and half memory missing data, and one type with only memory missing data. The left figure shows the mean Kendall's tau for both models fitted on the three different types of datasets. The right figure compares the Kendall's tau of both models for all 300 datasets. The gray area indicates a better performance for the memory model while the white area indicates a better performance for the top N model.

This simulation study showed that the Thurstonian memory model can account for more types of data than the Thurstonian top N model. Next, we will compare the performance of both models using data that was gathered by letting people rank the top 10 biggest US food chains.
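The tau distance used in these comparisons (Kendall, 1938) counts pairwise order disagreements between two rankings; a minimal sketch:

```python
from itertools import combinations

def kendall_tau_distance(r1, r2):
    """Number of item pairs whose relative order differs between two
    rankings. r1 and r2 list the same items, ordered best-to-worst."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    return sum(1 for a, b in combinations(r1, 2)
               if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0)
```

Identical rankings give a distance of 0, and a fully reversed ranking of n items gives n(n − 1)/2; extending the measure to partial rankings requires extra care (Fagin, Kumar, & Sivakumar, 2003).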

The Biggest Food Chains in the US

To test the Thurstonian memory model on behavioral data, we conducted an experiment in which we gathered rankings for a number of general knowledge questions. A total of 20 participants, recruited through MTurk (Buhrmester, Kwang, & Gosling, 2011), completed the experiment. We chose the rankings for one particular question, as this ranking best illustrated the strengths of the model. For this question all participants were asked to rank the top 10 biggest US food chains. The participants were not given a list of possible food chains, so they had to recall the food chains from memory. The experiment resulted in 20 individual rankings and 36 unique items.

[Figure 7: two scatter panels, "Top N Model" and "Memory Model", plotting the predicted ranking against the true ranking; a third panel, "Predictive Validity Both Models", showing the Kendall's τ distributions of both models.]

Figure 7. A comparison between the true ranking that belongs to the question "what are the 10 biggest US food chains" plotted against both the predicted ranking of the top N model (upper left) and the predicted ranking of the memory model (upper right). The bottom figure shows the same comparison, but now summarized through the use of the Kendall's tau. The dark shaded distribution represents the Kendall's tau distribution of the memory model, while the light shaded distribution represents the Kendall's tau distribution of the top N model.

We drew posterior samples for the latent ground truth parameter using both the Thurstonian memory model and the Thurstonian top N model. Figure 7 shows the predicted ranking of the top N model plotted against the true ranking of US food chains on the upper left and the predicted ranking of the memory model plotted against the true ranking of US food chains on the upper right. The line in both figures represents a perfect linear relation between the predicted ranking and the true ranking. The figures show that there is less variation around this line for the memory model than for the top N model. This indicates that the memory model has decreased the bias in the prediction of the top N model. The bottom figure presents the distribution of the tau distances of both models. It shows that the tau distance between the memory model predicted ranking and the true ranking is smaller than the tau distance between the top N model predicted ranking and the true ranking. Therefore, we can conclude that the memory model gives a better prediction of the ground truth than the top N model for this question.

Discussion

The simulation study showed that the Thurstonian memory model is more versatile than the Thurstonian top N model, taking into account different processes that could be responsible for items that are not ranked. The behavioral data study illustrated the strength of the Thurstonian memory model compared to the Thurstonian top N model using an example dataset. For this dataset, the memory model was able to decrease the bias that was present in the top N model's estimate of the latent ground truth. However, before we chose this example dataset, we tested the performance of both models on other datasets, and the memory model did not always give the better prediction of the two. Because the actual memory process is much more complicated than the process in the current memory model, the model is still sensitive to deviations from the assumed process. Although the current memory model is a good first step towards describing various types of data, there is still a lot of work to do. One assumption of this model that is not always tenable is that all individual latent estimates are derived from one latent ground truth. In the next section, we will present an extension to the Thurstonian base model that allows this assumption to be dropped.

Extension 3: Thurstonian Shared Truth Model

All models that were discussed up to this point assume that people base their ranking on a single latent truth. This is a stringent assumption for general knowledge and prediction tasks, as you cannot rule out the possibility that some people start off from the wrong latent ground truth due to faulty heuristic strategies. It is an even more stringent assumption when trying to aggregate rankings for opinion-based questions: inherent to an opinion is that it is not shared by everyone in the population. Therefore, a model that assumes there to be one opinion is not appropriate for these types of questions. To make the Thurstonian model applicable to a wider range of questions we have developed a model that allows there to be multiple shared truths: the Thurstonian shared truth model. An approach that is designed to allow multiple shared truths in data aggregation problems is Cultural Consensus Theory (CCT; Batchelder & Romney, 1988; Romney, Weller, & Batchelder, 1986). CCT was first used to find answer patterns that are representative of certain cultures. We implemented this idea in the Thurstonian shared truth model. Figure 8 shows a graphical representation of this model; its specification (with i indexing items, j people, and g groups) is:

ψig ∼ Gaussian(0, 0.001)
zj ∼ Categorical(1/G, …, 1/G)
µij ← ψi,zj
σj ∼ Uniform(0, 10)
xij ∼ Gaussian(µij, 1/σj²)
yij ← Rank(xij)

Figure 8. Graphical representation of the Thurstonian shared truth model.

The model starts off with the shared latent ground truth, ψig, which depends on the group, g, a person is in. Because the model lets you choose the number of latent truths you want to assume, you can also decide to assume a single latent ground truth, in which case the Thurstonian shared truth model reduces to the Thurstonian top N model. The group membership parameter, zj, determines on which shared ground truth an individual bases their latent estimates. Subsequently, the model uses the Thurstonian process of the Thurstonian base model to generate the individual observed rankings.

We will first test whether the Thurstonian shared truth model is able to differentiate between different shared ground truths using a simulation study. Next, we will check whether the model can produce sensible results using behavioral data in which it is likely that there are multiple shared truths.

Simulation Study

We generated three datasets from the Thurstonian shared truth model: one for which we assumed there to be one latent ground truth, one for which we assumed two latent ground truths, and one for which we assumed three latent ground truths. For all datasets we assumed that there were 20 items that were all ranked by 40 people. Dependent on the number of ground truths, we drew 20, 40, or 60 latent ground truth values from a normal distribution, ψig ∼ Gaussian(0, 0.001). Subsequently, dependent on the number of assumed ground truths, we drew a random sample from a discrete distribution, zj ∼ Discrete(1 : g), to determine the group membership. We generated observed rankings in the same way as in the simulation study of the Thurstonian top N model section, only now using the group membership to determine a person's latent ground truth. Next, we drew posterior samples for every dataset using three models: one model that assumed one ground truth, g = 1, one that assumed two ground truths, g = 2, and one that assumed three ground truths, g = 3.
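Under the model specification in Figure 8 (precision-parameterized Gaussians, as in JAGS), this generative scheme can be sketched in a few lines; the function name and the rank direction (largest latent value gets rank 1) are illustrative assumptions:

```python
import numpy as np

def simulate_shared_truth(n_items, n_people, n_groups, seed=0):
    """Generate full rankings from the shared truth process:
    psi_ig ~ Gaussian(0, precision 0.001), z_j ~ Categorical(1/G, ..., 1/G),
    x_ij ~ Gaussian(psi_{i, z_j}, precision 1/sigma_j^2)."""
    rng = np.random.default_rng(seed)
    psi = rng.normal(0.0, np.sqrt(1.0 / 0.001), size=(n_items, n_groups))
    z = rng.integers(0, n_groups, size=n_people)
    sigma = rng.uniform(0.0, 10.0, size=n_people)
    rankings = np.empty((n_people, n_items), dtype=int)
    for j in range(n_people):
        # precision 1/sigma^2 corresponds to standard deviation sigma
        x = rng.normal(psi[:, z[j]], sigma[j])
        rankings[j] = np.argsort(np.argsort(-x)) + 1  # rank 1 = largest x
    return psi, z, rankings
```

Fitting the three models to data generated this way with one, two, or three groups reproduces the setup of the simulation study.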

Because the models make different assumptions about the number of latent ground truths, it is impossible to determine the performance of a model by comparing the predicted ground truth(s) to the true ground truth(s). Instead, we compared how the three models perform using posterior predictive checks (Gelman, Meng, & Stern, 1996). Because the Thurstonian shared truth model is a generative model, it allows us to predict what the data should look like given the parameter posteriors that were found. If the model describes the data adequately, the predicted data should be very similar to the actual data. The comparison between the data and the predicted values for the data is called a posterior predictive check. However, the posterior predictive check does not incorporate a proper penalty for model complexity. In general, the more truths you introduce to the Thurstonian shared truth model, the more variability in the data it is able to explain. To account for model complexity we therefore use the principle of parsimony, which says that when two models make similar predictions, you should choose the less complex model (Myung, 2000). So, if the model that assumes two truths describes the same amount of variability in the data as the model that assumes one truth, we would choose the model that assumes one truth.
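The scatter-plot checks in Figure 9 can also be summarized numerically; the sketch below (a hypothetical helper, not from the paper) compares each observed rank to the mean posterior predictive rank and reports an R²-style proportion of rank variability the model reproduces:

```python
import numpy as np

def predictive_check(observed, predicted_samples):
    """observed: (people, items) array of observed rank positions.
    predicted_samples: (draws, people, items) posterior predictive ranks.
    Returns (mean predicted rank per cell, proportion of variability
    in the observed ranks reproduced by the predictions)."""
    mean_pred = predicted_samples.mean(axis=0)
    resid = ((observed - mean_pred) ** 2).sum()
    total = ((observed - observed.mean()) ** 2).sum()
    return mean_pred, 1.0 - resid / total
```

A value near 1 means the dots hug the diagonal; since adding truths tends to raise this number, the principle of parsimony is needed on top of the check.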

Figure 9 shows the posterior predictive checks for the three different datasets (horizontal) using the three different models (vertical). If the dots are close to the line, the model is able to explain a lot of variability in the data. The top of the figure shows the posterior predictive checks of the three models fitted on data that was generated under the assumption that there was a single ground truth. All three models seem to describe the same amount of variability for this dataset. Using the principle of parsimony, we would therefore choose the one truth model as the correct model. The middle of the figure shows the posterior predictive checks of the three models fitted on data that was generated under the assumption that there were two ground truths. The posterior predictive check of the model that assumes one truth describes markedly less variability than the other two, while the posterior predictive checks of the models that assume two and three truths describe the same amount of variability. Therefore, we would choose the two truths model as the correct model. The bottom of the figure shows the posterior predictive checks of the three models fitted on data that was generated under the assumption that there were three ground truths. Because the posterior predictive checks of the models that assume one truth and two truths both describe markedly less variability than the posterior predictive check of the model that assumes three ground truths, we would choose the three truths model as the correct model. Using the posterior predictive checks and the principle of parsimony, we would have chosen the correct model for every dataset. This simulation study shows that the Thurstonian shared truth model is able to pick up on the true number of ground truths. However, it is important to check whether the ground truths that the model picks up on are sensible.
[Figure 9: a 3×3 grid of scatter plots of the observed ranking against the predicted ranking, with rows for the data generated using one, two, or three truths and columns for the model assumption of one, two, or three truths.]

Figure 9. A comparison between the observed data that was generated under three different assumptions (one truth, two truths, and three truths) and the predicted data under the models with three different assumptions (one truth, two truths, and three truths).

Therefore, we fitted the Thurstonian shared truth model on behavioral data for which you would expect a certain differentiation in ground truths.

The Worst President of the USA?

To test whether the Thurstonian shared truth model is able to pick up on sensible ground truths when there are multiple ground truths, we acquired a dataset from the website ranker.com in which we would expect a certain trend in the different ground truths. We chose a dataset that let 99 people rank the top N worst Presidents of the USA. Because the answer to this question would seem to depend on everybody’s personal political preference and because the USA has a two party system, you could expect to find at least two latent ground truths for this dataset. To compare the inferences of the one truth model with the inferences of the two truths model, we drew posterior samples from both models.


[Figure 10: two panels, "Model Inferences" and "Observed Data", for 15 Presidents as labeled in the figure: Benjamin Harrison, William Henry Harrison, Ulysses S. Grant, John Tyler, Martin Van Buren, Franklin Pierce, Millard Fillmore, Jimmy Carter, Barack Obama, Richard Nixon, Herbert Hoover, James Buchanan, Warren G. Harding, George Bush, Andrew Johnson.]

Figure 10. The top 15 worst Presidents of the USA inferred by the one truth model. The left figure shows the marginal posterior distributions of the 15 highest ranked latent ground truth parameters µi. The right figure shows the proportion of times each President was ranked in each position.

Figure 10 shows the 15 highest ranked Presidents as inferred by the one truth model. The "model inferences" panel shows the marginal posterior distributions of the latent ground truth of the 15 Presidents and the "observed data" panel shows the proportion of times each President was ranked in each position. This top 15 consists of both highly ranked liberal Presidents (e.g., Jimmy Carter, Barack Obama) and highly ranked conservative Presidents (e.g., George Bush, Andrew Johnson). If people based their individual estimates on this latent ground truth, it would seem that political preference does not play a role. However, using the two truths model we can easily check whether it makes more sense to assume multiple shared ground truths.

Figure 11 shows the 15 US Presidents that are ranked highest by group 1 on the top and the 15 US Presidents that are ranked highest by group 2 on the bottom. The "model inferences" panel shows the marginal posterior distributions of the latent ground truth of group 1 in red and the latent ground truth of group 2 in blue. The "observed data" panel shows the proportion of times each President was ranked in each position, separately for group 1 (in red) and group 2 (in blue). The model seems to have picked up on two distinctive shared truths. While the top 5 of group 1 consists of Presidents that are considered to be more towards the liberal side of the spectrum, the top 5 of group 2 consists of Presidents that are considered to be more towards the conservative side of the spectrum. Therefore, you could infer that group 1 contains a conservative group of people who rank the liberal Presidents high, and group 2 a liberal group of people who rank the conservative Presidents high. Of course, this is a subjective and probably biased way of putting names to the groups. However, it is clear that the two truths model is able to differentiate between two qualitatively different latent ground truths.

[Figure 11: "Model Inferences" and "Observed Data" panels for the two groups. Group 1 panel items as labeled in the figure: John Tyler, Franklin Pierce, Millard Fillmore, Andrew Jackson, Warren G. Harding, Herbert Hoover, Andrew Johnson, Richard Nixon, Woodrow Wilson, James Buchanan, Franklin D. Roosevelt, Lyndon B. Johnson, Bill Clinton, Jimmy Carter, Barack Obama. Group 2 panel items: Benjamin Harrison, Barack Obama, Ulysses S. Grant, William Henry Harrison, Jimmy Carter, John Tyler, Martin Van Buren, Franklin Pierce, Millard Fillmore, Richard Nixon, Herbert Hoover, Warren G. Harding, James Buchanan, Andrew Johnson, George Bush.]

Figure 11. The top 15 worst Presidents of the USA inferred by the two truths model. The top two figures show the inferred top 15 of the first group, while the bottom two figures show the inferred top 15 of the second group. Both left figures show the marginal posterior distributions of the 15 highest ranked latent ground truth parameters µi and both right figures show the proportion of times each President was ranked in each position. The red shaded distributions/dots represent the parameters/data of the first group, while the blue shaded distributions/dots represent the parameters/data of the second group.

Discussion

The simulation study showed that the Thurstonian shared truth model is able to pick up on different numbers of ground truths in the data. We also introduced a way of doing a preliminary test to check how many ground truths the model should assume. The behavioral data illustrated how the Thurstonian shared truth model can pick up on two clearly distinct ground truths that are both interpretable. However, deciding on the number of truths your model should assume is not always as clear-cut as in these examples. In real datasets, more noise will be present, making it much harder to decide which model to choose using the posterior predictive check. Also, if the posterior predictive check keeps showing a lot of unexplained variability even after you have added a number of ground truths, do you keep adding ground truths until there is no improvement anymore? The ambiguity of the posterior predictive checks shows that we need to work on a better way of model comparison in the future, for example by developing a Bayes factor comparison between two models with different numbers of ground truths.

General Discussion

In this paper, we introduced three extensions to the Thurstonian base model. Every extension makes the Thurstonian model suitable for a new type of data. The first extension, the Thurstonian top N model, allows people to rank only the N highest ranked items. We showed that the model's performance is on par with the performance of the Borda count, a statistical method for aggregating rankings. The second extension, the Thurstonian memory model, tries to estimate whether items are remembered or not, making the model suitable for rankings from people who had no access to a predefined list of items. We showed that once memory retrieval becomes part of the decision-making process, the Thurstonian memory model performs better than the Thurstonian top N model. The third extension, the Thurstonian shared truth model, allows there to be multiple ground truths. We showed that the model can pick up on the number of ground truths in the data, and that the model is able to infer multiple interpretable ground truths.

Although the three extensions introduce new ways of dealing with certain individual differences, they can still be improved in several ways. The Thurstonian memory model introduces a Rasch model to accommodate the memory retrieval process. While this is a good first step in taking the memory retrieval process into account, the Rasch model is a statistical model and not a cognitive model; it does not describe the actual process by which items are retrieved from memory. Therefore, the parameters in the model lack a meaningful interpretation. Because the Rasch model does not describe the underlying process, it is also difficult to extend the model to include more complex memory retrieval processes. To further advance the Thurstonian memory model, a cognitive model that describes the memory retrieval process should therefore be incorporated in place of the Rasch model.

Another extension that could be improved is the Thurstonian shared truth model. The Thurstonian shared truth model is able to estimate multiple latent truths. To infer the number of truths in the data, we proposed comparing models that assume different numbers of truths using a posterior predictive check. Although this procedure could be improved using other model comparison methods that account for model complexity in a more principled way (e.g., the Bayes factor), it would be best if the model itself estimated the number of truths in the data. Because cognitive models are not able to deal with adaptive problems like these, we should incorporate machine learning techniques. One way to estimate the number of truths in the data using the Thurstonian model would be to combine the cognitive modeling approach with the discriminative machine learning approach (Bishop & Lasserre, 2007).

The Thurstonian base model introduced a way of combining rankings from different sources to find the true answer to a question. Because the Thurstonian model explicitly describes the decision-making process with which the data is generated, individual deviations from this process can be incorporated in a relatively straightforward fashion. The three extensions we introduced improve this process by taking into account individual differences in the decision-making process that the Thurstonian base model ignores. By doing so, the model is more resilient to deviations in the process and will accommodate a wider variety of data.


References

Batchelder, W. H., & Romney, A. K. (1988). Test theory without an answer key. Psychometrika, 53 , 71–92.

Bishop, C. M., & Lasserre, J. (2007). Generative or discriminative? Getting the best of both worlds. In J. M. Bernardo et al. (Eds.), Bayesian statistics (Vol. 8, pp. 3–24). Oxford, UK: Oxford University Press.

Bond, T. G., & Fox, C. M. (2001). Applying the Rasch model: Fundamental measurement in the human sciences. Hillsdale, NJ: Lawrence Erlbaum.

Buhrmester, M., Kwang, T., & Gosling, S. D. (2011). Amazon’s Mechanical Turk: A new source of inexpensive, yet high-quality, data? Perspectives on Psychological Science, 6 , 3–5.

Cheung, M. W.-L., & Chan, W. (2002). Reducing uniform response bias with ipsative measurement in multiple-group confirmatory factor analysis. Structural Equation Modeling, 9, 55–77.

Fagin, R., Kumar, R., Mahdian, M., Sivakumar, D., & Vee, E. (2006). Comparing partial rankings. SIAM Journal on Discrete Mathematics, 20, 628–648.

Fagin, R., Kumar, R., & Sivakumar, D. (2003). Comparing top k lists. SIAM Journal on Discrete Mathematics, 17 , 134–160.

Galton, F. (1907). Vox populi. Nature, 75 , 450–451.

Gelman, A., Meng, X.-L., & Stern, H. (1996). Posterior predictive assessment of model fitness via realized discrepancies. Statistica Sinica, 6 , 733–760.

Johnson, T. R., & Kuhn, K. M. (2013). Bayesian Thurstonian models for ranking data using JAGS. Behavior Research Methods, 45, 857–872.

Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30, 81–93.

Lee, M. D., Steyvers, M., De Young, M., & Miller, B. (2012). Inferring expertise in knowledge and prediction ranking tasks. Topics in Cognitive Science, 4 , 151–163.

Lee, M. D., Steyvers, M., & Miller, B. (2014). A cognitive model for aggregating people’s rankings. PloS One, 9 , e96431. doi: 10.1371/journal.pone.0096431

Lee, M. D., & Wagenmakers, E.-J. (2013). Bayesian cognitive modeling: A practical course. Cambridge, UK: Cambridge University Press.

Marden, J. I. (1996). Analyzing and modeling rank data. Boca Raton, FL: CRC Press.

Murphy, K. R., Jako, R. A., & Anhalt, R. L. (1993). Nature and consequences of halo error: A critical analysis. Journal of Applied Psychology, 78 , 218–225.

Myung, I. J. (2000). The importance of complexity in model selection. Journal of Mathematical Psychology, 44 , 190–204.

Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In K. Hornik, F. Leisch, & A. Zeileis (Eds.), Proceedings of the 3rd international workshop on distributed statistical computing (DSC 2003) (pp. 20–22).

Romney, A. K., Weller, S. C., & Batchelder, W. H. (1986). Culture as consensus: A theory of culture and informant accuracy. American Anthropologist , 88 , 313–338.

Steyvers, M., Miller, B., Hemmer, P., & Lee, M. D. (2009). The wisdom of crowds in the recollection of order information. In J. Lafferty & C. Williams (Eds.), Advances in neural information processing systems (Vol. 23, pp. 1785–1793). Cambridge, MA: MIT Press.

Surowiecki, J. (2005). The wisdom of crowds. New York, NY: Random House LLC.

Thurstone, L. L. (1927). A law of comparative judgment. Psychological Review, 34, 273–286.

Van Herk, H., Poortinga, Y. H., & Verhallen, T. M. M. (2004). Response styles in rating scales: Evidence of method bias in data from six EU countries. Journal of Cross-Cultural Psychology, 35, 346–360.
