
Visualizations to help people interpret the Bayes factor : the effect of visualizations on estimation of posterior probabilities


Academic year: 2021


Visualizations to help people interpret the Bayes factor:

The effect of visualizations on estimation of posterior probabilities

Bachelor thesis 2016

Name: Koen Derks

Student number: 10518215

Supervisor: Ravi Selker

Abstract

In this study the effectiveness of different visualizations for interpreting the Bayes factor was investigated. Correct interpretation was defined as accurately estimating the posterior probability of one of the two hypotheses. Participants estimated this posterior probability for Bayes factors ranging from 0 to 50. In a questionnaire on Mechanical Turk, participants were shown one of three visualizations: a pie chart, a stacked bar chart or a dot chart. In addition, there was a condition in which participants saw the Bayes factor only as a number. The results indicated that visualizations led to better estimates of the posterior probability than the Bayes factor as a number and thus improved the interpretation of the Bayes factor. The visualizations differed in how effectively they improved the interpretation.

Introduction

Test statistics in today’s research are widely misinterpreted (Colquhoun, 2014; Hoekstra, Morey, Rouder & Wagenmakers, 2014). These misinterpretations are a serious issue, because researchers base their overall conclusions on the interpretation of these test statistics, which serve as the basis for scientific claims. If the interpretation is flawed, the conclusions based on it are unlikely to be representative of the research, and this affects the whole research field negatively. That is why it is important to prevent misinterpretation. Misinterpretation often occurs in frequentist statistics, the most widely used type of statistics today (Sterne & Smith, 2001). The use of the p-value in scientific research is widespread (Hoekstra, Morey, Rouder & Wagenmakers, 2014), but most researchers rely blindly on the arbitrary cut-off of the alpha level without seeing the problems associated with it. One problem is that the p-value depends not only on the observed data but also on data that were never observed, namely those from a hypothetical infinite repetition of the experiment (Wagenmakers, 2007). As a result it is widely misinterpreted, because researchers often believe that the p-value is based only on their observed data (Hoekstra, Morey, Rouder & Wagenmakers, 2014). Another problem with the p-value is that it tends to overestimate the evidence against the null hypothesis. A p-value only tells us how probable data at least as extreme as those observed are if the null hypothesis is true; it says nothing about the alternative hypothesis (Colquhoun, 2014; Hoekstra, Morey, Rouder & Wagenmakers, 2014; Masson, 2011). The evidence is also dichotomized: the hypothesis at hand is treated as either true or false, with no room to nuance this claim in terms of probability. Confidence intervals are widely misinterpreted as well. It has been shown that researchers and students have no reliable knowledge of the correct interpretation of confidence intervals. Hoekstra, Morey, Rouder & Wagenmakers (2014) gave researchers and students six statements about confidence intervals, all of which were false. The participants endorsed a high proportion of the statements, and it was concluded that they did not know the correct interpretation of a confidence interval. This raises the question of what we can infer from frequentist statistics when they are so flawed and hard to interpret.

A solution to the problems of the p-value and the misinterpretation of frequentist test statistics is the use of Bayesian statistics, more specifically the Bayes factor. The Bayes factor is the probability of observing the data given that the null hypothesis is true, divided by the probability of observing the data given that the alternative hypothesis is true (Wagenmakers, 2007). Since the Bayes factor only uses observed data, the problem with unobserved data that we saw for the p-value is no longer an issue. However, Bayes factors are based on odds (Jarosz & Wiley, 2014) and people often misinterpret odds as well. This is because judging the strength of the evidence involves a subjective element (Davies, Combie & Tavakoli, 1998; Grimes & Schulz, 2008). There are different views on what counts as a large odds ratio and what counts as a small one. As a result, the Bayes factor is often misinterpreted.

A way to prevent the misinterpretation of test statistics is to integrate statistical information in a visualization. Visualization of data increases the knowledge people can derive from these data (Van Wijk, 2005), and visualizations have been shown to be effective and efficient (Van Wijk, 2005). For example, in weather forecasting people make more accurate forecasts when visualizations are used (Nadav-Greenberg et al., 2008). In chemistry, students’ understanding of chemical representations improved substantially when a visualization was displayed (Wu et al., 2001). Because people have difficulty interpreting statistical information in Bayesian reasoning in particular (Micallef, Dragicevic & Fekete, 2012), there is a need for a visualization of the Bayes factor. This might help improve the interpretation of the Bayes factor. However, evidence for this claim is scarce. Micallef, Dragicevic & Fekete (2012) carried out two experiments in which participants read a story based on chances and were then shown a visualization of the corresponding odds. They found in both experiments that the participants’ comprehension of the visualization was remarkably lower than in previous research (Cole, 1989; Cole et al., 1989) and that the visualization had no effect on the participants’ reasoning. However, they found that removing the numbers from the stories clearly improved the participants’ interpretation of the visualization (Micallef, Dragicevic & Fekete, 2012). It must be said that they focused on different information than the Bayes factor. These mixed findings show that there is no clear conclusion yet about the effect of visualization on Bayesian interpretation. That is why it is important to continue research on the effectiveness of visualization for Bayesian interpretation and to reach a clear conclusion about the most effective way to help people interpret Bayesian statistics. A visualization of odds might be closely related to a visualization of the Bayes factor, but there is a fundamental difference between the two. A Bayes factor is a ratio that involves two hypotheses, in contrast to a single odds value. Thus, a visualization might help in the case of the Bayes factor. People will be able to see the contrast with the alternative hypothesis when looking at a visualization. When the Bayes factor is given as a number, people only see evidence applied to one hypothesis. Visualizing both hypotheses can result in a better understanding of the ratio (Van Wijk, 2005).

In this study the effect of three visualizations of the Bayes factor was investigated to determine whether people interpret the Bayes factor more accurately with a visualization than when shown only a number. The first hypothesis is that a visualization will result in less interpretational error than the Bayes factor as a number. The second hypothesis is that the visualizations will differ in the measured interpretational error. A pilot study was conducted to choose the best three out of four tested visualizations. Subsequently an experiment was conducted based on the pilot study.

Method

Participants

For the experiment 453 participants were recruited via Mechanical Turk, a crowdsourcing platform operated by Amazon.com (Berinsky, Huber & Lenz, 2011). Participants signed up for the research voluntarily and were compensated with $1.50. The minimum age of the participants was 18 years. The participants had to live in the United States of America and English had to be their native language.

Materials

In the experiment three visualizations were tested. The visualizations were chosen on the criterion that they all represent relative information, like the Bayes factor. Because the pilot showed that the pie chart, the stacked bar chart and the dot chart resulted in the best estimates of the correct probabilities, these were included in the experiment, see Figure 1. Four conditions were created. In the first condition participants saw the Bayes factor as a number. In the second condition participants saw the pie chart, in the third the stacked bar chart, and in the final condition the dot chart. Participants were shown the scenario in the Appendix before the questions began. The first question concerned the correct interpretation, measured by the estimate of the posterior probability (question 1a, see Appendix). Interpretation was defined as correctly estimating the probability that one of two hypotheses was correct, and was measured by asking participants to rate this probability on a scale from 1 to 100. This was done for different Bayes factors, ranging from 0 to 50, see Appendix. For Bayes factors below one the hypotheses were switched, so that the Bayes factor was above one in favour of the competing hypothesis. In the Bayes factor condition, the Bayes factor was presented as: “The data show that Nedril is (BF) more likely to be a better drug than Mella”, see Appendix. The second question asked participants to estimate the strength of evidence indicated by the Bayes factor (question 1b, see Appendix), according to Jeffreys’ scale (Jeffreys, 1961). The answer options were “Weak”, “Moderate”, “Strong” and “Very strong”. For Bayes factors below one, the names of the drugs in the question were switched so that the strength of evidence was measured for the favoured hypothesis.

Figure 1. Visualizations used in the experiment for BF = 4. From left to right: pie chart, stacked bar chart and dot chart.

Procedure

In the experiment, participants were randomly assigned to one of the four conditions in Qualtrics. After answering some demographic questions about their language, colour vision deficiencies and education, they were shown the scenario in the Appendix. This scenario described a hypothetical situation in which a disease on a planet had killed 10% of the population. Two drugs had been developed and researched, and it was up to the participants to help the researchers interpret the data they found. When they finished reading the story, participants answered one example question to make sure they had a good understanding of the scenario. In this example question the Bayes factor was 1, so the corresponding probability to be estimated was 50%. The exact question posed to the participants was the same throughout the experiment (question 1a). After the example question participants continued to the experiment, which consisted of 35 counterbalanced questions about the estimation of posterior probabilities. In each item a visualization and question 1a were presented to the participant. In the Bayes factor condition, the BF statement from the Appendix was shown instead of a visualization. The participants entered a number with at most one decimal. They then answered the second question about the strength of evidence (question 1b). After completing the 35 questions, the participants were reminded that they could send any questions by email to the researchers or the ethical committee, and were then thanked for their participation.

Results

Of the 453 participants who completed the experiment, data from 10 participants were excluded because English was not their native language. An additional 10 observations were excluded because the participants took part in the survey twice. Data from 16 participants were excluded because they did not live in the United States of America. A further 19 participants deviated more than 2 standard deviations from the average time spent on the survey (mean = 602, SD = 238). Because they might have used sources other than their own thinking, they were not included in the analysis. Another 31 participants were not included in the analysis because they answered the question in the experiment where BF = 1 with a percentage lower than 25 or higher than 75, which suggests that they did not understand the concept or did not participate seriously. The participants should have answered this question correctly, because the answer had already been given to them in the example question. Ultimately, data from 367 participants were used for the final analysis. The Bayes factor condition consisted of 74 participants, the pie condition of 103 participants, the stacked bar condition of 92 participants and the dot condition of 98 participants.

Each item had a true posterior probability, calculated with the formula $P(H \mid D) = \frac{BF}{BF + 1}$, in which $P(H \mid D)$ is the true posterior probability and $BF$ is the Bayes factor used in the item. This formula follows from the fact that the two parts of the ratio sum to $BF + 1$, since the odds for the competing hypothesis are set to one; it assumes that both hypotheses are equally likely a priori. For example, a Bayes factor of 4 gives a posterior probability of $\frac{4}{5} = 0.8$ that the hypothesis we are interested in is correct.
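The conversion from a Bayes factor to a posterior probability can be illustrated with a minimal Python sketch. The function name `posterior_from_bf` is hypothetical and not part of the thesis materials, and the calculation assumes equal prior odds for the two hypotheses.

```python
def posterior_from_bf(bf: float) -> float:
    """Posterior probability of the favoured hypothesis.

    Assumes equal prior odds, so the posterior odds equal the Bayes factor
    and the posterior probability is BF / (BF + 1).
    """
    if bf < 0:
        raise ValueError("A Bayes factor cannot be negative")
    return bf / (bf + 1.0)


# A few Bayes factors inside the range used in the experiment.
for bf in (0.25, 1, 4, 50):
    print(f"BF = {bf}: posterior = {posterior_from_bf(bf):.3f}")
```

A Bayes factor of 1 maps to 0.5, which is the value participants were expected to give in the example question.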

For each item within a condition, the mean of the estimated posterior probabilities was calculated for every true posterior probability. Plotting these mean estimates against the correct probabilities for each visualization gives a curve that shows how each visualization affects the estimation of probabilities, see Figure 2. The line for correct estimation is a straight line with a slope of one and an intercept of zero, on which the estimated posterior probability equals the true posterior probability. When the estimation curve of a visualization lies closer to this line than the curve of the Bayes factor condition, it indicates that the visualization had a positive impact on the estimation of posterior probabilities and thus helped people interpret the Bayes factor better.

Figure 2. Estimated posterior probabilities plotted against true posterior probabilities for every condition. The Bayes factor condition (grey) seems to lie further away from the correct estimation line than the pie chart (red), bar chart (blue) and dot chart (green).

To compare the visualizations with each other and with the Bayes factor alone, Brier scores (Ferro, 2007) were computed for every Bayes factor in every condition. The Brier score is a proper scoring rule that measures the accuracy of probabilistic predictions. The Brier scores were calculated according to the formula $BS = \frac{1}{N}\sum_{t=1}^{N}(f_t - o_t)^2$. In this formula, $N$ is the number of participants, $t$ is the index of the participant, $f_t$ is the posterior probability estimated by the participant and $o_t$ is the correct posterior probability corresponding to the Bayes factor. The Brier score is thus the mean squared difference from the correct probabilities over all participants in a condition. A low Brier score is better than a high Brier score, since it indicates a smaller difference from the correct probabilities: 0 is the best possible Brier score and 1 is the worst. The Brier score therefore gives an indication of the effectiveness of the Bayes factor representation.
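As an illustration of the Brier score calculation, the following Python sketch computes the score for one item. The estimates are made-up values rather than data from the experiment, the function name `brier_score` is hypothetical, and the sketch assumes that the answers on the 1 to 100 scale have already been divided by 100.

```python
def brier_score(estimates, true_probability):
    """Mean squared difference between estimated and true probabilities.

    `estimates` holds one estimate per participant on a 0-1 scale (f_t);
    `true_probability` is the correct posterior for the item (o_t), which is
    the same for every participant who answered that item.
    """
    if not estimates:
        raise ValueError("At least one estimate is required")
    squared = [(f - true_probability) ** 2 for f in estimates]
    return sum(squared) / len(squared)


# Hypothetical estimates from four participants for an item with BF = 4.
estimates = [0.70, 0.80, 0.90, 0.60]
true_probability = 4 / (4 + 1)  # 0.8, assuming equal prior odds
print(round(brier_score(estimates, true_probability), 3))
```

Squaring the differences means that one large misjudgement weighs more heavily than several small ones.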

Test results

For every condition the mean and standard deviation of the squared difference were calculated, see Table 1. Figure 4 shows the Brier scores with confidence intervals for every Bayes factor representation.

Table 1

Number of participants (N), mean and standard deviation of the squared difference for every condition.

Condition       N     Mean (Brier score)   SD
Bayes factor    74    0.111                0.068
Pie             103   0.015                0.027
Stacked bar     92    0.031                0.038
Dot             98    0.041                0.060

An independent-samples Bayesian t-test with a noninformative Jeffreys prior was conducted on the Brier scores in the Bayes factor condition and the pie chart condition. A noninformative Jeffreys prior assigns relatively high probability to the tails of the distribution, see Figure 3. The Bayesian t-test tests the null hypothesis that the mean difference of a normal population is 0. Specifically, the Bayes factor compares two hypotheses: that the standardized effect size is 0, or that the standardized effect size is not 0 (Morey & Rouder, 2015). A Cauchy prior was placed on the standardized effect size. The Cauchy distribution is a probability distribution similar in shape to the normal distribution, but with heavier tails on both sides, see Figure 3.
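The difference in tail weight between the Cauchy and the normal distribution can be made concrete with a short Python sketch. The scale of the Cauchy prior used in the analysis is not stated here, so a scale of 1 is used purely as an assumption, and both helper functions are hypothetical.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def cauchy_pdf(x, loc=0.0, scale=1.0):
    """Density of the Cauchy distribution at x (scale = 1 is an assumption)."""
    z = (x - loc) / scale
    return 1.0 / (math.pi * scale * (1.0 + z * z))


# Near zero the normal density is higher; far from zero the Cauchy density is.
for x in (0, 1, 2, 4):
    print(f"x = {x}: normal = {normal_pdf(x):.5f}, cauchy = {cauchy_pdf(x):.5f}")
```

Because the Cauchy density decays much more slowly, a Cauchy prior keeps non-negligible mass on large effect sizes.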

Figure 3. Density plots of the Jeffreys prior on the mean difference (left) and the Cauchy prior compared to the normal distribution on the effect size (right).

Figure 4. Brier scores and confidence intervals for BF condition, pie condition, bar condition and dot condition. The visualizations all differ from the BF condition.

There was a difference in the Brier scores between the Bayes factor condition and the pie chart condition. The Bayes factor indicated that the data were 1.6e24 times more likely under the alternative hypothesis than under the null hypothesis of no difference. The same analysis was used to compare the Brier scores in the Bayes factor condition and the stacked bar chart condition. There was a difference in the Brier scores between these conditions as well. The Bayes factor indicated that the data were 5.3e14 times more likely under the alternative hypothesis than under the null hypothesis of no difference. The Bayes factor condition was also compared to the dot chart condition. Again there was a difference in the Brier scores. The Bayes factor indicated that the data were 3.1e8 times more likely under the alternative hypothesis than under the null hypothesis of no difference. Results are displayed in Table 2. These results indicate that participants in the pie chart, stacked bar chart and dot chart conditions estimated the posterior probabilities more accurately than participants in the Bayes factor condition.

Table 2

Means and standard deviations of the squared difference in all conditions and Bayes factors for the conducted t-test comparisons of the Brier scores.

Comparison               Mean (SD)        Bayes factor
BF condition             0.111 (0.068)
Pie condition            0.015 (0.027)    1.6e24

BF condition             0.111 (0.068)
Stacked bar condition    0.031 (0.038)    5.3e14

BF condition             0.111 (0.068)
Dot condition            0.041 (0.060)    3.1e8

Note. The Bayes factor column contains Bayes factors in favour of the alternative hypothesis. Alternative hypothesis = mean difference is not 0. Decimals are indicated by dots.

The visualizations were also compared with each other to see if the Brier scores differed between them. The results are displayed in Table 3. The Bayes factor for the comparison between the pie chart and the stacked bar chart indicated that the data were 22 times more likely under the alternative hypothesis than under the null hypothesis of no difference. The Bayes factor for the comparison between the pie chart and the dot chart indicated that the data were 237 times more likely under the alternative hypothesis than under the null hypothesis of no difference. However, the Bayes factor for the comparison between the stacked bar chart and the dot chart was 0.429, meaning that the data were about 2.3 times more likely under the null hypothesis of no difference than under the alternative hypothesis. These results indicate that the pie chart was more effective for estimating the posterior probabilities than both the dot chart and the bar chart, while the bar chart and the dot chart did not differ in effectiveness.
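A Bayes factor below 1 in favour of the alternative hypothesis is easier to read after taking its reciprocal. The Python sketch below applies this to the three comparisons reported above; the helper `describe_bf10` is hypothetical and only restates the reported values.

```python
def describe_bf10(bf10: float):
    """Return the favoured hypothesis and how many times more likely the data are under it."""
    if bf10 <= 0:
        raise ValueError("A Bayes factor must be positive")
    if bf10 >= 1:
        return "alternative", bf10
    # Below 1 the evidence favours the null; the reciprocal gives its strength.
    return "null", 1.0 / bf10


comparisons = [
    ("pie vs stacked bar", 22),
    ("pie vs dot", 237),
    ("stacked bar vs dot", 0.429),
]
for label, bf10 in comparisons:
    favoured, strength = describe_bf10(bf10)
    print(f"{label}: data {strength:.1f} times more likely under the {favoured} hypothesis")
```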

Table 3

Means and standard deviations of the squared difference in all conditions and Bayes factors for the conducted t-test comparisons of the Brier scores.

Comparison               Mean (SD)        Bayes factor
Pie condition            0.015 (0.027)
Stacked bar condition    0.031 (0.038)    22

Pie condition            0.015 (0.027)
Dot condition            0.041 (0.060)    237

Stacked bar condition    0.031 (0.038)
Dot condition            0.041 (0.060)    0.429

Note. The Bayes factor column contains Bayes factors in favour of the alternative hypothesis. Alternative hypothesis = mean difference is not 0. Decimals are indicated by dots.

Exploratory results

When the estimation curves are examined, it seems that in the Bayes factor condition the estimates for Bayes factors below 1 (posterior probabilities lower than 50%) were worse than the estimates for Bayes factors equal to and above 1. This suggested a misunderstanding of the procedure in the Bayes factor condition. For this reason the data for the Bayes factors equal to and above 1 were analysed again for all conditions to see if there was still a difference between the conditions. The estimation curves are displayed in Figure 5. A Bayesian t-test was conducted to compare the Brier scores of all conditions for Bayes factors ranging from 1 to 50. The results are displayed in Table 4. The earlier findings proved somewhat robust. The pie chart ($BF_{10} = 702$) and the stacked bar chart ($BF_{10} = 16$) were still more likely to result in less interpretational error than the Bayes factor alone. The dot chart was not likely to result in less interpretational error than the Bayes factor alone ($BF_{10} = 0.38$). The pie chart and the stacked bar chart both seemed more effective for estimating the posterior probability than the dot chart.

Figure 5. Estimation of posterior probabilities for the Bayes factor condition (grey), pie chart condition (red), bar chart condition (blue) and dot chart condition (green) for Bayes factors above 1. The BF condition does a better job in the BF > 1 spectrum than in the BF < 1 spectrum. Still, there is a notable difference between the pie and bar charts versus the BF condition.

Table 4

Means and standard deviations of the squared difference in all conditions and Bayes factors for the conducted t-test comparisons of the Brier scores for Bayes factors equal to and above 1.

Comparison               Mean (SD)        Bayes factor
BF condition             0.047 (0.074)
Pie condition            0.013 (0.024)    702

BF condition             0.047 (0.074)
Stacked bar condition    0.020 (0.028)    16

BF condition             0.047 (0.074)
Dot condition            0.033 (0.064)    0.38

Pie condition            0.013 (0.024)
Stacked bar condition    0.020 (0.028)    0.89

Pie condition            0.013 (0.024)
Dot condition            0.033 (0.064)    7

Stacked bar condition    0.020 (0.028)
Dot condition            0.033 (0.064)    0.061

Note. The Bayes factor column contains Bayes factors in favour of the alternative hypothesis. Alternative hypothesis = mean difference is not 0. Decimals are indicated by dots.

Brier scores could not be calculated for the Jeffreys questions, because the Brier score requires estimates that are expressed as probabilities. Therefore the questions about the strength of evidence were analysed using mean root squared difference scores (MRSD), according to the formula $MRSD = \frac{1}{N}\sum_{t=1}^{N}\sqrt{(E_v - A_v)^2}$. In this formula, $N$ is the number of participants, $t$ is the index of the participant, $E_v$ is the coded estimated value and $A_v$ is the coded actual value. The MRSD is thus the mean absolute difference between the coded estimate and the coded actual value. Weak was coded as 1, moderate as 2, strong as 3 and very strong as 4. The MRSD scores for each of the Bayes factors were calculated using this coding. The MRSD scores per Bayes factor were plotted against the true posterior probabilities, see Figure 6. The conditions were then compared with each other using a Bayesian t-test. The analysis showed a difference between the Bayes factor condition and the pie condition. The Bayes factor indicated that the data were 827 times more likely under the alternative hypothesis than under the null hypothesis of no difference. This means that participants in the Bayes factor condition were better at estimating the strength of evidence than participants in the pie condition. There was also a difference between the Bayes factor condition and the stacked bar condition. The Bayes factor indicated that the data were 1,581,923 times more likely under the alternative hypothesis than under the null hypothesis of no difference. This is evidence that the Bayes factor condition was better for estimating the strength of evidence than the stacked bar condition. Finally, there was a difference between the Bayes factor condition and the dot condition. The Bayes factor indicated that the data were 375 times more likely under the alternative hypothesis than under the null hypothesis of no difference. The Bayes factor condition was thus also better than the dot chart condition for estimating the strength of evidence. There was no difference between the visualizations themselves, see Table 5. These results indicate that the strength of evidence is interpreted better when a number is shown than when a visualization is shown. An interesting observation from Figure 6 is that the dot condition has peaks in the curve around posterior probabilities of 0.2 (20%) and 0.8 (80%).
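The MRSD calculation can be sketched in Python as follows. The coding of the answer options follows the text, while the answers themselves are made up for illustration and the function name `mrsd` is hypothetical.

```python
import math

# Coding of the answer options as described in the text.
CODES = {"Weak": 1, "Moderate": 2, "Strong": 3, "Very strong": 4}


def mrsd(estimated_labels, actual_label):
    """Mean root squared difference between coded estimates and the coded actual value."""
    if not estimated_labels:
        raise ValueError("At least one answer is required")
    actual = CODES[actual_label]
    diffs = [math.sqrt((CODES[label] - actual) ** 2) for label in estimated_labels]
    return sum(diffs) / len(diffs)


# Hypothetical answers from four participants when the actual category is "Moderate".
answers = ["Weak", "Moderate", "Moderate", "Strong"]
print(mrsd(answers, "Moderate"))  # 0.5
```

Taking the square root of each squared difference is the same as taking the absolute difference, so the MRSD counts how many categories the answers are off on average.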

Figure 6. Means of estimated strength of evidence in Bayes factor condition (grey), pie chart condition (red), stacked bar chart condition (blue) and dot condition (green) plotted against posterior probabilities (in %).

Table 5

Means, standard deviations and Bayes factors of the root squared difference of the strength of evidence in all comparisons.

Comparison               Mean (SD)        Bayes factor
BF condition             0.353 (0.250)
Pie condition            0.779 (0.490)    827

BF condition             0.353 (0.250)
Stacked bar condition    0.820 (0.330)    1,581,923

BF condition             0.353 (0.250)
Dot condition            0.711 (0.432)    375

Pie condition            0.779 (0.490)
Stacked bar condition    0.820 (0.330)    0.265

Pie condition            0.779 (0.490)
Dot condition            0.711 (0.432)    0.282

Stacked bar condition    0.820 (0.330)
Dot condition            0.711 (0.432)    0.423

Note. The Bayes factor column contains Bayes factors in favour of the alternative hypothesis. Alternative hypothesis = mean difference is not 0. Decimals are indicated by dots.

Finally, the data were split by education to see whether people with lower or higher education understood the Bayes factor and the visualizations differently. Brier scores with confidence intervals were computed for every level of education, see Figure 7. The doctoral level has no confidence interval because there was only 1 participant at this level. No analysis was performed on these Brier scores. It seems that interpretation of the Bayes factor is independent of education, but these results are further interpreted in the discussion.

Figure 7. Brier scores with confidence intervals for each level of education. It seems that all the Brier scores are the same, which would mean that the effectiveness of the visualizations is independent of education.

Discussion

The results indicate that showing any visualization results in less interpretational error than showing a Bayes factor alone. The Bayes factor condition differs from all other conditions in the amount of interpretational error. It can therefore be concluded that the visualizations lead to better estimates of the posterior probabilities than the Bayes factor alone. This is in line with the first hypothesis and with the effectiveness of visualization discussed in the introduction. The visualizations also differ in their effectiveness for this interpretation. The pie chart differs from the stacked bar chart condition and the dot condition, while the stacked bar chart condition and the dot condition do not differ from each other. The stacked bar chart and the dot chart led to more interpretational error than the pie chart over the whole spectrum of Bayes factors. This observation is in line with the second hypothesis that visualizations differ in their effectiveness. Both results hold for the whole spectrum of Bayes factors, but not for the Bayes factors above 1 specifically. In that spectrum the pie chart does not differ from the stacked bar chart, while it does differ from the Bayes factor condition and the dot chart condition. Also, the dot chart does not differ from the Bayes factor condition in this spectrum.

An interesting observation is the result of the exploratory Jeffreys questions. According to these results, the strength of evidence as quantified by Jeffreys (1961) is interpreted better when a Bayes factor alone is shown than when a visualization is shown. This contradicts the first two findings of this research. A possible explanation is that a number is a better indicator of strength of evidence because it provides the observer with a ratio and with a reference point for weak evidence, namely 1. A visualization does not have these properties, as it contains no numbers. A visualization, on the other hand, is a better rendering of a ratio, as it displays the evidence for two hypotheses. Another interesting observation is the peaks in the dot condition at posterior probabilities of 20% and 80%. These can possibly be explained by the dot visualization itself. Simple fractions are visualized with fewer dots than complex fractions. For example, a Bayes factor of 4, which gives the fraction 4/5, is visualized with four dots on the left side of the line and one on the right side of the line. A Bayes factor of 5.25, which gives the fraction 21/4, is visualized with 21 dots on the left side of the line and 4 dots on the right side of the line. Fewer dots might give the impression that the evidence is stronger than when more dots are displayed. This could be the cause of the observed peaks at those specific posterior probabilities, because these Bayes factors were visualized with fewer dots than the other ones. Furthermore, the exploratory analysis on education is flawed in some respects. The Brier score for a doctoral level of education is based on only 1 participant and the Brier score for a professional level of education is based on only 8 participants. With such small samples, the lines are not representative of those levels of education. Participants with some college as their level of education (N = 80) almost consistently had the lowest Brier scores over the whole spectrum of Bayes factors. This is not an expected observation, and an explanation for it is lacking.
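The relation between a Bayes factor and the number of dots can be sketched in Python. The sketch assumes that the dot chart reduces the Bayes factor to its simplest ratio of whole numbers, which matches the two examples above but is not confirmed as the exact procedure used to draw the charts; the function name `dot_counts` is hypothetical.

```python
from fractions import Fraction


def dot_counts(bf: str):
    """Dots on the left and right of the line for a Bayes factor given as a decimal string.

    Assumes the chart uses the simplest whole-number ratio equal to the Bayes factor.
    """
    ratio = Fraction(bf)  # Fraction("5.25") is reduced to 21/4
    return ratio.numerator, ratio.denominator


for bf in ("4", "5.25"):
    left, right = dot_counts(bf)
    print(f"BF = {bf}: {left} dots left, {right} right, {left + right} in total")
```

Under this assumption a Bayes factor of 4 needs 5 dots in total, while a Bayes factor of 5.25 needs 25.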

One limitation of this research is that the understanding of the questions in the Bayes factor condition does not appear to have been optimal. This follows from the estimation curves in Figure 2, where it seems that some of the participants in the Bayes factor condition estimated the probability that Mella was a better drug than Nedril. Because the estimation curves are based on mean estimates, high and low estimates may have offset each other and resulted in relatively high estimates for the Bayes factors below 1. The Bayes factors above 1 were understood correctly in the Bayes factor condition, hence the exploratory analysis for those Bayes factors. Leaving out the Bayes factors below 1 made a difference for the results. Showing a Bayes factor alone still resulted in more interpretational error than showing a pie chart or a stacked bar chart, but no longer than showing a dot chart. Still, the importance of the whole spectrum needs to be stressed. The whole range of Bayes factors needs to be interpreted properly if the ultimate goal is to improve interpretation of the Bayes factor.

It is important to further investigate the effect of visualizations on interpreting the Bayes factor. This research provides some evidence for the hypothesis that a visualization improves this interpretation. However, the contradictory suggestion that a Bayes factor alone results in a better interpretation of the strength of the evidence is important to notice. It seems that both elements, a number and a visualization, have distinct qualities that improve the overall interpretation. A visualization might improve the estimation of posterior probabilities, while a number might improve the estimation of the strength of the evidence. Since the visual curves of estimation still show slight deviations from the correct probability line, there is still room for improvement. Further research might focus on the effect of showing a visualization and a Bayes factor combined. The distinct qualities of both elements combined might strengthen the interpretation. However, it is also possible that this results in a misinterpretation of posterior probabilities and strength of evidence. The results of this research cannot be generalized to a broader audience than Mechanical Turk users. Only Mechanical Turk users were recruited, because this was the easiest way to assemble participants online. Mechanical Turk users might be used to filling in this kind of questionnaire and might go through it more quickly than the rest of the population. However, it is not hard to imagine the results replicating in other parts of the global population. Testing this generalization is a goal for further research on the effect of visualization on interpretation of the Bayes factor.

For the research question at hand, namely whether visualizations help improve the interpretation of the Bayes factor, it can be concluded that a pie chart, a stacked bar chart and a dot chart all improve the interpretation of posterior probabilities over the Bayes factor as a number. This means that showing a visualization is more effective than showing a Bayes factor as a number when the posterior probabilities are of great importance. Between the visualizations there are differences in the degree to which they improve the interpretation. The pie chart showed the most promise, since it was the best at improving the interpretation.

Research proposal

Introduction

The results of the previous experiment provide evidence that visualizations improve the interpretation of posterior probabilities over showing a Bayes factor alone. Contradictory to this, there also seems to be evidence that showing a Bayes factor alone improves the interpretation of the strength of evidence residing in the Bayes factor over showing a visualization. The goal of the previous study was to improve the interpretation of the Bayes factor, but there is evidence that there are still slight deviations between the estimated posterior probabilities and the true posterior probabilities. This means the visualizations did not succeed in perfecting the estimation of these posterior probabilities. However, the visualization can be improved by combining the Bayes factor and a visualization.

This study investigates the effect of showing a Bayes factor combined with a visualization on interpretation of the Bayes factor in two areas: the estimation of posterior probabilities and the estimation of the strength of evidence. Herman, Melançon & Marshall (2000) argue that it is often advantageous to reduce the number of visible elements in a graph to improve the clarity of the visualization. Following this argument, showing both the Bayes factor and the visualization should not improve the interpretation of the Bayes factor. However, their argument is based on the interpretation of graphs without a statistical meaning. The graph used in the experiment does have a statistical meaning, namely a posterior probability that can be derived from the graph itself. Tory & Möller (2004) argue that it is better to group related information for easy access, so it might be better to include the actual Bayes factor when showing the visualization. The observer can then see the statistical information at once and use the visualization to adjust their judgement based on that information. With this reasoning it is expected that showing a Bayes factor integrated in a visualization results in less interpretational error than either a Bayes factor or a visualization alone. It is also expected from previous research that interpretation is independent of education. It is important to make sure that interpretation is independent of education, because people from all educational levels have to understand the visualization and be able to interpret it in the same way.

Method

Participants

Because this research is based on the sample of the previous research, it is important that the novel participants are representative of this sample. This means they have to be American Mechanical Turk users, with English as their native language. The participants have to be older than 18 years to participate. Mechanical Turk will be a consistent way of collecting participants while rewarding them fairly for their efforts (Berinsky, Huber & Lenz, 2011).

Materials

For this research a Qualtrics survey is needed that can be distributed through Mechanical Turk. Two visualizations are necessary to test the hypothesis. One will be the pie chart (A) and the other one will be the pie chart combined with the Bayes factor (B), see Figure 8. The scenario used at the beginning of the experiment is the scenario in the Appendix. Question 1a will be shown to measure the estimation of posterior probability, see Appendix. Question 1b will be shown to measure the estimation of the strength of evidence. Interpretation of strength of evidence will be measured by rating the strength of evidence on Jeffreys scale (Jeffreys, 1961).

Figure 8. Visualizations that will be tested in the questionnaire; Pie chart (A) & combined pie chart (B).

Procedure

The experiment consists of three conditions: the Bayes factor condition, the pie chart condition and the combined condition. In the pie chart condition the pie chart (A) will be shown. In the combined condition the combined pie chart (B) will be shown. In the Bayes factor condition the BF statement from the Appendix will be shown. Participants will first answer a question about their level of education, so that a more robust analysis than the one in the previous experiment can be performed on these data. After this, participants will be randomly assigned to one of the three conditions. They will be shown the scenario in the Appendix. This scenario describes a hypothetical situation in which a disease on a planet has killed 10% of the population. Two drugs have been developed and researched. It is up to the participants to help the researchers interpret the data they found. To make sure participants read the scenario, the continue button will appear only after 30 seconds. Participants then proceed to answer 35 counterbalanced questions (1a) on different Bayes factors. A second question (1b) is displayed on the screen together with the first one. The participants have to select one of the answer options for this question before continuing to the next. After filling in the whole survey, participants will be thanked for their participation and reminded that for any questions or ethical objections they can send an email to the researcher or the ethical committee. Interpretation of the Bayes factor will be measured as the posterior probability. The operationalization of this is the probability, as a percentage, that the participant assigns to the hypothesis being true. Interpretation of strength of evidence will be measured by rating the strength of evidence on Jeffreys scale (Jeffreys, 1961).
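The assignment and ordering steps of the procedure can be sketched as follows. This is only an illustration: the text does not specify how the counterbalancing is implemented, so a simple random shuffle of the 35 Bayes factors from the Appendix is used as a stand-in. The condition labels and the function name are hypothetical.

```python
import random

# Hypothetical labels for the three conditions described in the procedure.
CONDITIONS = ["bayes_factor", "pie_chart", "combined"]

# The 35 Bayes factors listed in the Appendix.
BAYES_FACTORS = [
    1, 1.3, 1.6, 1.9, 2.2, 2.5, 2.8, 3.1, 3.4, 3.7, 4, 5, 7, 10, 15, 19, 25, 50,
    0.77, 0.63, 0.53, 0.46, 0.39, 0.36, 0.33, 0.29, 0.27, 0.25, 0.2, 0.14, 0.1,
    0.06, 0.05, 0.04, 0.02,
]


def build_session(rng):
    """Draw a condition and a question order for one participant.

    Assumption: a random shuffle stands in for the counterbalancing scheme.
    """
    condition = rng.choice(CONDITIONS)
    order = list(BAYES_FACTORS)  # copy, so the master list stays untouched
    rng.shuffle(order)
    return condition, order


condition, order = build_session(random.Random(2016))
print(condition, order[:5])
```

Passing in a seeded `random.Random` instance keeps each participant's session reproducible, which can help when checking the survey logic.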


Analysis plan

To start analysing the data, it is useful to construct visual curves for these data. In these curves, the mean estimations for every Bayes factor are plotted against the actual chances, see Figure 2. Posterior probabilities are calculated using the formula $P(H \mid D) = BF/(BF + 1)$. This gives a brief overview of what the data might look like. Bigger deviations from the correct estimation line indicate worse estimation. To analyse these data, one can calculate the Brier scores for every Bayes factor according to the formula $BS = \frac{1}{N} \sum_{t=1}^{N} (f_t - o_t)^2$. In this formula $N$ is the number of participants, $t$ is the number of the participant, $f_t$ is the estimated posterior probability and $o_t$ is the correct posterior probability corresponding to the Bayes factor in the question. A Bayesian t-test can then be performed on the Brier scores to determine if there is a difference in the Brier scores, since these Brier scores are indicative of the effectiveness of the Bayes factor representation in improving the interpretation of the Bayes factor. The Jeffreys questions will be coded for analysis. Weak will be coded as a 1, moderate as a 2, strong as a 3 and very strong as a 4. Mean root square difference (MRSD) scores will be calculated according to the formula $MRSD = \frac{1}{N} \sum_{t=1}^{N} \sqrt{(E_v - A_v)^2}$. These MRSD scores will then be compared using a Bayesian t-test to determine if there is a difference in these scores between conditions. Finally, the data will be split on education to perform a robust analysis on this. It is important that every level of education has at least 30 participants, so that each level is represented adequately. The experiment will remain online until every educational level has at least 30 participants in it. There will be a limit of 100 participants per educational level to make sure the experiment will not be too expensive. When this limit for an educational level is reached, participants from that educational level will not be able to participate anymore. Brier scores for every level of education are then calculated and a Bayesian ANOVA will be conducted on the data to investigate a main effect of education (Wetzels, Grasman & Wagenmakers, 2012).
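The posterior probability, Brier score and MRSD formulas from the analysis plan can be sketched in a few lines. This is a minimal illustration, not the analysis script of the study. It assumes that percentage answers are first rescaled to proportions and that $E_v$ and $A_v$ are the estimated and the reference evidence codes; the function names and example answers are hypothetical.

```python
from math import sqrt

# Coding of the Jeffreys answer options as given in the analysis plan.
JEFFREYS_CODES = {"weak": 1, "moderate": 2, "strong": 3, "very strong": 4}


def posterior_probability(bayes_factor):
    """Posterior probability BF / (BF + 1), which assumes equal prior odds."""
    return bayes_factor / (bayes_factor + 1.0)


def brier_score(estimates, correct):
    """Mean squared difference between estimates and the correct probability."""
    return sum((f - correct) ** 2 for f in estimates) / len(estimates)


def mrsd(estimated_codes, actual_code):
    """Mean root square difference, i.e. the mean absolute coding error."""
    return sum(sqrt((e - actual_code) ** 2) for e in estimated_codes) / len(estimated_codes)


# Hypothetical answers of three participants for a Bayes factor of 4.
percent_answers = [80, 75, 90]
estimates = [p / 100 for p in percent_answers]  # rescale percentages to proportions
correct = posterior_probability(4)              # 0.8

print(brier_score(estimates, correct))
print(mrsd([JEFFREYS_CODES[a] for a in ("strong", "moderate", "very strong")],
           JEFFREYS_CODES["strong"]))
```

Because the square root of a squared difference equals the absolute difference, the MRSD as defined here is the same quantity as a mean absolute error on the coded answers.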

Interpretation of results

It is expected that the Bayes factor condition will have a higher Brier score in the estimation of posterior probabilities than both the combined condition and the pie chart condition. It is also expected that the combined condition will have a lower Brier score in the estimation of posterior probabilities than the pie chart condition. If the Bayes factor condition does not turn out to have a higher Brier score than the pie chart condition, it can be concluded that the findings of the previous experiment are not robust. This would mean that the results for the pie chart visualization of the previous experiment cannot be generalized to other experiments and thus do not represent a phenomenon. If this expectation turns out to be true, it can be concluded that the results for the pie chart visualization of the previous experiment are robust and represent a phenomenon. On the other hand, if the Bayes factor condition does not turn out to have a higher Brier score than the combined condition, it can be concluded that integrating a Bayes factor in a visualization worsens the interpretation of the Bayes factor compared to a Bayes factor as a number. This would mean that the qualities of a visualization and a number together do not optimize the estimation of posterior probabilities. However, if this expectation turns out to be true, it can only be concluded that integrating a Bayes factor in a visualization improves the interpretation of the Bayes factor over the Bayes factor as a number. To conclude that integrating a Bayes factor in a visualization improves the interpretation of the Bayes factor, the combined condition needs to have a lower Brier score than both the Bayes factor condition and the pie chart condition.
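The decision rules for the Brier scores can be summarised in a small sketch. It is a simplification: the planned analysis compares conditions with Bayesian t-tests, whereas the sketch below only compares the mean Brier scores directly. The function name and dictionary keys are hypothetical.

```python
def interpret_brier(bf_score, pie_score, combined_score):
    """Map the ordering of mean Brier scores to the conclusions in the text.

    Lower Brier scores mean better estimation. Assumption: direct comparison
    of means stands in for the Bayesian t-tests of the analysis plan.
    """
    return {
        # Bayes factor alone worse than the pie chart: previous result holds.
        "previous_pie_result_robust": bf_score > pie_score,
        # Combined chart better than the Bayes factor as a number.
        "combined_better_than_number": bf_score > combined_score,
        # Combined chart better than the plain pie chart.
        "combined_better_than_pie": combined_score < pie_score,
        # Full hypothesis: combined chart beats both other conditions.
        "integration_improves_interpretation": (
            combined_score < bf_score and combined_score < pie_score
        ),
    }


# Hypothetical mean Brier scores for the three conditions.
print(interpret_brier(bf_score=0.10, pie_score=0.05, combined_score=0.03))
```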

For the Jeffreys questions, it is expected that the pie chart condition will have a higher MRSD score than both the combined and the Bayes factor condition. It is also expected that the combined condition will have a lower MRSD score than the Bayes factor condition. Only if these exact expectations turn out to be the outcomes can it be concluded that integrating the Bayes factor in a visualization results in a better estimation of the strength of evidence. If the pie chart condition has a lower MRSD score than the Bayes factor condition, it can be concluded that the results of the previous experiment are not robust. When the pie chart condition has a higher MRSD score than the Bayes factor condition but a lower one than the combined condition, it can be concluded that integrating the Bayes factor in a visualization worsens the interpretation of the strength of evidence compared to both the Bayes factor as a number and a visualization. If the combined condition has a higher MRSD score than the Bayes factor condition but a lower one than the pie chart condition, it can only be concluded that integrating a Bayes factor in a visualization improves the interpretation of the strength of evidence over a pie chart. To conclude that combining a Bayes factor and a visualization results in a better estimation of posterior probabilities and a better estimation of strength of evidence, the exact expectations have to be the outcomes. If the expectations do not turn out to be true, the hypothesis cannot be confirmed.


Literature

Berinsky, A. J., Huber, G. A., & Lenz, G. S. (2011). Using Mechanical Turk as a subject recruitment tool for experimental research. Submitted for review.

Brooks, S. P. (2003). Bayesian Computation: a Statistical Revolution. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 361, 1813, 2681-2697.

Bland, J. M., & Altman, D. G. (2000). The Odds Ratio. Bmj, 320, 7247, 1468.

Chen, H., Cohen, P., & Chen, S. (2010). How Big is a Big Odds Ratio? Interpreting the Magnitudes of Odds Ratios in Epidemiological Studies. Communications in Statistics—Simulation and Computation®, 39, 4, 860-864.

Chen, C. H., Härdle, W. K., & Unwin, A. (Eds.). (2007). Handbook of Data Visualization. Springer Science & Business Media. (Chapter 16: Visualisation in Bayesian Data Analysis).

Cole, W. G. (1989, March). Understanding Bayesian reasoning via Graphical Displays. In ACM SIGCHI Bulletin, 20, 11, 381-386.

Cole, W. G., & Davidson, J. E. (1989, November). Graphic Representation Can Lead To Fast and Accurate Bayesian Reasoning. In Proceedings/the... Annual Symposium on Computer Application [sic] in Medical Care. Symposium on Computer Applications in Medical Care (pp. 227-231). American Medical Informatics Association.

Colquhoun, D. (2014). An Investigation of the False Discovery Rate and the Misinterpretation of P-Values. Royal Society open science, 1, 3, 140-216.

Davies, H. T. O., Crombie, I. K., & Tavakoli, M. (1998). When can Odds Ratios Mislead?. Bmj, 316, 7136, 989-991.

Dienes, Z. (2011). Bayesian versus Orthodox Statistics: Which side are you on?. Perspectives on Psychological Science, 6, 3, 274-290.

Ferro, C. A. (2007). Comparing probabilistic forecasting systems with the Brier score. Weather and Forecasting, 22, 5, 1076-1088.

Grimes, D. A., & Schulz, K. F. (2008). Making Sense of Odds and Odds Ratios. Obstetrics & Gynecology, 111, 2, 423-426.

Herman, I., Melançon, G., & Marshall, M. S. (2000). Graph Visualization and Navigation in Information Visualization: A Survey. Visualization and Computer Graphics, IEEE Transactions on, 6, 1, 24-4

Hoekstra, R., Morey, R. D., Rouder, J. N., & Wagenmakers, E. J. (2014). Robust Misinterpretation of Confidence Intervals. Psychonomic Bulletin & Review, 21, 5, 1157-1164.

Jarosz, A. F., & Wiley, J. (2014). What are the Odds? A Practical Guide to Computing and Reporting Bayes Factors. The Journal of Problem Solving, 7, 1, 2.

Jeffreys, H. (1998). The theory of probability. OUP Oxford.

Lin, W. Y., & Lee, W. C. (2012). Presenting the Uncertainties of Odds Ratios Using Empirical-Bayes Prediction Intervals. PloS one, 7, 2.

Masson, M. E. (2011). A Tutorial on a Practical Bayesian Alternative to Null-Hypothesis Significance Testing. Behavior research methods, 43, 3, 679-690.

Micallef, L., Dragicevic, P., & Fekete, J. D. (2012). Assessing the Effect of Visualizations on Bayesian Reasoning through Crowdsourcing. Visualization and Computer Graphics, IEEE Transactions on, 18, 12, 2536-2545.

Morey, R. D., & Rouder, J. N. (2015). BayesFactor: Computation of Bayes Factors for Common Designs. R package version 0.9.12-2. https://CRAN.R-project.org/package=BayesFactor

Nadav-Greenberg, L., Joslyn, S. L., & Taing, M. U. (2008). The effect of uncertainty visualizations on decision making in weather forecasting. Journal of Cognitive Engineering and Decision Making, 2, 1, 24-47.

Shorten, A., & Shorten, B. (2015). What is an Odds Ratio? What does it mean?. Evidence Based Nursing, ebnurs-2015.


Siegfried, T. (2010). Odds are, it's wrong: Science Fails to Face the Shortcomings of Statistics. Science News, 177, 7, 26-29.

Sterne, J. A., & Smith, G. D. (2001). Sifting the evidence—what's wrong with significance tests?. Physical Therapy, 81, 8, 1464-1469.

Tory, M., & Möller, T. (2004). Human Factors in Visualization Research. Visualization and Computer Graphics, IEEE Transactions on, 10, 1, 72-84.

Van Wijk, J. J. (2005). The Value of Visualization. In Visualization, 2005. 5, 79-86. IEEE.

Wagenmakers, E. J. (2007). A Practical Solution to the Pervasive Problems of P-values. Psychonomic Bulletin & Review, 14, 5, 779-804.

Wagenmakers, E. J., Lodewyckx, T., Kuriyal, H., & Grasman, R. (2010). Bayesian Hypothesis Testing for Psychologists: A Tutorial on the Savage–Dickey Method. Cognitive psychology, 60, 3, 158-189.

Wetzels, R., Grasman, R. P., & Wagenmakers, E. J. (2012). A Default Bayesian Hypothesis Test for ANOVA Designs. The American Statistician, 66, 2, 104-111.

Wu, H. K., Krajcik, J. S., & Soloway, E. (2001). Promoting understanding of chemical representations: Students' use of a visualization tool in the classroom. Journal of research in science teaching, 38, 7, 821-842.

Zellner, A. (1978). Jeffreys-Bayes Posterior Odds Ratio and the Akaike Information Criterion for Discriminating between Models. Economics Letters, 1, 4, 337-342.


Appendix

Bayes factors tested:

>=1: 1 1.3 1.6 1.9 2.2 2.5 2.8 3.1 3.4 3.7 4 5 7 10 15 19 25 50

<1: .77 .63 .53 .46 .39 .36 .33 .29 .27 .25 .2 .14 .1 .06 .05 .04 .02

Scenario:

On the planet Boaconic, the disease Effrafax has killed 10% of the population. The planet's top medical researchers are frantically working to develop drugs that can fight the disease. Within a couple of months, the researchers produce two different drugs, Nedril and Mella. But which drug is better? Based on existing medical knowledge there is nothing to suggest that one drug is better than the other. To find the best drug, the researchers conduct an experiment where they test both drugs on a group of infected Boaconiconians.

Bayes factor condition statement of BF:

The data show that Nedril is (Bayes factor) times more likely to be a better drug than Mella.

Question 1a:

What is the probability that Nedril is a better drug than Mella? (as a percentage, on a scale from 0-100).

Question 1b:

How strong do you think the evidence is in favour of Nedril being a better drug than Mella?

Answer options:

• Weak • Moderate • Strong • Very strong
