
Facilitating Bayes Factor Interpretation Using Visualizations

Daniël Bartolomé Rojas
University of Amsterdam
d.bartolome.r@gmail.com

Abstract

The Bayes factor plays a pivotal role in Bayesian statistics and hypothesis testing. It has numerous benefits compared to other methodologies such as null-hypothesis significance testing. A downside of Bayes factors, however, is their limited interpretability. Since visualizations have proven to be excellent tools for representing numerical information, this study investigates whether visualizations can facilitate Bayes factor interpretation. A total of 453 participants estimated the posterior probabilities corresponding to given Bayes factors, which were presented either as visualizations or in numerical form. The results show that visualizations substantially improve Bayes factor interpretability. Although all tested visualizations improve interpretation more than numerical Bayes factors do, the pie-chart outperforms every other visualization.

Data analysis in social and behavioural research is largely driven by two types of statistical frameworks: frequentist statistics and Bayesian statistics. In psychological research, the frequentist framework is arguably the most commonly used of the two. P-value null-hypothesis significance testing (NHST) – the frequentist method for hypothesis testing – is common practice for social scientists. Although it is widely used, NHST does have its flaws.

One flaw is the commonly used significance level. This value is used to determine whether or not the null-hypothesis holds, but it is a completely arbitrary cut-off value. This significance level (i.e., α-level) is commonly set to 0.05. NHST therefore implies that data resulting in p-values of 0.051 and 0.049 are very different – whereas in fact, they are not. Added to that, this cut-off value will always be reached as the sample size grows sufficiently large (Wagenmakers, 2007). Lastly, NHST only allows for limited types of conclusions: we can either reject or not reject the null-hypothesis. It is therefore impossible to gather evidence in favour of the null, and the researcher is restricted to finding evidence against it. This is unfortunate, because usually the null is the more precise hypothesis of the two (Gallistel, 2009).

NHST weaknesses are often overlooked because the method is misunderstood by the scientists who use it (Nickerson, 2000). The methodological shortcomings and common misunderstanding of NHST have led the journal Basic and Applied Social Psychology to ban NHST as an invalid method; researchers aiming to publish in this journal are required to apply different inferential statistical methodologies (Trafimow & Marks, 2015).

The calculation of a p-value is not the only method for hypothesis testing. The Bayesian statistical framework provides an alternative: the Bayes factor. The Bayes factor quantifies the relative evidence between the null and the alternative hypothesis. More specifically, it is the probability of the data under H0 divided by the probability of the data under H1. Evidence is therefore represented in a continuous manner. NHST, on the other hand, represents its evidence in a dichotomous manner, as the result is either significant or non-significant. Consequently, Bayesian statistics enable the researcher to gather evidence in favour of both the null and the alternative hypothesis (Wagenmakers, 2007).
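Written out, with D denoting the observed data, this definition and its reciprocal (the BF10 form used later in this article) read:

\[
\mathrm{BF}_{01} = \frac{p(D \mid H_0)}{p(D \mid H_1)}, \qquad
\mathrm{BF}_{10} = \frac{1}{\mathrm{BF}_{01}} = \frac{p(D \mid H_1)}{p(D \mid H_0)}
\]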

Although the Bayes factor provides solutions to the shortcomings of p-value NHST, it has some limitations of its own (Dienes, 2011). Sackett, Deeks, and Altman (1996) found that Bayes factors are prone to misinterpretation. For example, Davies, Crombie, and Tavakoli (1998) found that a common lack of familiarity with odds ratios results in an unintuitive feel for the size of the difference. More specifically, effect sizes are underestimated when the Bayes factor is low, whereas effect sizes are overestimated when the Bayes factor is high. Furthermore, Bayesian statistics require more mathematical knowledge than most social and behavioural scientists usually possess (Gelman, Carlin, Stern, & Rubin, 2014). To make it easier to work with Bayes factors, it would be desirable to improve their interpretation.

Several suggestions have been made to improve the interpretation of probabilities and odds ratios. Brase (2008) studied the effect of different formulations of textual statistical problems on their interpretation. Formulating statistical problems as 'natural frequencies' seems to have the best effect on interpretation: probability information (e.g., 95%) is omitted and the information is conveyed using frequencies (e.g., 95 out of 100).

Additionally, visualizations are a common solution to facilitate interpretation of numerical information. Visualizations reveal patterns which may go undetected otherwise (Lipkus & Hollands, 1999). Graphics are excellent representations of proportions (Spiegelhalter, Pearson, & Short, 2011; Lipkus et al., 1999). More specifically, pie-charts and stacked bar-charts are most useful for proportion representation (Lipkus et al., 1999; Becker, Kohavi, & Sommerfield, 2001; see Figure 1). For example, pie- and bar-charts can be used to depict the evidential proportion of the two hypotheses. Thus, visualizations could effectively facilitate Bayes factor interpretation.

Therefore, the aim of this study is to determine whether visualizations improve the interpretation of the Bayes factor. Furthermore, the visualizations will be compared to assess which of them is most capable of representing Bayes factors. Based on the literature discussed above, it is expected that visualizations aid Bayes factor interpretation.

Methods

Participants

Participants were recruited through Amazon's Mechanical Turk (MTurk), a service for online research participant recruitment, and were granted a monetary reward of $1.50 for survey completion. A total of 453 participants completed the survey. MTurk is characterized by its very diverse user pool (Casler, Bickel, & Hackett, 2013). This was reflected in our participant pool: there was high dispersion around the mean age of 36.6 (SD = 10.8), and gender was almost equally distributed (54% male). This was beneficial for the study, since effective visualizations should be interpretable not only for people who have a background in statistics.

Materials

The experiment consisted of a survey with 35 items per condition, in which participants had to estimate the posterior probability corresponding to a given Bayes factor. Assuming equal prior probabilities, the posterior probability can be calculated from the Bayes factor (BF) as follows:

\[
\text{Posterior probability} = \frac{\mathrm{BF}}{\mathrm{BF} + 1}
\]
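As a minimal sketch, this conversion is a one-liner in R (the function name below is ours, for illustration only):

```r
# Posterior probability of a hypothesis given its Bayes factor,
# assuming equal prior probabilities for both hypotheses.
bf_to_posterior <- function(bf) {
  bf / (bf + 1)
}

bf_to_posterior(5)   # 0.833..., matching the BF10 = 5 example in Figure 1
```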

Each Bayes factor was represented either in its numerical form or as a visualization. A pilot study was conducted to select the best performing visualizations. As a result, the pie-chart, stacked bar-chart and dot-chart were used in this experiment (see Figure 1). The numerical form, together with the three types of visualizations, renders a total of four conditions. Each participant was randomly assigned to one of these four conditions.

Procedure

The survey presented the participant with a hypothetical scenario in which medical researchers examine which of two medicines is the most effective for a given disease. The medicines were given the fictional names Nedril and Mella. The results of this hypothetical experiment yield a Bayes factor indicating which type of medicine the data support more. Neither the medical researchers nor the participants had prior knowledge on the efficacy of the medicines, which makes the effectiveness of both types of medicine equally likely a priori. The participant was asked for the probability that one particular medicine is the better one, given a certain Bayes factor (either visualized or in numerical form). The correct answer would be the corresponding posterior probability. Refer to the OSF archive for more detailed information on item structure and formulation.1

Estimated posterior probabilities were compared to the true posterior probabilities by means of the mean absolute error (MAE). The MAE measures the average absolute difference between the estimated and true values. Low MAE values therefore indicate that the visualization or numerical form yields more accurate Bayes factor estimations, hence facilitating Bayes factor interpretation. Condition performance, as measured by the MAE, was compared using Bayesian t-tests.
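A minimal sketch of this accuracy measure in R, assuming two numeric vectors of estimated and true posterior probabilities:

```r
# Mean absolute error between estimated and true posterior probabilities.
mae <- function(estimated, true) {
  mean(abs(estimated - true))
}

mae(c(0.80, 0.55, 0.95), c(0.83, 0.50, 0.91))   # example: 0.04
```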

Figure 1: Stacked bar-chart, dot-chart and pie-chart for BF10 = 5 in a hypothetical medicine efficacy study. For this example, it is five times more likely that Mella is a better drug than Nedril.


Furthermore, participants were asked to rate the strength of evidence, based on Jeffreys' scale of interpretation (Jeffreys, 1961). Jeffreys proposed a system for the categorical rating of Bayes factor strength of evidence. The scales of evidence are reported in Table 1. These data have been collected for exploratory purposes, and will be used to assess condition performance regarding Jeffreys' scale estimation. This indicates whether the visualization elicits the correct interpretation of evidential magnitude.

BF10        Evidence against H0
----------  -----------------------------------
< 1         Negative
1 – 3.2     Not worth more than a bare mention
3.2 – 10    Moderate
10 – 32     Strong
32 – 100    Very strong
> 100       Decisive

Table 1: Bayes factor scales of evidence as proposed by Jeffreys (1961).
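As an illustrative sketch (the helper name and the use of base R's cut() are ours, not from the paper), the mapping in Table 1 can be coded as:

```r
# Map Bayes factors (BF10) to Jeffreys' (1961) categories from Table 1.
jeffreys_category <- function(bf) {
  cut(bf,
      breaks = c(0, 1, 3.2, 10, 32, 100, Inf),
      labels = c("Negative",
                 "Not worth more than a bare mention",
                 "Moderate",
                 "Strong",
                 "Very strong",
                 "Decisive"),
      right  = FALSE)
}

jeffreys_category(7)   # "Moderate", as in the item example in the exploratory analysis
```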

Figure 2: True posterior probabilities plotted against mean posterior probability estimates with 95% confidence intervals.


Results

Of all 453 participants who completed the survey, 370 were used for data analysis. Multiple exclusion criteria had been set up before data collection.2 First, participants were excluded from the analysis if they were not from the USA (21) or if English was not their first language (10). These criteria were enforced to ensure everyone understood the questions. Participants younger than 18 (1) were excluded because they might not comprehend the concept of probabilities. Also, outliers were removed by excluding scores outside the 2.5 SD boundary on completion time and posterior probability estimates (9); participants with excessive completion times could, for example, have used external aids to estimate posterior probabilities. Furthermore, because the survey contained an example item with BF = 1, participants were also excluded if they misjudged BF = 1 by ±25% (42), as this suggests the participant did not pay enough attention.
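A sketch of these criteria in base R; the data frame raw_data, its column names, and the exact reading of the ±25% rule are assumptions for illustration, not taken from the paper:

```r
# Flag values within k standard deviations of the mean (outlier rule).
within_sd <- function(x, k = 2.5) abs(x - mean(x)) <= k * sd(x)

keep <- with(raw_data,
  country == "USA" &
  first_language == "English" &
  age >= 18 &
  within_sd(completion_time) &
  within_sd(mean_posterior_estimate) &
  abs(bf1_item_estimate - 0.5) <= 0.25)  # BF = 1 example item; true posterior is 0.5

cleaned <- raw_data[keep, ]
```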

The results of the experiment are reported in Figure 2. These graphs depict the mean posterior probability estimates with 95% confidence intervals, plotted against the true posterior probabilities. The diagonal is a hypothetical line where estimates are equal to the true values.

Notable is the difference in variance per condition. The visualizations generally show less variance than the numerical condition. Mean standard deviation values for the pie-chart (SD = 0.043), bar-chart (SD = 0.103) and dot-chart (SD = 0.121) are considerably lower than for the numerical representation (SD = 0.211), as can be seen in Figure 2 as well.

Figure 3: Mean absolute error values with 95% confidence interval for each condition.

For each condition, the MAE of the estimates relative to the true values has been calculated. The results are shown in Figure 3. The condition MAE values are compared with Bayesian t-tests using the BayesFactor package in R (Morey & Rouder, 2015). The default Cauchy prior width of √2/2 has been used (Rouder et al., 2009). The alternative hypothesis (H1) states that there is a difference between the MAE values, whereas the null hypothesis (H0) states that there is not.
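A minimal sketch of such a comparison with the BayesFactor package; the per-participant MAE vectors below are simulated stand-ins, not the study's data:

```r
library(BayesFactor)

# Simulated per-participant MAE values for two conditions (illustration only).
set.seed(1)
mae_pie       <- rnorm(90, mean = 0.04, sd = 0.02)
mae_numerical <- rnorm(90, mean = 0.20, sd = 0.08)

# Independent-samples Bayesian t-test with the default Cauchy prior
# width rscale = sqrt(2)/2 ("medium").
ttestBF(x = mae_pie, y = mae_numerical, rscale = sqrt(2) / 2)
```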

            Pie           Bar           Dot
Bar         1.63 × 10^7
Dot         1.90 × 10^13  1.03
Numerical   6.99 × 10^43  3.13 × 10^25  1.22 × 10^23

Table 2: Bayes factors (BF10) for the pairwise Bayesian t-tests between conditions.


It was found that the pie-chart has lower MAE values than any other condition. Compared to the numerical condition, the Bayes factor indicated that the data are BF10 = 6.99 × 10^43 times more likely under H1 than under the null-hypothesis of no difference. Also, the bar-chart and dot-chart conditions perform better than the numerical condition; the data are BF10 = 3.13 × 10^25 and BF10 = 1.22 × 10^23 times more likely under H1 than under H0, respectively. These results are in line with the hypothesis that visualizations improve Bayes factor interpretation. Furthermore, MAE values between visualizations have been compared. The pie-chart outperforms the bar-chart and dot-chart; the data are BF10 = 1.63 × 10^7 and BF10 = 1.90 × 10^13 times more likely under H1 than under H0, respectively. The Bayesian t-test results are reported in Table 2 as well.

Exploratory Analysis

The main focus of this article is the assessment of condition performance on Bayes factor interpretation by means of posterior probability estimation. There are, however, some additional questions which were analysed for exploratory purposes.

First, consider the odd shape of the posterior probability estimates in the numerical condition. Except for the numerical condition, posterior probability estimates in all conditions converge relatively closely around the correct answers. However, in the left half of Figure 2 panel A, the estimates seem to flatten at estimated posterior probabilities of 0.5. For these items, the participants were required to estimate the posterior probability of H0 when BF10 was given. To clarify these findings, the response distribution for the first two items has been plotted (see Figure 4). The response distribution displays the frequency of each posterior probability estimate. These plots reveal that participants are almost equally divided between high and low posterior probability estimates.

The response distribution on the first items in the numerical condition could imply that numerical Bayes factors elicit confusion. More specifically, since this confusion is most prominent in the left half of the plot, it could indicate that it is unclear which hypothesis the Bayes factor is actually supporting when the reverse probability is asked. This could either be the result of ambiguity in the numerical Bayes factor or ambiguity in item formulation.


Second, the Jeffreys' score data were analysed. The Bayes factor of each item was converted to the corresponding Jeffreys' score. That is, if item 1 corresponded to a Bayes factor of 7, it was converted to the Jeffreys' score 'moderate'. If the participant estimated the evidence to be 'moderate', this was regarded as a correct estimate. The results are reported in Table 3. The numerical condition yielded the highest proportion of correctly estimated Jeffreys' scores. This is surprising, given the fact that the numerical condition performed the worst at estimating posterior probabilities. Apparently, numerical Bayes factors convey the strength of evidence according to Jeffreys' scale most accurately.

Pie    Bar    Dot    Numerical
23%    32%    28%    47%

Table 3: Proportion of correctly estimated Jeffreys' scores per condition.

Additionally, Jeffreys' scores had been converted to an ordinal scale. These have been graphed to analyse the mean Jeffreys' scores at item level (Figure 5). Since the first and last items gradually contain the strongest evidence, the ideal curve would be that of a parabola. It is evident that, even at item level, the numerical Bayes factor (panel A) performs best in Jeffreys' score estimation.

Figure 5: Mean Jeffreys' scores per item. 'Very strong evidence' is coded 4 and 'weak evidence' is coded 1.

Discussion

In this study we investigated whether visualizations could effectively improve Bayes factor interpretation. Decisive evidence was found that visualizations were responsible for a considerable increase in Bayes factor estimation accuracy. Participants estimated the evidential proportion of a Bayes factor much more accurately when the Bayes factor was visualized than when it was displayed as a number. Added to that, estimate variability was much lower when visualizations were used. This adds to the effectiveness of visualizations for Bayes factor representation, since it suggests accurate estimates are achieved consistently.

Furthermore, decisive evidence was found that the pie-chart elicited more accurate Bayes factor interpretation than all other visualizations. This leads to the conclusion that the pie-chart is the most capable representation of the Bayes factor. When comparing the bar-chart and dot-chart, no evidence for either hypothesis was found.

Exploratory analysis revealed that poor performance in the numerical condition could be attributed to ambiguity in either the Bayes factor or the item formulation. The latter, however, seems unlikely for multiple reasons. Firstly, it is stated very clearly which hypothesis the Bayes factor is actually supporting. Added to that, item formulation was equal across the conditions. Furthermore, it was found that the numerical Bayes factor performs best in eliciting the correct interpretation of the strength of evidence, which suggests the participants were fully – and correctly – aware of the magnitude of the evidence. Thus, if participants were still confusing which hypothesis the evidence supports, this confusion would be inherent to the numerical Bayes factor, and not due to item formulation. Given that estimate accuracy dropped due to this ambiguity, the performance difference between visualizations and numerical Bayes factors could be smaller than the one found here. However, because this ambiguity can probably be attributed to inherent Bayes factor characteristics, it would arguably not be enough to discard the results, as visualizations do not seem to be affected by it.

The results from the exploratory analyses lead to some suggestions for future research. It would, for example, be interesting to see why numerical Bayes factors tend to be effective in conveying the correct interpretation of Jeffreys' scale strength of evidence. This knowledge could be used for the alteration or creation of visualizations, so that Jeffreys' scale interpretation would improve for visualizations as well.

In short, the results are in line with earlier research stating that visualizations are useful tools for the representation of proportion information. Since Bayes factors are inherently poor at conveying the size of the difference, visualizations can effectively be used to improve this. Visualizations are therefore a solution for one of the disadvantages of Bayes factors.

References

Becker, B., Kohavi, R., & Sommerfield, D. (2001). Visualizing the simple Bayesian classifier. Information visualization in data mining and knowledge discovery, 18, 237-249.

Brase, G. L. (2008). Frequency interpretation of ambiguous statistical information facilitates Bayesian reasoning. Psychonomic Bulletin & Review, 15, 284-289.

Casler, K., Bickel, L., & Hackett, E. (2013). Separate but equal? A comparison of participants and data gathered via Amazon's MTurk, social media, and face-to-face behavioral testing. Computers in Human Behavior, 29, 2156-2160.

Davies, H. T. O., Crombie, I. K., & Tavakoli, M. (1998). When can odds ratios mislead? British Medical Journal, 316, 989-991.

Dienes, Z. (2011). Bayesian versus orthodox statistics: Which side are you on? Perspectives on Psychological Science, 6, 274-290.

Gallistel, C. R. (2009). The importance of proving the null. Psychological Review, 116, 439.

Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2014). Bayesian data analysis (Vol. 2). Boca Raton, FL, USA: Chapman & Hall/CRC.

Jeffreys, H. (1961). Theory of probability. Oxford, UK: Oxford University Press.


Lipkus, I. M., & Hollands, J. G. (1999). The visual communication of risk. Journal of the National Cancer Institute. Monographs, 25, 149-163.

Morey, R. D., & Rouder, J. N. (2015). BayesFactor: Computation of Bayes factors for common designs. R package version 0.9.12-2.

Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241.

Rouder, J. N., Speckman, P. L., Sun, D., Morey, R. D., & Iverson, G. (2009). Bayesian t tests for accepting and rejecting the null hypothesis. Psychonomic Bulletin & Review, 16, 225-237.

Sackett, D. L., Deeks, J. J., & Altman, D. G. (1996). Down with odds ratios! Evidence Based Medicine, 1, 164-166.

Spiegelhalter, D., Pearson, M., & Short, I. (2011). Visualizing uncertainty about the future. Science, 333, 1393-1400.

Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, 37, 1-2.

Wagenmakers, E. J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779-804.
