
Bayesian Mixture Modeling of Significant P Values: A Meta-Analytic Method to Estimate the Degree of Contamination from H0



Eric-Jan Wagenmakers, Quentin Frederik Gronau, Monique Duizer, and Marjan Bakker

University of Amsterdam

Correspondence concerning this article should be addressed to: Eric-Jan Wagenmakers, University of Amsterdam, Department of Psychology, Weesperplein 4, 1018 XA Amsterdam, The Netherlands. E-mail may be sent to EJ.Wagenmakers@gmail.com.

Abstract

Most research focuses nearly exclusively on significant p values to decide whether an effect exists or not. However, in the presence of publication bias and questionable research practices, the conclusiveness of significant reported results remains unclear. We propose a Bayesian mixture model for estimating the false discovery rate in specific fields as an alternative to currently popular methods such as p-curve analysis (Simonsohn, Nelson, & Simmons, 2014). The distribution of significant p values under the alternative hypothesis is modeled using a Dirichlet process mixture. This model allows one to assess the proportion of true effects in a specific field and to estimate how likely it is that a particular study finding stems from the null hypothesis. We apply the model to a set of significant t-test p values reported in Wetzels et al. (2011) and to a set of p values from priming studies. The model code of the JAGS implementation is provided in the Appendix.

Keywords: publication bias, p-hacking, p-curve, Bayesian mixture model, Dirichlet process

Scientific psychology is experiencing a crisis (e.g., Pashler & Wagenmakers, 2012). Cases of fraud and a lack of replicability have severely damaged the image of the field and have harmed the confidence in existing findings. Causes of this lack of replicability include publication bias, that is, the tendency to publish only significant results while non-significant results are discarded (the file-drawer problem; Rosenthal, 1979), and questionable research practices. That these phenomena are not far-fetched was already noticed by Sterling (1959), who pointed out that in the field of psychology the majority of published findings report significant statistical tests. More recently, John, Loewenstein, and Prelec (2012) showed that a high percentage of psychologists admit to having engaged in questionable research practices. These practices comprise, for instance, hypothesizing after the results are known (HARKing; Kerr, 1998) and p-hacking (Simonsohn, Nelson, & Simmons, 2014). P-hacking increases the odds of obtaining significant results that can be published, while the non-significant results are discarded (file-drawer bias). It is well known, for instance, that even if the null hypothesis (H0) is exactly true, an optional-stopping strategy ensures a significant result (e.g., Wagenmakers, 2007). This raises the question of how to quantify the evidence for the presence of true effects when presented with significant p values, since questionable research practices (and the probabilistic nature of p values) render them inconclusive.

When considering a single study, it seems nearly impossible to assess whether a true effect is present. However, in many fields, several studies exist that report the presence of a specific effect. This does not rule out the possibility that all of the reported significant p values originate from H0, by chance or because of questionable research practices, but given a set of significant p values, we can try to estimate the extent to which the findings are contaminated.

Simonsohn et al. (2014) introduced a method called p-curve that allows one to assess whether a set of reported significant p values is due to selective reporting or chance. This method exploits the fact that p values that originate from H0 follow a uniform distribution, which holds for every test statistic with a continuous sampling distribution (Becker, 1991). The distribution of p values that originate from the alternative hypothesis (H1) is right-skewed, i.e., if a true effect exists, highly significant results are expected to occur more frequently. Since most published articles report significant p values, i.e., p ≤ .05, and the main purpose of the p-curve is to detect false positives, the interval of interest for the p-curve is p ∈ (0, .05). Studies that investigate true effects are expected to produce right-skewed p-curves; when no effect is present, a uniform p-curve is expected. The most worrisome case arises when a left-skewed curve is obtained: this indicates extensive p-hacking. Simonsohn et al. (2014) proposed a χ²-test for right-skewness to decide whether a set of p values contains evidential value, i.e., whether true effects are present.
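To convey the flavor of such a test, here is a minimal R sketch based on the fact stated above: under H0, a significant p value is uniform on (0, .05), so pp = p/.05 is uniform on (0, 1) and Fisher's method applies. The function name pCurveChisq is ours, and this is a simplified stand-in rather than the exact procedure of Simonsohn et al. (2014).

# Sketch: chi-square test for right-skewness of a set of significant p values
pCurveChisq <- function(p) {
  p <- p[p < .05]                 # only significant p values enter the p-curve
  pp <- p / .05                   # under H0, pp is uniform on (0, 1)
  chisq <- -2 * sum(log(pp))      # Fisher's method; large values signal right skew
  pchisq(chisq, df = 2 * length(p), lower.tail = FALSE)
}
# pCurveChisq(c(.001, .004, .012, .03))  # small result suggests evidential value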

The p-curve appears to be a viable tool for detecting p-hacking and for obtaining an impression of the presence of true effects. However, it has some limitations: First, it will often fail to detect a lack of evidential value; a test for right-skewness might be significant even though many p values originate from H0. Second, it provides neither an estimate of the proportion of studies that originate from H0 nor an estimate of the probability that a particular study originates from H0.

Bayesian Mixture Model for Significant P Values

It is expected that in most cases the p-curve will be a mixture of a uniform and a right-skewed distribution. This naturally leads to modeling the observed p values as a mixture of two distributions, one corresponding to H0, i.e., a uniform distribution, and one corresponding to H1:

p(p_i) = φ f_H0(p_i) + (1 − φ) f_H1(p_i),

where φ denotes a mixing parameter that can be regarded as an estimate of the proportion of studies originating from H0. One approach is to model the p values using beta distributions (e.g., Tang, Ghosal, & Roy, 2007); however, Tamhane and Shi (2009) concluded that the beta model is not recommended. Consequently, we do not model the p values using beta distributions, but instead model the probit-transformed p values using normal distributions, an approach that has been used, e.g., by Tamhane and Shi (2009) and Efron (2012). The uniform distribution of p values under H0 maps onto a standard normal distribution for the probit-transformed p values.
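As an illustration, the following R sketch evaluates this two-component mixture on the probit scale while respecting the truncation of significant p values at .05; the single normal H1 component (mu = -3, sd = 1) is an illustrative placeholder for the Dirichlet process mixture developed below.

# Sketch: density of probit-transformed significant p values under the mixture
cut <- qnorm(.05)                            # -1.64485
f0 <- function(z) dnorm(z) / pnorm(cut)      # H0: standard normal, truncated at cut
f1 <- function(z, mu = -3, sd = 1)           # illustrative single H1 component
  dnorm(z, mu, sd) / pnorm(cut, mu, sd)
mixture <- function(z, phi) phi * f0(z) + (1 - phi) * f1(z)
z <- qnorm(seq(.0001, .0499, length.out = 200))
plot(z, mixture(z, phi = 0.5), type = "l",
     xlab = "Probit-Transformed P Value", ylab = "Density")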

The exact distribution of p values under H1 is more complex and depends on the type of statistical test used (i.e., the sampling distribution of the test statistic under H1), the sample size, and the values of the population parameters that are relevant for the test statistic (Becker, 1991). Furthermore, a given set of p values will likely contain studies with different sample sizes, different test statistics, and perhaps also different true effects, which implies that the exact distribution of p values will differ across studies. Consequently, the distribution of observed p values from H1 will itself be a mixture of different distributions, where the number of mixture components is expected to grow with the number of observed data points.

In the Bayesian framework, a model for the distribution of probit-transformed p values from H1 whose complexity can grow with the number of data points can be implemented using the Dirichlet process (Freedman, 1963; Ferguson, 1973, 1974) as a prior for an infinite normal mixture model. Dirichlet process mixtures for estimating the false discovery rate have been used before (e.g., Do, Müller, & Tang, 2005; Tang et al., 2007). However, these models in their proposed forms are not suited for modeling significant p values, since they do not respect the truncation of observed p values at .05. We present a model that is suited to modeling significant p values and provide the code of the JAGS (Plummer, 2003) implementation, which makes it easy to apply the model.

The Dirichlet Process

The Dirichlet process can be used as a prior on discrete probability distributions, i.e., every draw from a Dirichlet process yields a discrete probability distribution over countably infinitely many values. Although the Dirichlet process yields discrete probability distributions over countably infinitely many values, most of them will have probability mass close to zero. Where the probability mass is located is determined by a base distribution G0; the number of values with probability mass close to zero is controlled by a precision parameter α. A common application of the Dirichlet process is as a prior on the parameters of an infinite mixture model (e.g., Navarro, Griffiths, Steyvers, & Lee, 2006), where we observe data and assume that some of them will be similar to each other without placing an upper bound on the number of groups. Instead, we expect that as we observe more data, the number of groups will grow, which is exactly the behavior we expect p values from H1 to show.

We assume that every observed probit-transformed significant p value from H1, i.e., Φ⁻¹(p_i^H1), is drawn from a truncated normal distribution N(·|θ_i)T(,−1.64485), where the parameter vector θ_i consists of a mean and a variance component. The normal distribution is truncated at Φ⁻¹(.05) = −1.64485, since we only consider significant p values. The parameter vector θ_i for every probit-transformed p value is drawn from a probability distribution G(·). This probability distribution is itself a draw from a Dirichlet process with concentration parameter α and base distribution G0, i.e., the probability distribution G(·) is not fixed:

Φ⁻¹(p_i^H1) | θ_i ∼ N(·|θ_i)T(,−1.64485)
θ_i | G ∼ G(·)
G | G0, α ∼ DP(·|G0, α).

We assume independent base distributions for the mean and the variance component of θ_i: for the mean, we assume a truncated standard normal distribution N(0, 1)T(−6, 0), and for the inverse variance, we use a truncated gamma distribution Gamma(0.001, 0.001)T(,1).¹

¹ The normal distribution is truncated at zero because right-skewed distributions on the interval (0,1) map onto normal distributions with mean smaller than zero when the values are probit-transformed. The lower truncation at −6 is chosen for practical reasons because smaller values are very unlikely given plausible observed effect sizes. The truncation of the gamma distribution (on the precision) at one implies that the variance of the probit-transformed p value distributions under H1 cannot be smaller than the variance of the standard normal distribution corresponding to H0.
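For intuition about these base distributions, draws can be obtained in R by simple rejection sampling; the helper rtruncSample below is ours and is not part of the model code.

# Sketch: rejection sampling from the truncated base distributions (helper is ours)
rtruncSample <- function(n, rdist, lower = -Inf, upper = Inf) {
  out <- numeric(0)
  while (length(out) < n) {
    x <- rdist(n)
    out <- c(out, x[x > lower & x < upper])
  }
  out[1:n]
}
baseMu <- rtruncSample(10000, rnorm, lower = -6, upper = 0)   # N(0,1)T(-6,0)
baseLambda <- rtruncSample(10000, function(n) rgamma(n, 0.001, 0.001), upper = 1)
# baseLambda is a precision, so the implied variances 1/baseLambda are never smaller than 1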

The Dirichlet process can be represented in different ways (e.g., Blackwell & MacQueen, 1973); we will focus on the so-called stick-breaking construction (Sethuraman, 1994), which has the advantage that it allows one to implement the Dirichlet process in JAGS (Plummer, 2003) and WinBUGS (Spiegelhalter, Thomas, Best, & Lunn, 2003); for examples see §6.7 of Congdon (2006), Ohlssen, Sharples, and Spiegelhalter (2007), and §11.8 of Lunn, Jackson, Best, Thomas, and Spiegelhalter (2012). The stick-breaking construction exploits the fact that sampling from the Dirichlet process prior can be broken into two parts, where one part corresponds to generating weights (i.e., probability mass values for the discrete probability distribution) and the other to generating the corresponding locations of the probability mass.

The stick-breaking construction is illustrated in Figure 1. We start by sampling a value from the base distribution G0. In order to obtain the corresponding probability mass, assume that we have a stick of length one, which represents the total probability that can be assigned to the countably infinite number of values. We then sample a value from the beta distribution

q_i ∼ Beta(1, α),

break the stick at this value, and set the first weight π_1, corresponding to the first value from G0, equal to the length of the stick part that we broke off, i.e., π_1 = q_1. More illustratively, we take the part of the stick that we broke off, align it vertically, and place it on the value sampled from the base distribution G0. Afterwards, we sample another value from G0 and generate another q. We take the remaining part of the stick, which has length (1 − q_1), break off the proportion q_2, and in this way obtain the weight for the second component (i.e., the length of the stick part broken off), which is π_2 = q_2(1 − q_1). Again, we can think of placing this stick part on the location of the value sampled from the base distribution G0. In this manner we continue to obtain pairs of values and corresponding weights, which, when continued infinitely often, constructs a probability mass distribution over a countably infinite number of values. The weight for the j-th component is computed via

π_j = q_j Π_{k=1}^{j−1} (1 − q_k).

Figure 1. Illustration of the stick-breaking construction: the proportions q1, q2, q3, ... broken off the remaining stick yield the weights π1, π2, π3, ..., which are placed as probability masses on the locations sampled from the base distribution G0.

Since we started with a stick of length one, it is guaranteed that the lengths of the broken-off stick parts (i.e., the weights) add up to one, yielding a proper probability mass distribution. The individual parameter values θ_i for the observed p values from H1 are then sampled from the probability mass distribution so constructed, which leads to a clustering, i.e., parameter values with larger weights will be sampled more frequently. For practical purposes, the stick-breaking process is truncated (note that the maximum number of possible clusters for a given data set is equal to the number of data points). This means that we stop breaking the stick at some point and set the last component equal to the length of the remaining stick, which ensures a proper probability mass distribution.
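A minimal R sketch of this truncated construction, with an illustrative alpha and truncation level (mirroring the proportionsStick computation in the JAGS code in the Appendix):

# Sketch: one draw of truncated stick-breaking weights and locations (values illustrative)
alpha <- 2                      # precision parameter
J <- 20                         # truncation level (nMaxClusters in the Appendix)
q <- rbeta(J - 1, 1, alpha)     # proportions broken off the remaining stick
w <- numeric(J)                 # the weights pi_1, ..., pi_J
w[1] <- q[1]
for (j in 2:(J - 1)) w[j] <- q[j] * prod(1 - q[1:(j - 1)])
w[J] <- 1 - sum(w[1:(J - 1)])   # last component gets the remaining stick
mu <- -abs(rnorm(J))            # locations: N(0,1) truncated to negative values;
                                # the lower bound of -6 is practically irrelevant here
sum(w)                          # equals 1: a proper probability mass distribution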

The stick-breaking construction illustrates that as the number of components increases, the weights decrease stochastically, where the rate of this decrease is determined by the precision parameter α. Smaller values of α lead to larger parts being broken off at every step (i.e., only a few clusters have a high probability of being used for a given data set), whereas larger values of α lead to a smaller number of values with mass close to zero (i.e., a larger number of clusters). As Antoniak (1974) pointed out, rather than pre-specifying α beforehand, a hyperprior can be placed on α, which allows one to learn the concentration of the Dirichlet process from the data. For our model, we used a uniform prior over a plausible range of values, i.e., α ∼ Unif(0.5, 7).
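A quick R simulation with illustrative settings conveys how α governs the decay of the weights: it counts how many sticks must be broken before only 1% of the probability mass remains.

# Sketch: number of stick breaks until 1% of the mass remains (settings illustrative)
sticksUntil <- function(alpha, massLeft = .01) {
  remaining <- 1
  j <- 0
  while (remaining > massLeft) {
    j <- j + 1
    remaining <- remaining * (1 - rbeta(1, 1, alpha))
  }
  j
}
mean(replicate(1000, sticksUntil(0.5)))  # small alpha: mass concentrated on few components
mean(replicate(1000, sticksUntil(7)))    # large alpha: mass spread over many components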

Simulation Results

In order to test our model, we generated three data sets from independent samples t-tests, each consisting of 5,000 significant p values. Each t-test p value was obtained by sampling two groups of 250 participants each from normal distributions with standard deviations equal to one and means corresponding to a specified effect size. If a p value was not significant, it was discarded, two new groups were sampled, and a new p value was obtained. In this manner, we generated one data set with 50% H0 contamination (i.e., 2,500 of the 5,000 p values were generated from H0) for each of the effect sizes 0.15, 0.3, and 0.45.
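A minimal R sketch of this data-generating procedure (the function name simSigP is ours):

# Sketch: generate one significant two-sample t-test p value by resampling
simSigP <- function(effectSize, nPerGroup = 250) {
  repeat {
    x <- rnorm(nPerGroup, mean = 0, sd = 1)
    y <- rnorm(nPerGroup, mean = effectSize, sd = 1)
    p <- t.test(x, y, var.equal = TRUE)$p.value
    if (p <= .05) return(p)     # discard non-significant results and resample
  }
}
# 50% H0 contamination at effect size 0.3:
pValues <- c(replicate(2500, simSigP(0)),    # H0: effect size 0
             replicate(2500, simSigP(0.3)))  # H1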

The results of fitting the model are depicted in Figure 2. Each column corresponds to the results for one data set: the first column shows the results for the data set with effect size 0.15, the second column for effect size 0.3, and the third column for effect size 0.45. The first row displays the posterior distributions for the H0 assignment rate (i.e., φ): for effect sizes of 0.3 and 0.45, the model adequately recovers the true H0 contamination rate of 0.5. Furthermore, as the effect size increases, the uncertainty about the H0 assignment rate decreases (i.e., the posterior variance decreases). For effect size 0.15, the H0 assignment rate seems to be underestimated, as most posterior mass is below 0.5; however, there is also much more uncertainty, as the posterior variance is relatively large. Intuitively, it is expected that as the effect size gets smaller and smaller, it will be more difficult to estimate the H0 assignment rate adequately, since the components of the mixture model become more similar and hence harder to discriminate.


Figure 2. Depiction of the simulation results. The first column corresponds to a set of 5,000 p values generated from independent samples t-tests with effect size equal to 0.15, the second column to effect size 0.3, and the third column to effect size 0.45. All data sets were generated with a 0.5 H0 contamination rate. The first row depicts the posterior distributions for the H0 assignment rate, the second row the individual H0 assignment probabilities (p values from H0 are colored red, those from H1 are colored grey), and the third row displays Q-Q plots for a comparison between observed and posterior predictive p value distributions.

It is undesirable that the H0 assignment rate is not recovered adequately for a small effect size of 0.15. However, if the choice is between under- and overestimating the H0 contamination, we think it is better to err on the conservative side, which is the behavior the model shows (at least for this simulated data set).

The second row of Figure 2 shows the H0 assignment probabilities for the individual

p values. P values originating from H0 are colored red, p values from H1 are colored grey.

As the effect size increases, the model discriminates between p values from H0 and H1 more and more accurately.

The third row of Figure 2 displays Q-Q plots for a comparison between the distribution of observed p values and the distribution of generated posterior predictive p values, which allows one to gauge how well the model fits the data. If the distributions were exactly the same, the Q-Q plots would be straight lines with slopes equal to one. The Q-Q plots suggest a good fit of the model for all three data sets.
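Such a Q-Q plot can be produced in base R from the observed p values and posterior predictive draws of the predqp node monitored in the JAGS model of the Appendix (the variable names pValues and predQp are ours):

# Sketch: Q-Q plot of observed vs. posterior predictive p values
# pValues: observed significant p values; predQp: MCMC draws of predqp (probit scale)
predP <- pnorm(predQp)            # back-transform predictive draws to the p value scale
probs <- ppoints(500)
qqplot(quantile(pValues, probs), quantile(predP, probs),
       xlab = "Observed P Value Quantiles", ylab = "Predicted Quantiles")
abline(0, 1)                      # a perfect fit falls on the identity line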

Analysis of 587 T-Test P Values

We applied the model to a set of p values from Wetzels et al. (2011), who collected all of the empirical results in the 2007 volumes of Psychonomic Bulletin & Review (PBR) and the Journal of Experimental Psychology: Learning, Memory, and Cognition (JEP:LMC) that were analyzed with a t-test, yielding 855 t-tests from 252 articles. We considered the significant subset of these 855 t-test p values, which resulted in a set of 587 significant p values.

Figure 3 displays the results of the application of our model. The upper-left panel shows the distribution of the 587 significant p values. This distribution is clearly right-skewed; most p values are smaller than .005, which suggests a small percentage of H0 contamination. The posterior distribution of φ, i.e., the H0 assignment rate, corroborates this first visual impression (upper-right panel of Figure 3): the contamination rate is estimated to be very small, with a 95% highest density interval ranging from 0.003 to 0.150. When considering the H0 assignment probabilities for the individual p values (lower-left panel of Figure 3), we observe that none of them is estimated to be above .5, which means that for all of the observed p values we are more confident that they stem from H1 than from H0. The lower-right panel of Figure 3 displays a Q-Q plot for a comparison between the distribution of observed p values and the distribution of generated posterior predictive p values. The Q-Q plot suggests a good fit of the model. Note that for the given set of p values, we do not assume the existence of one “true” underlying effect size, since the studies were selected by the journals they appeared in and not necessarily by topic. Hence, it is expected that the distribution of p values that stem from H1 is a mixture of many components, which led us to the Dirichlet process mixture, which can accommodate a many-component mixture of p values that stem from H1.


Figure 3. Depiction of the results for the t-test p values. The upper-left panel shows the distribution of observed p values, the upper-right panel depicts the posterior of the H0 assignment rate, the lower-left panel plots the individual H0 assignment probabilities, and the lower-right panel displays a Q-Q plot for the comparison between the observed and posterior predictive p value distributions.

The Q-Q plot suggests that our model is able to account for this heterogeneity quite accurately.

Analysis of Priming and Control Studies

We also applied the model to a set of p values from priming studies and a set of “control” p values. We collected all articles published by a large group of prominent “social priming” researchers. From each experiment, we distilled a single significant p value, in line with the p-curve instructions from Uri Simonsohn (Simonsohn et al., 2014). Each p value was considered by three raters; occasional differences of opinion were readily resolved by discussion. As a control for the results from the social priming studies, we also distilled single significant p values from experiments reported in yoked control studies. That is, every time we selected a priming study we also selected, as a yoked control, another study on a different topic published immediately after the priming study.


Figure 4. Depiction of the results for the priming and control p values. The upper-left panel shows the distributions of observed p values, the upper-right panel depicts the posteriors of the H0 assignment rate, the lower-left panel plots the individual H0 assignment probabilities, and the lower-right panel displays a Q-Q plot for the comparison between the observed and posterior predictive p value distributions.

The procedure for distilling p values from the yoked controls was identical to the one used for the priming studies. This procedure yielded 268 significant priming p values and 227 significant control p values.

The results of the analysis are depicted in Figure 4. The upper-left panel shows the distributions of priming and control p values; both are right-skewed. However, the right-skewness of the distributions is not as pronounced as for the t-test p values presented in the previous section. Furthermore, the distribution corresponding to the priming studies is less right-skewed than the distribution corresponding to the control studies. The upper-right panel of Figure 4 displays the posterior distributions of the H0 assignment rate. For both sets of p values, a substantial amount of H0 contamination is estimated. The H0 assignment rate for the priming p values is estimated to be larger than for the control p values: a 95% highest density interval ranges from 0.379 to 0.839 for the priming p values and from 0.209 to 0.561 for the control p values. The H0 assignment probabilities for the individual p values (lower-left panel of Figure 4) are estimated to be above .5 for most of the priming and control p values. The Q-Q plot (lower-right panel of Figure 4) indicates a good fit for the control p values, whereas the fit for the priming p values is suboptimal. This is an important point: the model suggests that a high percentage of the p values from the priming studies is likely to stem from H0; however, as the Q-Q plot indicates, the model does not seem able to fully account for the observed data pattern. Furthermore, the posterior distribution of the H0 assignment rate is much more spread out than for the analysis of the t-test p values. Hence, the results of this analysis should be interpreted carefully. Nevertheless, this analysis suggests that it might be worthwhile to take a closer look at the effects reported in this set of priming studies.

Concluding Comments

Drawing conclusions from hypothesis tests is one of the most fundamental building blocks of science. However, there are several reasons why a hypothesis test can indicate the presence of an effect when there is none. A tool that allows one to assess the trustworthiness of a set of hypothesis tests could improve scientific practice by detecting spurious phenomena or questionable research practices. Simonsohn et al. (2014) introduced p-curve as a tool to assess whether a set of significant hypothesis tests contains evidential value. In this article, we have proposed a Bayesian mixture model that grades the evidence contained in a set of studies in a more fine-grained manner, by providing estimates of the H0 contamination rate and by enabling one to estimate how likely it is that a particular study finding stems from the null hypothesis. By providing the JAGS code of the model, we hope to encourage researchers to apply the model within their field of interest to grade the existing evidence.

References

Antoniak, C. E. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2, 1152–1174.

Becker, B. J. (1991). Small-sample accuracy of approximate distributions of functions of observed probabilities from t tests. Journal of Educational Statistics, 16, 345–369.

Blackwell, D., & MacQueen, J. B. (1973). Ferguson distributions via Pólya urn schemes. The Annals of Statistics, 1, 353–355.

Congdon, P. (2006). Bayesian statistical modelling (2nd ed.). Chichester, UK: Wiley.

Do, K.-A., Müller, P., & Tang, F. (2005). A Bayesian mixture model for differential gene expression. Journal of the Royal Statistical Society: Series C (Applied Statistics), 54, 627–644.

Efron, B. (2012). Large-scale simultaneous hypothesis testing. Journal of the American Statistical Association, 99, 96–104.

Ferguson, T. S. (1973). A Bayesian analysis of some nonparametric problems. The Annals of Statistics, 1, 209–230.

Ferguson, T. S. (1974). Prior distributions on spaces of probability measures. The Annals of Statistics, 2, 615–629.

Freedman, D. A. (1963). On the asymptotic behavior of Bayes’ estimates in the discrete case. The Annals of Mathematical Statistics, 34, 1386–1403.

John, L. K., Loewenstein, G., & Prelec, D. (2012). Measuring the prevalence of questionable research practices with incentives for truth-telling. Psychological Science, 23, 524–532.

Kerr, N. L. (1998). HARKing: Hypothesizing after the results are known. Personality and Social Psychology Review, 2, 196–217.

Lunn, D., Jackson, C., Best, N., Thomas, A., & Spiegelhalter, D. (2012). The BUGS book: A practical introduction to Bayesian analysis. Boca Raton, FL: Chapman & Hall/CRC.

Navarro, D. J., Griffiths, T. L., Steyvers, M., & Lee, M. D. (2006). Modeling individual differences using Dirichlet processes. Journal of Mathematical Psychology, 50, 101–122.

Ohlssen, D. I., Sharples, L. D., & Spiegelhalter, D. J. (2007). Flexible random-effects models using Bayesian semi-parametric models: Applications to institutional comparisons. Statistics in Medicine, 26, 2088–2112.

Pashler, H., & Wagenmakers, E.-J. (2012). Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence? Perspectives on Psychological Science, 7, 528–530.

Plummer, M. (2003). JAGS: A program for analysis of Bayesian graphical models using Gibbs sampling. In K. Hornik, F. Leisch, & A. Zeileis (Eds.), Proceedings of the 3rd International Workshop on Distributed Statistical Computing. Vienna, Austria.

Rosenthal, R. (1979). The file drawer problem and tolerance for null results. Psychological Bulletin, 86, 638–641.

Sethuraman, J. (1994). A constructive definition of Dirichlet priors. Statistica Sinica, 4, 639–650.

Simonsohn, U., Nelson, L. D., & Simmons, J. P. (2014). P-curve: A key to the file drawer. Journal of Experimental Psychology: General, 143, 534–547.

Spiegelhalter, D. J., Thomas, A., Best, N., & Lunn, D. (2003). WinBUGS version 1.4 user manual. Cambridge, UK: Medical Research Council Biostatistics Unit.

Sterling, T. D. (1959). Publication decisions and their possible effects on inferences drawn from tests of significance–or vice versa. Journal of the American Statistical Association, 54, 30–34.

Tamhane, A. C., & Shi, J. (2009). Parametric mixture models for estimating the proportion of true null hypotheses.

Tang, Y., Ghosal, S., & Roy, A. (2007). Nonparametric Bayesian estimation of positive false discovery rates. Biometrics, 63, 1126–1134.

Wagenmakers, E.-J. (2007). A practical solution to the pervasive problems of p values. Psychonomic Bulletin & Review, 14, 779–804.

Wetzels, R., Matzke, D., Lee, M. D., Rouder, J. N., Iverson, G. J., & Wagenmakers, E.-J. (2011). Statistical evidence in experimental psychology: An empirical comparison using 855 t tests. Perspectives on Psychological Science, 6, 291–298.

Appendix: JAGS Implementation

In order to use the model code below, the following data need to be passed to JAGS: qp, the probit-transformed p values (in R, these can be obtained via qnorm(pValues)); n, the number of observed p values; and nMaxClusters, the maximal number of clusters for the Dirichlet process mixture representation of H1, which determines where to truncate the stick-breaking process. A natural upper bound is the number of observations; however, in most cases a much smaller number suffices. We used nMaxClusters between 10 and 20. It is recommended to start with a small number and then check whether the maximum number of clusters is used at any iteration of the chain. If so, the number should be increased until there is a buffer.

model {

  ### Likelihood
  for (i in 1:n) {
    ind[i] ~ dbern(phi)              # phi is the H0 assignment rate
    z[i] ~ dcat(proportionsStick[])  # indicator for cluster of H1
    # add one so that the indicator for the cluster of H1
    # starts with 2 (1 is needed to indicate H0)
    z1[i] <- z[i] + 1
    # if assigned to H0, z2 = 1; else if assigned to H1,
    # assign to cluster z1 in [2, 3, ..., nMaxClusters+1]
    z2[i] <- ifelse(ind[i] == 1, 1, z1[i])
    qp[i] ~ dnorm(mu[z2[i]], lambda[z2[i]])T(,-1.64485)
  }

  ### Posterior predictive
  predind ~ dbern(phi)
  predz ~ dcat(proportionsStick[])
  predz1 <- predz + 1
  predz2 <- ifelse(predind == 1, 1, predz1)
  predqp ~ dnorm(mu[predz2], lambda[predz2])T(,-1.64485)

  ### Stick-breaking process (truncated Dirichlet process)
  # generate proportions to break off the remaining stick
  for (j in 1:(nMaxClusters-1)) {
    # proportions of the remaining stick parts
    proportionsRemainingStick[j] ~ dbeta(1, alpha)
  }
  # Break the stick (translate the proportions to the proportions with regard
  # to the original stick; length of the original stick = 1)
  proportionsStick[1] <- proportionsRemainingStick[1]
  for (j in 2:(nMaxClusters-1)) {
    # calculate proportions of the original stick
    proportionsStick[j] <- proportionsRemainingStick[j]
                           * prod(1 - proportionsRemainingStick[1:(j-1)])
  }
  sumProportionsStick <- sum(proportionsStick[1:(nMaxClusters-1)])
  # make sure that the proportions sum to 1
  proportionsStick[nMaxClusters] <- 1 - sumProportionsStick

  ### Priors
  for (j in 1:nMaxClusters) {
    muH1[j] ~ dnorm(0, 1)T(-6, 0)
    lambdaH1[j] ~ dgamma(0.001, 0.001)T(,1)
  }
  mu[1] <- 0
  mu[2:(nMaxClusters+1)] <- muH1
  lambda[1] <- 1
  lambda[2:(nMaxClusters+1)] <- lambdaH1

  # Prior for the precision of the Dirichlet process; larger values lead to
  # more clusters, smaller values lead to fewer distinct clusters
  alpha ~ dunif(0.5, 7)
}
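For completeness, here is a minimal sketch of how the model might be run from R with the rjags package; the file name mixtureModel.txt and the monitored nodes are our choices.

# Sketch: fitting the mixture model from R via rjags (file name is ours)
library(rjags)
qp <- qnorm(pValues)                          # probitize the significant p values
dataList <- list(qp = qp, n = length(qp), nMaxClusters = 15)
jagsModel <- jags.model("mixtureModel.txt", data = dataList, n.chains = 3)
update(jagsModel, 1000)                       # burn-in
samples <- coda.samples(jagsModel,
                        variable.names = c("phi", "ind", "predqp"),
                        n.iter = 10000)
phiDraws <- as.matrix(samples)[, "phi"]       # posterior of the H0 assignment rate
hist(phiDraws, xlab = "H0 Assignment Rate", main = "")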
