
Pulling the rabbit out of the hat: How questionable research practices can make insignificant results look significant

Research Master's thesis, Psychological Research Methods

Name student: Paul Lodder

Student ID: 6078362

Internal supervisor: Raoul Grasman
External supervisor: Jelte Wicherts


Abstract

Recently, researchers have become more aware of how rare direct replications are within psychology. This has ignited a number of large-scale replication studies, such as the Many Labs Replication Project (MLRP; Klein et al., 2014a). One of the replications in that project, the currency priming replication, showed a null effect, while the earlier literature using currency priming manipulations showed significant effects. This incongruence might be explained by “hidden” moderators (Moderator scenario), or by publication bias and the use of questionable research practices (PB/QRP scenario) (Ioannidis, 2005; Simmons, Nelson & Simonsohn, 2011). QRPs have been shown to be used by the majority of researchers (John, Loewenstein & Prelec, 2012), and simulation studies have shown that QRPs can lead to substantial bias in effect size estimates and to an inflation of false positive findings (Ioannidis, 2005; Simmons, Nelson & Simonsohn, 2011).

The first aim of our study is to investigate what influence QRPs and publication bias can have on conclusions drawn from the analysis of real data. We applied different QRP and publication bias scenarios to the MLRP currency priming data that showed a null effect. In line with earlier work (Bakker, van Dijk & Wicherts, 2012), we expected that scenarios with publication bias and QRPs would lead to more false positives and more effect size bias. Our results show that more QRPs lead to more false positives, and that this pattern becomes more extreme in the presence of publication bias. Without publication bias, effect size estimates did not show any bias, regardless of the number of QRPs used. If publication bias was present, however, effect size estimates became severely biased, even after the application of a single QRP.

The second aim of our study was to investigate the incongruence between the MLRP data and the literature on currency priming by performing a meta-analysis inspecting whether the findings are best explained by the Moderator or the PB/QRP scenario. In our meta-analysis we found a significant positive effect of the currency priming manipulation. However, we also observed publication bias and an excess of significant results, possibly indicating the use of QRPs. After taking only high-powered studies into account, the effect size estimate decreased from Hedges' g = .44 to .03. This difference in effect size could be explained as bias resulting from the use of QRPs, although this matter remains inconclusive. If publication bias is present in a field, we advise including only high-powered studies in meta-analyses, because lower powered studies can provide biased effect size estimates (Nuijten, van Assen & Wicherts, in preparation). We end with a number of recommendations to lower the risk of publication bias and the use of QRPs.


Introduction

Science is in a reproducibility crisis (Lehrer, 2010; Yong, 2012). Over the past few years, an increasing number of studies showed that so-called ‘established facts’ proved difficult to replicate (Begley & Ellis, 2012; Prinz, Schlange, & Asadullah, 2011). This crisis is evident in numerous

scientific disciplines (Ioannidis, Ntzani & Trikalinos, 2001; Rousseau & Porto, 1970; Taubes,

1993), but perhaps most obvious in experimental social psychology (Pashler & Wagenmakers,

2012). The problems regarding reproducibility in that field have received a lot of attention, partly due to the high-profile fraud case of Diederik Stapel, a social psychologist who committed fraud in over fifty scientific publications (Levelt, Noort & Drenth, 2012). Other

examples of well-known experiments that could not be replicated include studies on extrasensory

perception (Wagenmakers, Wetzels, Borsboom & van der Maas, 2011), unconscious thought

theory (Huizenga, Wetzels, van Ravenzwaaij & Wagenmakers, 2012) and social priming (Doyen,

Klein, Pichon, & Cleeremans, 2012; Shanks et al., 2013).

Within psychology, the dominant approach to drawing conclusions from data is null hypothesis significance testing (NHST; Cumming et al., 2007). A researcher who uses NHST first proposes a null hypothesis of no effect and then computes the probability of finding an effect at least as extreme as the one observed, given that the null hypothesis is true. Subsequently, the researcher checks whether this probability, also called the p-value, is below a predetermined threshold called the significance level. If this is the case, then the researcher rejects the null hypothesis and concludes that the result is significant.
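This decision rule can be illustrated with a minimal R sketch (simulated scores, not data from any study discussed here):

# Minimal illustration of the NHST decision rule on simulated data
set.seed(1)
control      <- rnorm(30)                 # scores in the control condition
experimental <- rnorm(30)                 # scores in the experimental condition (no true effect here)
result <- t.test(experimental, control)   # Welch two-sample t-test
result$p.value                            # the p-value
result$p.value < .05                      # reject the null hypothesis if below the significance level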

The significance level denotes the probability that the researcher falsely rejects the null hypothesis (i.e. a false positive finding). A false positive is an effect believed to be genuine, but in fact merely the result of chance. As a convention, most researchers accept a false positive rate of 5%. Yet most researchers are unaware that this error rate in practice often exceeds 5% (Ioannidis, 2005; Simmons, Nelson & Simonsohn, 2011), which means that false positive findings are more likely than most researchers believe them to be. This led John Ioannidis (2005), for instance, to claim that most published research findings are false. The probability that a particular research claim is indeed true (in the sense that the research finding accurately describes the underlying effect) does not only depend on the chosen significance level, but also on the statistical power and the prior probability of that finding being true (Wacholder, Chanock, Garcia-Closas & Rothman, 2004).
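This dependence can be made explicit with the standard expression for the probability that a significant finding reflects a true effect (the positive predictive value); the sketch below is illustrative only and the numbers are not estimates from our data:

# Probability that a significant finding reflects a true effect,
# given the significance level, statistical power and prior probability of a true effect
ppv <- function(alpha, power, prior) {
  (power * prior) / (power * prior + alpha * (1 - prior))
}
ppv(alpha = .05, power = .80, prior = .50)   # well-powered test of a plausible hypothesis: ~.94
ppv(alpha = .05, power = .35, prior = .10)   # underpowered test of a long-shot hypothesis: ~.44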

This underestimated chance of a false positive finding is even more worrisome given that direct replications are uncommon in psychology (Makel, Plucker, & Hegarty, 2012) and social science in general1 (Mahoney, 1985; Schmidt, 2009). In a direct replication, researchers aim to copy the methodology of the original study as closely as possible2. Without direct replications, researchers have to base their conclusions about a particular research question on a single study that may well be a false positive. Unfortunately, scientific journals prefer to publish ‘new’ findings and are therefore hesitant to accept direct replications (Neuliep & Crandall, 1990), which discourages researchers from performing such replication studies and impedes publication of replications if they are done.

Besides preferring new findings, journals also prefer to publish results that are positive in the sense that the null hypothesis is rejected. In fact, having significant results increases the probability that those results are published (Mahoney, 1977), and hence within psychology more than

90% of the published scientific articles report significant results (Fanelli, 2010). This high percentage is at odds with the low statistical power of the average psychology study (Bakker, van Dijk & Wicherts, 2012), which makes it likely that non-significant findings have disappeared into a file drawer (Rosenthal, 1979). This phenomenon is called publication bias and its existence is corroborated by estimates that at least 50% of psychological studies remain unpublished (Cooper, DeNeve & Charlton, 1997; Coursol & Wagner, 1986; Shadish, Doherty & Montgomery, 1989).

1 This shortage of direct replications, however, does not apply to every field within psychology. Direct replications have, for instance, been much more common in the cognitive neurosciences than in, say, experimental social psychology.

2 As opposed to a conceptual replication, where researchers investigate a concept similar to that of the replicated study, but use a different methodology (e.g. a different manipulation or dependent measure). See Pashler & Harris (2012) for a discussion on the value of direct vs. conceptual replications.

Because most journals prefer new and positive findings, and because a career in science

depends on publishing in journals (Nosek, Spies & Motyl, 2012), researchers are incentivized to

conduct novel experiments, avoid direct replications, and maximize the number of significant findings. Metaphorically, one could view these incentives as the rules by which researchers have

to play the game called science (Bakker et al., 2012; Mahoney, 1976). If researchers want to win

this game, then they should play strategically and focus on achieving as many novel findings as possible with p-values below 0.05. Given that the conventional 5% significance level in practice

often implies a false positive rate larger than 5% (Ioannidis, 2005), researchers could play the

science game strategically by making decisions that increase the chance of false positives and thus the chance of a p-value below 0.05.

During a scientific experiment, researchers have numerous decisions to make. Some of these decisions can increase the chances of finding false positives, especially if researchers do not

make these decisions beforehand (Simmons, Nelson & Simonsohn, 2011). For instance, during

data collection, researchers might analyze their data to see whether the results are significant and if not, decide to collect more data until significance is reached. Other examples are reporting only studies that “work” (i.e. studies that show statistically significant effects), excluding observations based on post hoc criteria or testing multiple dependent variables and reporting only those that are significant. These decisions are called questionable research practices (QRPs) and it turns out

that the majority of psychological researchers admit to having used at least one of them (John,

Loewenstein & Prelec, 2012).
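A small simulation sketch (ours, for illustration only) shows how even a single such practice inflates the false positive rate: if two dependent variables correlated r = .5 are tested under a true null effect and whichever is significant is reported, the nominal 5% error rate is clearly exceeded:

# False positive rate when two correlated DVs are tested and the best p-value is reported
set.seed(1)
one_run <- function(n = 25, r = .5) {
  dv1_con <- rnorm(n); dv1_exp <- rnorm(n)           # DV 1 in control and experimental group (no true effect)
  dv2_con <- r * dv1_con + sqrt(1 - r^2) * rnorm(n)  # DV 2, correlated with DV 1
  dv2_exp <- r * dv1_exp + sqrt(1 - r^2) * rnorm(n)
  p1 <- t.test(dv1_exp, dv1_con)$p.value
  p2 <- t.test(dv2_exp, dv2_con)$p.value
  min(p1, p2) < .05                                  # 'significant' if either DV reaches p < .05
}
mean(replicate(10000, one_run()))                    # roughly .09-.10 instead of the nominal .05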

In a simulation study, Bakker and colleagues (2012) showed that the use of QRPs and the presence of publication bias can lead to biased effect size estimates and increased false positive rates. They proposed four research strategies based on the presence or absence of publication bias and QRPs. Researchers following strategy 1 perform one large study and publish it. In strategy 2, researchers perform one large study, but also use up to three popular QRPs. Researchers who use strategy 3 do not use QRPs, but perform five small studies instead of one large study and publish the first significant result they find. Lastly, with strategy 4 researchers perform five small studies, use up to three popular QRPs, and publish the first significant result. It turned out that the bias in effect size is largest when researchers adopt strategy 3 or 4; strategy 1 did not lead to bias, and strategy 2 only when the true effect size was small. Furthermore, strategies 2 and 4 (the use of QRPs) resulted in the highest proportions of false positives. The authors conclude that if researchers play the science "game" strategically, then they should choose the research strategies that are optimal for obtaining a significant effect. This might also explain why so many studies within psychology are underpowered, yet frequently report statistically significant results.

A limitation of that simulation study (Bakker et al., 2012) is that it remains unclear what influence different numbers of QRPs have on the bias in effect size and the false positive rate. Therefore, in the present study we aim to distinguish between different numbers of QRPs. We also extend the results of Bakker and colleagues by using real data instead of simulated data. Specifically, we will use a real dataset containing a convincing null effect to investigate whether we can create a significant effect by applying questionable research practices to samples of the data. We will sample according to a total of eight different scenarios (see Table 1), ranging from clean samples with no QRPs to highly manipulated samples with several QRPs and the presence of publication bias. We hypothesize that highly manipulated scenarios with more QRPs and publication bias will lead to inflated effect sizes, more false positive findings, and a mean effect size estimate that significantly differs from zero.


In our study, we will use null effect data from a large replication (N=6344; Klein et al., 2014a) of a currency priming study (Caruso, Vohs, Baxter & Waytz, 2013). For comparison with the literature, a second aim of our project is to conduct a meta-analysis on earlier published experiments that use such a currency priming manipulation. Most of these reject the null

hypothesis of no effect, which contradicts the results of the large replication project (Klein et al.,

2014a) in which only one of the 36 replication studies rejected the null hypothesis. These

conflicting results may therefore indicate that the earlier published literature on currency priming suffers from publication bias. Another possibility concerns the presence of "hidden moderators": important differences between the currency priming studies that we do not yet understand and that have little bearing on QRPs. We will use meta-regression to investigate the influence of some moderators (e.g. prime type or dependent variable), but the possibility remains that moderators of which we are not aware explain differences between studies.

We will investigate the presence of publication bias by looking, among other things, at funnel

plot asymmetry (Sterne & Egger, 2001). We will also assess the presence of selective analysis

reporting (p-hacking) by inspecting the distribution of significant p-values (Simonsohn, Nelson &

Simmons, 2013) and testing whether the number of significant p-values is larger than expected

given the power of the currency priming studies (Ioannidis & Trikalinos, 2007). In this article, we

aim to assess whether the conflict between the results of the published studies on currency priming and those of the Many Labs Replication Project (Klein et al., 2014a) can better be explained by a (“hidden”) moderator (Moderator scenario), or by publication bias and questionable research practices (PB/QRP scenario).

Method

Materials & sample characteristics

The original currency priming study (Caruso, Vohs, Baxter & Waytz, 2013) that was chosen to be

replicated showed that priming people with images of their home country’s currency made those people endorse their country’s prevailing social system more strongly than people who were primed with a noisy control image (n = 30, t = 2.12, p = .043, d = 0.80). The effect size of 0.80 seems impressive, but its confidence interval shows a more nuanced picture (95% CI = [0.05,

1.54]). The Many Labs Replication Project (MLRP; Klein et al., 2014a) did not replicate the

currency priming effect (n = 6333, t = -.79, p = .83, d = .01, 99% CI = [-0.06, 0.09]).

In our study, we used this MLRP dataset (Klein et al., 2014b), which is freely available for

download from the Open Science Framework website (http://osf.io/project/WX7Ck/). The

dataset contains replications of 15 psychology experiments, conducted by 36 different teams all over the world, resulting in a total sample size of 6344 participants. We focused on the replications of the currency priming experiment, thereby ignoring the data of the other 14 experiments.

Procedure

In order to inspect the possibility and severity of publication bias in this field of study, we performed a meta-analysis of the (un)published literature on currency priming (Boucher & Kofos, 2012; Caruso, Vohs, Baxter & Waytz, 2013; Chatterjee & Rose, 2012; Chatterjee, Rose & Sinha, 2013; Gąsiorowska & Hełka, 2012; Gasiorowska, Zaleskiewicz & Wygrab, 2012;

Gasiorowska, Zaleskiewicz, Wygrab, Chaplin & Vohs, 2014; Gino & Mogilner, 2013; Hansen, Kutzner & Wänke, 2013; Liu, Smeesters & Vohs, 2012; Mogilner, 2010; Mogilner & Aaker, 2009; Mukherjee, Manjaly & Nargundkar, 2013; Roberts & Roberts, 2012; Tong, Zheng & Zhao, 2013;

Vohs, Mead & Goode, 2006, 2008; Zaleskiewicz et al., 2013; Zhou, Vohs & Baumeister, 2008).

These studies are similar in that all experiments use a currency priming manipulation, but they differ with respect to the dependent measures. Although the homogeneous null effect of the MLRP data (Klein et al., 2014a) would make a fixed effects meta-analysis seem appropriate, we chose to conduct our meta-analysis according to a random effects model based on the guidelines

provided by Borenstein and colleagues (2009). They state that one requirement for a fixed effects meta-analysis is that the included studies are functionally equivalent. However, if the studies included in a meta-analysis “have been performed by researchers operating independently, it would be unlikely that all the studies were functionally equivalent. Typically, the subjects or interventions in these studies would have differed in ways that would have impacted on the results, and therefore we should not assume a common effect size. Therefore, in these cases the random effects model is more easily justified than the fixed effects model” (Borenstein et al., 2009, p. 84). The Many Labs Replication Project showed a homogeneous null effect but only focused on one of the many dependent variables reported in the literature. Therefore, the MLRP studies might not be functionally equivalent to other studies in the currency priming literature. Furthermore, the effects found in currency priming studies might vary because of the different dependent measures, different research teams and different subject pools. In light of these differences between study designs, we consider a random effects model to be the most appropriate for our meta-analysis.

We systematically searched for publications on currency priming by entering the following search terms in both PsycINFO and Google scholar: (currency OR money) AND (priming OR prime*). We also looked at reference lists of already found articles. Our inclusion criteria were that the study should randomly assign participants to the experimental and control conditions. Studies should use a money prime in the experimental condition and compare these primed participants on some outcome variable with participants in a control condition. We subsequently e-mailed the authors of articles that met our inclusion criteria and asked them for (un)published

material suitable for our meta-analysis3. We calculated the necessary meta-analytic statistics for all

the studies included in our analysis and whenever necessary we e-mailed the authors of articles

that did not report the information we needed for our analysis4.

We performed all our analyses with the open source software R (https://www.r-project.org/) and used the metafor package (Viechtbauer, 2010) for the meta-analyses. We used

multiple techniques to check for publication bias in the currency priming literature meta-analysis: (1) We tested for funnel plot asymmetry by regressing study outcomes on the standard error of

the effect size (Sterne & Egger, 2001; 2005). (2) We tested whether the distribution of observed

p-values in the currency priming literature differed from the expected distribution of p-values

(Simonsohn, Nelson & Simmons, 2013). (3) We tested whether the number of observed

significant results deviated from the expected number of significant results (Ioannidis &

Trikalinos, 2007).
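A sketch of how this analysis can be set up with metafor is given below; the column names (m1i, sd1i, n1i, and so on) are placeholders for the summary statistics we extracted, not the actual variable names in our data files:

library(metafor)

# dat: one row per experiment with group means, SDs and sample sizes (placeholder column names)
dat <- escalc(measure = "SMD",                         # bias-corrected standardized mean difference (Hedges' g)
              m1i = m1i, sd1i = sd1i, n1i = n1i,       # money prime condition
              m2i = m2i, sd2i = sd2i, n2i = n2i,       # control condition
              data = dat)

res <- rma(yi, vi, data = dat, method = "REML")        # random effects meta-analysis
summary(res)

regtest(res, predictor = "sei")                        # Egger-type test for funnel plot asymmetry
funnel(res)                                            # funnel plot of effect sizes against standard errors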

In the second step of our project, we recalculated the MLRP meta-analysis performed by

Klein et al. (2014a) on the data of the 36 teams that replicated the currency priming study. We

did this to make sure that the statistics that we used in our simulations (step 3) matched those calculated by the authors of the Many Labs replication project.

In the third step of our project, we drew samples from each of the 36 portions of the data, according to the different scenarios outlined in Table 1. In scenarios without publication bias, we drew one large sample (N=160) with replacement from each of the 36 currency priming datasets and in scenarios with publication bias we drew four small samples (N=160/4=40) with

replacement from each of the 36 datasets and reported only the result of the sample with the lowest p-value among effects in the right direction (i.e. a higher endorsement of the social system in the currency priming condition than in the control condition)5. In addition, we applied different combinations of questionable research practices to both the publication bias and the no publication bias samples.

3 Although we aimed to find as many studies as possible, we are aware of the file-drawer effect (Rosenthal, 1979), so we realize that there might still be some relevant studies that we did not include in our analysis. We therefore invite researchers to contact us if they have any additional material that was not included in our meta-analysis.

4 For each study in our meta-analysis we used an effect size estimate; the variance of the effect size estimate; the NHST p-value of the statistical test on the difference between the experimental and control conditions; and some information to test for moderators in a meta-regression (journal, impact factor, citations, year of publication, lab study or online, prime type, dependent variable type).
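The sampling step for a single MLRP team can be sketched as follows; team_data stands for one of the 36 currency priming datasets and analyse() for a helper (not shown) that returns the p-value and the direction of the effect in a drawn sample, both hypothetical names:

# Sketch of the sampling step for one MLRP team (hypothetical helper and object names)
draw_sample <- function(team_data, n) {
  team_data[sample(nrow(team_data), n, replace = TRUE), ]        # resample rows with replacement
}

sample_scenario <- function(team_data, pub_bias = FALSE) {
  if (!pub_bias) {
    return(draw_sample(team_data, 160))                           # one large study of N = 160
  }
  samples <- lapply(1:4, function(i) draw_sample(team_data, 40))  # four small studies of N = 40
  results <- lapply(samples, analyse)                             # p-value and effect direction per sample
  p <- sapply(results, function(r) ifelse(r$direction > 0, r$p, Inf))
  samples[[which.min(p)]]                                         # report only the lowest right-direction p-value
}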

Table 2 summarizes the questionable research practices that we applied to the currency priming null effect data, in the order of the pervasiveness of their application by researchers (John et al., 2012). The first QRP was to test a second dependent variable that is correlated with the primary dependent variable and to report only the one that showed a significant effect (if both variables are significant we report the variable with the lowest p-value). This QRP has a

self-admittance rate of 65% among psychology researchers (John et al., 2012). In our study, we

applied this QRP by splitting the eight-item system justification scale into two halves, thereby creating two correlated dependent variables (we checked whether this correlation was statistically significant).

Table 1: Eight sampling scenarios to apply to the null effect data

Scenario   Publication bias   QRPs
1          No                 0
2          No                 1
3          No                 1 & 2
4          No                 1 & 2 & 3
5          Yes                0
6          Yes                1
7          Yes                1 & 2
8          Yes                1 & 2 & 3

Table 2: Three questionable research practices

QRP   Description
1     Test a second dependent variable correlated with the primary dependent variable
2     Add 10 participants until significance is reached (sequential testing)
3     Remove outliers based on post hoc criteria

The second QRP concerns sequential testing, which means that researchers check the data for a significant effect and, if there is none, add 10 more participants than initially planned, iterating this procedure until significance is reached. Fifty-seven percent of psychology researchers admit to having done this at least once (John et al., 2012). In our study we added 10 participants and checked whether the results were significant. If not, then we added another 10 participants and repeated this process up to a total of five times. If after five iterations the results still failed to reach significance, we used the result with the lowest p-value. The third and last QRP involved deleting outliers after checking their influence on the results. If the results showed no significant effect, then we removed outliers based on Z-scores. Forty-one percent of psychology researchers admit to having used this QRP at least once (John et al., 2012). In our study, we chose to remove outliers with an absolute Z value larger than 2, the same threshold used by Bakker et al. (2012).
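A sketch of how these three practices can be applied to a drawn sample is shown below; the column names (condition, sys_just1, sys_just2) and the extra_pool object (a reservoir of additional participants to draw from) are illustrative, not the names used in our scripts:

# p-value of the condition difference for one dependent variable
p_for <- function(dat, dv) t.test(dat[[dv]] ~ dat$condition)$p.value

# QRP 1: test both halves of the system justification scale, keep the lowest p-value
qrp_two_dvs <- function(dat) {
  min(p_for(dat, "sys_just1"), p_for(dat, "sys_just2"))
}

# QRP 2: add 10 participants at a time (at most five times) until significance is reached
qrp_sequential <- function(dat, extra_pool) {
  p <- qrp_two_dvs(dat)
  for (i in 1:5) {
    if (p < .05) break
    dat <- rbind(dat, extra_pool[sample(nrow(extra_pool), 10), ])
    p <- min(p, qrp_two_dvs(dat))                   # keep the lowest p-value obtained so far
  }
  list(data = dat, p = p)
}

# QRP 3: if still not significant, drop observations with |Z| > 2 on the outcome and retest
qrp_outliers <- function(dat) {
  z <- as.vector(scale(dat$sys_just1))
  qrp_two_dvs(dat[abs(z) <= 2, ])
}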

After applying the eight different scenarios to samples of the replication data, we tested to what extent these scenarios led to effect size bias and false positive inflation. We expected that –

building upon the results of Bakker and colleagues (2012) – scenarios with more QRPs and

multiple small studies instead of one large study would lead to more bias in the effect size and a higher percentage of false positive findings.

Regarding the meta-analysis of the currency priming literature, we will perform tests to inspect for the presence of publication bias and/or p-hacking. If publication bias or p-hacking turns out to be likely, then the statistics of the tests for funnel plot asymmetry can be compared to those of the eight QRP/PB scenarios in the first step of our project, in order to assess which of those scenarios most resembles the literature on currency priming. We expect that the currency priming literature suffers from effect size inflation (because of the MLRP null finding, the small study sample sizes, the diversity of outcome measures and the sudden “hotness” of the field; Ioannidis, 2005). However, it might also be possible that there are real effects present in parts of the


currency priming literature. Because the currency priming manipulation has been used to study a wide variety of dependent variables, we expect to find a large amount of heterogeneity in our meta-analysis. We will conduct meta-regression analyses to inspect for the presence of moderators that could explain such heterogeneity.


Results

Meta-analysis of currency priming studies

We screened 74 potential papers for relevance and excluded 57 because they either did not use a currency priming manipulation or concerned a review or other non-experimental article. A total of 17 published articles met the inclusion criteria of our meta-analysis, containing 41 suitable experiments. Together with the information from 104 unpublished studies, we included 145 experiments in our meta-analysis (due to contact with other researchers, more unpublished studies continue to come in). A list of all included and excluded studies is available on request.

Figures 1 & 2 (next page): Forest plots of published (figure 1) and unpublished (figure 2) currency priming experiments, sorted by precision of effect size estimate

[Figure 1: forest plot of the 41 published currency priming experiments (Hedges' g with 95% CI per experiment); random effects estimate g = 0.71, 95% CI [0.57, 0.85].]

[Figure 2: forest plot of the 104 unpublished currency priming experiments (Hedges' g with 95% CI per experiment); random effects estimate g = 0.10, 95% CI [0.04, 0.16].]


We also conducted separate random effects meta-analyses for both the published and

unpublished work. Figures 1 and 2 show Hedges’ g including a 95% confidence interval for each published and unpublished experiment respectively. The results are sorted according to the precision of the effect size estimate. For each included experiment, we calculated the Hedges’ g effect size for the difference between the money prime and control condition on a given

dependent variable. We found overall significant effects for both the published studies (g = .71, p < .0001, 95%CI [.57, .85]) and unpublished studies (g = .1, p = .0001, 95%CI [.04, .16])

separately, as well as for all studies combined (g = .26, p < .0001, 95%CI [.19, .33]). The Q-test

for heterogeneity of effect sizes is significant (Q(144) = 659.66, p < .0001, τ² = .15), indicating that the included studies do not all estimate a common effect. Although the estimated amount of heterogeneity (τ²) decreases after adding to the random effects model a moderator indicating whether a study is published or not, the Q-test for residual heterogeneity remains significant, indicating that publication status does not explain all heterogeneity present in the data (Qe(143) = 410.34, p < .0001, τ² = .087). It turns out that the moderator publication status

significantly influences the effect size estimate (β =.33, Qm(1)=73.38, p <.0001), with published

studies having a larger estimate than unpublished studies. The funnel plot in figure 3 visualizes this discrepancy in effect size between published and unpublished studies. Compared to the black dots (published studies), the white dots (unpublished studies) lie higher on the y-axis, indicating a lower standard error and therefore higher precision of the effect size estimates in unpublished studies. Furthermore, the white dots are positioned more to the left on the x-axis than the black dots, indicating that unpublished material has an effect size estimate closer to zero than published material. The asymmetry of this funnel plot shows that studies with higher precision have effect size estimates closer to zero. We can test this funnel plot asymmetry statistically by

including the standard error as a moderator in our random effects meta-analysis. It turns out that

the standard error significantly moderates the effect size estimate (βse =3.51, Qm(1)=125.58, p

<.0001), with more precise studies showing lower effect size estimates. This pattern might indicate the presence of publication bias within the currency-priming field.

Figure 3: Funnel plot showing the effect size estimates and standard errors for both published studies (white) and unpublished studies (black) on currency priming.


To further inspect the possibility of publication bias and that of selective analysis

reporting (p-hacking) we can look at the distribution of statistically significant p-values, also called

the p-curve (Simonsohn, Nelson, & Simmons, 2013). This distribution is expected to be right

skewed only for true effects, because a true effect delivers more small significant p-values (e.g. .01) than large significant p-values (e.g. .04). The left side of figure 3 shows the p-curve for all studies on currency priming and the right side shows the curve for the published studies only. The red line shows the observed distribution of significant p-values, the green line shows the expected distribution of significant p-values for a true effect, and the blue line shows the expected

p-value distribution for a null effect. It turns out that the observed p-value distribution does not

match the expected distribution and shows a peak for p-values between .04 and .05, suggesting the presence of p-hacking. We can statistically determine whether the observed p-curve contains evidential value by testing whether it is right skewed. It turns out that we reject the null

hypothesis that the observed p-curve is not right skewed (χ²(146) = 249.82, p < .0001). This implies that the studies in the currency priming literature contain evidential value. Although this test indicates that the observed p-values indeed show a right-skewed distribution, the higher than expected number of p-values just below .03 and .05 hints at the presence of p-hacking.
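In its simplest form, the right-skew test converts each significant p-value to a "pp-value" p/.05, which is uniform between 0 and 1 if there is no true effect, and aggregates these with Fisher's method. A sketch, simplified relative to the full procedure of Simonsohn and colleagues, with p_values a placeholder vector of the p-values in the meta-analysis:

# Simplified p-curve test for right skew (evidential value)
p_sig <- p_values[p_values < .05]           # the statistically significant p-values (placeholder vector)
pp    <- p_sig / .05                        # uniform on (0, 1) under the null of no true effect
chisq <- -2 * sum(log(pp))                  # Fisher's method; large values indicate right skew
df    <- 2 * length(p_sig)
pchisq(chisq, df = df, lower.tail = FALSE)  # small p-value: the p-curve is right skewed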

Figure 3: Observed, expected and null effect p-curve for all (left) and published (right) currency priming studies.

Another test to check whether the data on currency priming contain an excess of significant results is the test by Ioannidis and Trikalinos (2007). This test first calculates the expected number of significant p-values based on the statistical power of each study6 and subsequently compares this expected number to the observed number of significant findings using a chi-square test with 1 degree of freedom. It turns out that for the published data on currency priming there is an excess of significant findings (observed = 39, expected = 29.35, χ²(1) = 11.16, p = .0008). However, this excess disappears after including the unpublished studies on currency priming (observed = 72, expected = 62.46, χ²(1) = 2.56, p = .11). These results imply that there are significantly more published significant results than we would expect given the power of each study on currency priming; the absence of this excess after including unpublished material hints at the presence of publication bias in the currency priming field.

6 For each experiment, statistical power estimates were computed along the lines of Cohen (1988) by entering each study's effect size, sample size and alpha level into the R package “pwr”.
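The test can be sketched as follows, with d_obs, n1, n2 and p_obs standing for per-study effect sizes, group sizes and p-values (placeholder names):

library(pwr)

# Power of each study to detect its effect size in a two-sample t-test at alpha = .05
power_i <- mapply(function(d, n1, n2)
             pwr.t2n.test(n1 = n1, n2 = n2, d = abs(d), sig.level = .05)$power,
           d_obs, n1, n2)

expected <- sum(power_i)                  # expected number of significant studies
observed <- sum(p_obs < .05)              # observed number of significant studies
k        <- length(p_obs)

# Ioannidis & Trikalinos (2007) excess significance statistic, chi-square with 1 df
A <- (observed - expected)^2 / expected + (observed - expected)^2 / (k - expected)
pchisq(A, df = 1, lower.tail = FALSE)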


The possible presence of p-hacking and publication bias in the currency priming literature also becomes clear after adding the studies of the Many Labs Replication Project (MLRP) to the funnel plot of the meta-analysis, as we did in figure 4. This figure shows data from the MLRP, as well as the earlier presented published and unpublished experiments on currency priming. In the figure we distinguished between studies of high and low statistical power. Whether a study has high or low power was determined as follows: the meta-analysis of the published currency priming studies revealed a mean effect size estimate of g = .71. If we assume that this effect size reflects the true state of affairs, we can compute the achieved power of each study based on its sample size and chosen level of statistical significance. Within psychology, a study is considered adequately powered with a power of .80 or higher (Cohen, 1992). Given a standardized mean difference of .71 and a significance level of .05, the required sample size to obtain a power of .80 equals 68.

Figure 4: Funnel plot showing the effect size estimates and standard errors for the high-powered studies (black;

N>68) and the underpowered studies (white; N ≤ 68).

Figure 4 shows the funnel plot of studies on currency priming, where studies with a sample size higher than 68 are colored black and studies with a lower sample size are colored white. It turns out that the low powered studies show a mean Hedges' g estimate of .43 (p < .0001, 95%CI [.35, .50]), while the estimate of the high-powered studies equals 0.04 (p = .0421, 95%CI [.001, .069]). It could be argued that high-powered studies are a better reflection of the true underlying effect size than low powered studies. One reason to support this claim is that more time and resources are spent on larger studies, increasing the chance that they are of high

methodological quality (Sterne, Gavaghan & Egger, 2000). This higher study quality makes it

easier to get published even if the results are negative, making publication bias or questionable research practices superfluous. Furthermore, larger studies are shown to be less subject to bias from p-hacking and publication bias (Bakker et al., 2012). In figure 5 we assumed that the true effect equals the effect size estimate of the high powered studies, which makes it very clear that almost all high-powered studies fall within the shape of the white funnel, while the low powered


studies tend to mostly fall outside of this funnel. Heterogeneity tests indicate that the underpowered studies are heterogeneous (Q(52)=99.62, p<.0001), but this heterogeneity

disappears after including study precision as a moderator (Egger’s test; βse =4.33, Z = 7.4, p

<.0001; Qe(51) = 43.88, p = .75, τ² = 0). These results again imply that publication bias might be

present in the currency priming literature (PB/QRP scenario). However, it might also be possible that other “hidden” moderators explain the heterogeneity in effect sizes of the currency priming literature (Moderator scenario). The next section will investigate the moderating influence of the dependent variable and prime type.

Figure 5: Funnel plot showing the effect size estimates and standard errors for the high-powered studies (black;

N>68) and the underpowered studies (white; N ≤ 68), with the funnel centered on the effect size estimate of the high-powered studies.

Effect size moderators

In a recent narrative review, Vohs and Baumeister (in preparation) discuss several types of dependent variables that are used in the currency priming literature: agency and goal pursuit, focus on self-concern, less interpersonal concern, emotions, values and cognitive factors. Based on this review, we categorized each experiment on currency priming according to these dependent variable types. Within the values category, we distinguished between personal and political values, because they are two different constructs. We also came across a number of studies that looked at the influence of currency priming on purchasing behavior and wealth issues, so we added this category to our list of dependent variable categories. Along the same lines, we classified each experiment according to five different types of currency primes.
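These subgroup analyses amount to meta-regressions with categorical moderators; a sketch with metafor, assuming factor columns prime_type and dv_type in a data frame dat_high of high-powered studies (placeholder names):

library(metafor)

# Omnibus tests of the moderators on the high-powered studies
res_prime <- rma(yi, vi, mods = ~ factor(prime_type), data = dat_high)
res_dv    <- rma(yi, vi, mods = ~ factor(dv_type),    data = dat_high)

# Separate summary effect per dependent variable category (subgroup meta-analyses)
subgroups <- split(dat_high, dat_high$dv_type)
lapply(subgroups, function(d) rma(yi, vi, data = d))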


Figure 6: Summary effects for subgroup meta-analyses of different types of currency primes and dependent variables. For each subgroup analysis the number of studies (N) is shown, together with the Hedges’ g estimate and a 95% confidence interval.

Figure 6 shows the summary effects for the subgroup meta-analyses of the different types of currency primes and dependent variables. We only included studies with high statistical power (sample size higher than 68; see previous section) and we first performed moderator analyses on the random effects model in order to check whether the factors prime type and dependent variable type significantly moderate the currency priming effect size estimate. It turned out that

both prime type (Qm(4)=10.17, p =.0376) and dependent variable type (Qm(8)=51.44, p <.0001)

significantly moderate the effect size estimates of the high powered studies.

Inspection of the upper part of figure 6 shows that the subgroup analyses of the image and descrambling primes show a slight positive effect, while the combination of those primes failed to show a significant effect. Primes in which people were asked to imagine money-related topics also do not show an effect. The only currency prime with a substantial effect is the money play prime, in which people are for instance asked to play a game with money or to engage in some money-related activity, such as counting money.

The lower part of figure 6 reveals that five of the eight subgroup analyses failed to show an effect. Currency priming does not appear to influence dependent variables in the categories agency and goal pursuit, political values, higher self-concern, purchasing behavior and wealth focus, or cognitive factors. Currency priming does show a reasonable effect on dependent variables categorized as measuring personal values or the reduction of interpersonal concern, and it shows the largest effect on dependent variables that measure emotions such as anger.

Creating publication bias and questionable research practices

Thirty-six different teams all over the world collected the MLRP data and a meta-analysis of the

results of those teams showed that the original currency priming effect (Caruso et al., 2013) could

not be replicated (d = .01, p = .83, 99% CI = [-0.06, 0.09]). The 99% confidence interval is narrow and includes d = 0, which makes it safe to assume that the underlying true effect size is very close to zero, at least for the specific paradigm used in this set of studies.

To assess the influence of QRPs on the meta-analysis of these data, we used these data in a bootstrap-like procedure to generate 'new' MLRP data collections which we could submit to the QRP procedures. With replacement, we drew samples from each of the 36 MLRP datasets and subsequently applied the characteristics of each of the eight scenarios (Table 1) to those 36 datasets. We repeated this process 100 times and aggregated the resulting percentages of significant p-values and effect size estimates. Figure 7 summarizes the influence of the eight different scenarios on the percentage of significant p-values (red; upper row) and the Hedges' g effect size estimates (green; lower row). The left column shows scenarios without publication bias and the right column shows scenarios with publication bias. Each of those columns shows varying numbers of questionable research practices.

Figure 7: Percentage of significant p-values and effect size estimates for scenarios with or without publication bias and different amounts of questionable research practices


In the figure, the difference between the darker and lighter colored polygons is that the darker and smaller polygons represent the 95% confidence interval of the mean effect size bias (green) or proportion of false positives (red), while the lighter and longer polygons represent the spread of all 36 observed outcomes in each scenario. For instance, if we look at the Hedges' g estimate for the non-publication bias scenario with zero questionable research practices, then we can see that the light green polygon spreads approximately evenly from g = -.05 to g = .05, which implies that, across the 36 studies in this scenario, some resulted in a mean Hedges' g estimate of .05 and some in -.05, but most lie in the middle of the polygon around a mean estimate of zero. The dark green polygon shows that the 95% confidence interval of the mean Hedges' g aggregated across studies includes zero. This implies that, as expected, without publication bias or

questionable research practices, the effect size estimate does not deviate from the effect size of zero underlying the MLRP data.

If we inspect the figure as a whole, some interesting patterns become visible. First, we can see that if there is no publication bias, the Hedges' g estimate remains close to the true value of zero, regardless of the number of QRPs applied to the data. When, on the other hand, publication bias is present, the Hedges' g estimate is fairly biased even without the use of QRPs (g = .34), but becomes extremely biased (g = .63) as soon as researchers start using one QRP. Interestingly, this extreme bias in effect size arises as soon as researchers use only the single most frequently used QRP: in the presence of publication bias, adding a second correlated dependent variable and reporting only the significant one can already lead to an extreme bias in the effect size estimate. The bias in effect size culminates after the use of three QRPs (g = .73); note that this bias is almost equal to the effect size estimate of the random effects model of the published currency priming studies (g = .71).

Another interesting pattern concerns the percentage of significant p-values. It turns out that this percentage depends strongly on the number of QRPs applied to the data. Without publication bias, the percentage of false positives steadily increases up to 50% when three QRPs are applied to the data. However, if publication bias is present, the percentage of false positives rises to approximately 80% even when only two QRPs are applied to the data.

In line with the results of Bakker et al. (2012), these results imply that the combination of publication bias with the use of QRPs can lead to highly distorted effect sizes and extremely high percentages of false positives. The use of QRPs without publication bias does not seem to lead to effect size bias, but it does still increase the percentage of false positives.

Currency priming revisited

Our application of questionable research practices and publication bias to the (null effect) data of the Many Labs Replication Project showed that the use of QRPs always increases the number of false positive findings, and that if publication bias is also present the effect size estimates can become highly distorted. In what way can we use these findings to explain the results of the meta-analysis on currency priming?

One explanation of the meta-analytic findings was that the literature on currency priming suffers from publication bias and the use of QRPs (PB/QRP scenario). In such a scenario, one could argue that high-powered studies more accurately describe the true effect than underpowered studies, because high-powered studies do not require the use of QRPs to get published and are therefore less susceptible to publication bias. If we only take into account currency priming studies with a power higher than .8, then the currency priming effect size estimate dramatically decreases to g = .04. If we believe the high-powered studies and assume that the true currency priming effect is close to zero, then this raises the question how to explain the mean effect size estimate of the low powered studies (g = .43). It is tempting to conclude that this effect size estimate is – in light of the results of our simulation of QRPs – pure bias resulting from the use of QRPs in lower powered studies, especially given that the majority of psychological researchers admit to having used at least one QRP (John, Loewenstein & Prelec, 2012). Furthermore, figure 8 shows that the low powered studies in the funnel plot of published studies on currency priming resemble the pattern of the MLRP data funnel plot to which we applied three questionable research practices and publication bias. Also, in both cases there is funnel plot asymmetry, because the study standard error significantly moderates the effect size. The inclusion of this moderator made the residual heterogeneity non-significant (figure 8 left panel: Qm(1) = 17.8, p < .0001; Qe(34) = 7.81, p = 1; figure 8 right panel: Qm(1) = 17.8, p = 0.035; Qe(19) = 7.21, p = .993).

Figure 8: Funnel plot example of a scenario with publication bias and three QRPs (left) and funnel plot of published studies on currency priming, categorized according to power (right).

Although these findings provide support for the PB/QRP scenario, the possibility remains that a "hidden" moderator of which we are not yet aware provides a better explanation. A more cautious conclusion would be that we do not know whether, and which, QRPs have been used in studies on currency priming. Furthermore, asymmetric funnel plots may suggest the presence of publication bias, but this asymmetry can also be explained by other factors. For instance, poor methodological design of smaller studies might lead to inflated effects in those studies compared to larger studies (Egger, Smith, Schneider & Minder, 1997). Another source of funnel plot asymmetry could be the presence of true heterogeneity across studies (Sterne et al., 2011). Hence, performing subgroup analyses based on a factor that could explain this heterogeneity might lead to funnel plot symmetry within each subgroup.

To find out whether this applies to the currency priming literature, we subjected each of the

dependent variable and prime type subgroups to a test for funnel plot asymmetry (Sterne &

Egger, 2001; 2005). Figure 9 shows for each subgroup a funnel plot, including the test statistic and p-value of the test for funnel plot asymmetry. Inspection of the figure reveals that some subgroups show funnel plot asymmetry, while others clearly do not. Interestingly, for almost all subgroups with significant effect size estimates (Money image; Descrambling; Money play; Less interpersonal concern; Personal values; see figure 6) we found significant results on the test for funnel plot asymmetry. Only one subgroup with a significant effect size estimate showed a symmetrical funnel plot (Emotion), though the estimate in that subgroup is based on only four studies. Hedges' g Standard Error 0.361 0.271 0.181 0.090 0.000 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −1.00 −0.50 0.00 0.50 1.00 1.50 Hedges' g Standard Error 0.439 0.330 0.220 0.110 0.000 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ●● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● −1.00 −0.50 0.00 0.50 1.00 1.50 ● ● High power Low power


Figure 9: Funnel plots for the dependent variable (upper eight) and prime (lower four) type subgroups of the currency priming literature, including Q- and p-values of the tests for funnel plot asymmetry. [Panels: Agency & goals; Cognitive tests; Higher self concern***; Personal values***; Purchasing & wealth***; Emotion; Less interpersonal concern***; Political values; Descrambling***; Money image***; Descrambling & money image; Money play***]

Discussion

In this study we performed a meta-analysis of studies using a currency priming manipulation. Although the overall effect size estimate of g = .26 was significant, additional tests showed unusual funnel plot patterns and an excess of significant results, hinting at the possibility of publication bias and the use of questionable research practices. This raises the question of whether the findings are better explained by a (hidden) moderator scenario or by the PB/QRP scenario.

Support for the PB/QRP scenario is provided by the finding that, overall, most underpowered studies showed positive and significant effects, while studies with larger sample sizes showed more moderate effects (figure 5). Visual inspection of the distribution of significant p-values within the currency priming literature revealed that there are more p-values between .04 and .05 and fewer p-values below .01 than one would expect if the effect truly exists (figure 3). This result is corroborated by the Ioannidis and Trikalinos test, which showed that the published currency priming literature contains an excess of significant findings.
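To illustrate the logic behind such a test for an excess of significant findings, the sketch below contrasts the observed number of significant results with the number expected given each study’s power. The per-group sample sizes, the observed p-values, and the assumed true effect are made up for illustration, and the implementation is a simplified version of the Ioannidis and Trikalinos approach rather than the exact procedure we used.

```r
# Sketch: a simplified excess-significance check in the spirit of
# the Ioannidis & Trikalinos test. Hypothetical per-group sample sizes
# and observed two-sided p-values.
n_per_group <- c(20, 24, 18, 30, 22, 25, 19, 28)
p_observed  <- c(.03, .04, .02, .01, .045, .03, .04, .20)

# Assume a plausible true effect size (e.g. a meta-analytic estimate)
d_assumed <- 0.25

# Power of each two-sample t-test to detect d_assumed at alpha = .05
power_per_study <- sapply(n_per_group, function(n)
  power.t.test(n = n, delta = d_assumed, sd = 1, sig.level = .05)$power)

k        <- length(p_observed)
observed <- sum(p_observed < .05)   # observed number of significant studies
expected <- sum(power_per_study)    # expected number under d_assumed

# Compare observed and expected counts (chi-square with 1 df)
chi_sq <- (observed - expected)^2 / expected +
          (observed - expected)^2 / (k - expected)
p_excess <- pchisq(chi_sq, df = 1, lower.tail = FALSE)
c(observed = observed, expected = round(expected, 2), p = round(p_excess, 3))
```

In this made-up example nearly all studies are significant despite low expected power, so the test flags an excess of significant findings.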

Our subgroup analysis provides support for the moderator scenario by showing that different types of primes and dependent variables yield different effect size estimates. More specifically, three primes showed a significant effect: the descrambling and image primes showed very small and barely significant effects, while the money play prime showed a large effect. For the dependent variable categories, only emotion, personal values, and less interpersonal concern showed significant effects. However, tests for funnel plot asymmetry showed that all but one of these significant subgroup analyses had asymmetric funnel plots, in which studies with smaller sample sizes showed larger effects than studies with large sample sizes. These asymmetric funnel plots provide support for the PB/QRP scenario in subgroups with a positive effect size estimate. However, the possibility remains that the funnel plot asymmetry in these subgroups can be explained by moderators of which we are not yet aware and that have therefore not been included in our analysis.
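A subgroup comparison of this kind can be framed as a meta-regression with a categorical moderator. The sketch below, again with hypothetical data and the metafor package, shows one possible implementation; it is not the exact model underlying figures 6 and 9.

```r
# Sketch: subgroup analysis via a categorical moderator in metafor.
library(metafor)

dat <- data.frame(
  g     = c(0.55, 0.40, 0.62, 0.05, 0.10, -0.02),
  se    = c(0.22, 0.25, 0.20, 0.10, 0.12, 0.09),
  prime = c("money play", "money play", "money play",
            "descrambling", "descrambling", "descrambling")
)

# Omnibus test of whether effect sizes differ between prime types (QM),
# plus residual heterogeneity within subgroups (QE)
res_sub <- rma(yi = g, sei = se, mods = ~ prime, data = dat)
summary(res_sub)

# Separate random-effects estimates per subgroup
by(dat, dat$prime, function(d) rma(yi = g, sei = se, data = d))
```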

Our application of questionable research practices and publication bias to the (null effect) data of the ManyLabs Replication Project showed that the use of QRP’s increases the number of false positive findings and that, if publication bias is also present, effect size estimates can become highly distorted. These findings illustrate that when a research area suffers from publication bias, effect size estimates can become severely biased as soon as researchers start to use even a single questionable research practice. The large number of underpowered studies in the published currency priming literature, together with the high number of outcomes and the low number of direct replications, makes it more likely that research claims in that field are false (Ioannidis, 2005).
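As an illustration of how such a scenario can be simulated, the sketch below generates two-group experiments under a true null effect, applies one QRP (optional stopping: testing again after every 10 added participants per group) and then publication bias (only significant effects in the predicted direction are retained). The sample sizes, step size, and number of simulations are illustrative choices and simpler than our actual simulation procedure.

```r
# Sketch: optional stopping under a true null effect, followed by
# publication bias (only significant effects in the predicted direction survive).
set.seed(1)

run_study <- function(n_start = 20, n_max = 60, step = 10, alpha = .05) {
  x <- rnorm(n_start); y <- rnorm(n_start)                 # true effect = 0
  repeat {
    test <- t.test(x, y, var.equal = TRUE)
    d <- (mean(x) - mean(y)) / sqrt((var(x) + var(y)) / 2) # Cohen's d (equal n)
    if (test$p.value < alpha || length(x) >= n_max)
      return(c(p = test$p.value, d = d))
    x <- c(x, rnorm(step)); y <- c(y, rnorm(step))         # QRP: add data, retest
  }
}

sims <- t(replicate(2000, run_study()))

# False positive rate with optional stopping (nominal rate is 5%)
mean(sims[, "p"] < .05)

# Publication bias: keep only significant effects in the predicted direction
published <- sims[sims[, "p"] < .05 & sims[, "d"] > 0, , drop = FALSE]
mean(published[, "d"])   # mean 'published' effect size, well above the true value of 0
```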

If we assume that researchers conducting highly powered studies do not need to use QRP’s to get their study published, then those studies provide a more accurate estimate of the underlying true effect size. If we only take into account studies with a power higher than .8, then the currency priming effect size estimate dramatically decreases to g = .04. But can we find any evidence that QRP’s have been used by researchers in the currency priming field? Close reading of the literature for indications that QRP’s of one form or another may have been employed did not reveal any sign of QRP’s. Such signs are, of course, not a necessary consequence of QRP’s, because researchers who apply them do not necessarily report anything about it. If a researcher is, for instance, unaware of the problems arising from sequential testing, then the researcher might not consider this practice deviant and therefore only report the final sample size, without mentioning that significance tests had already been performed at some point(s) during the data gathering.

But is it plausible to assume that researchers are not aware of the unethicality of questionable research practices? The results of a survey filled out by scientists indicate that there is a “rough consensus among researchers about the relative unethicality of the behaviors (QRP’s), but large variation in where researchers draw the line when it comes to their own behavior” (John et al., 2012, p. 527). Furthermore, researchers who admitted to using QRP’s thought that their actions were defensible (John et al., 2012, p. 528). This implies that if QRP’s have been used in the currency priming literature, the researchers involved likely believed that those actions were permissible and therefore did not require any additional reporting.

However, the fact of the matter is that we do not know whether QRP’s have been used in the currency priming literature. One thing we can conclude, however, is that there are some remarkable findings in the meta-analysis on currency priming that call for an explanation. First, this literature shows an excess of significant findings and asymmetric funnel plots, suggesting publication bias. Second, the percentage of significant findings just below .05 is higher than expected, suggesting the use of p-hacking. Third, there is an inconsistency between the high and low powered studies with respect to the effect size estimates. Fourth, underpowered studies found the largest and most significant effects7, even though low power implies a lower chance of finding a significant effect. These findings suggest publication bias, which implies that numerous studies have been conducted that did not find their way into the literature due to a lack of significant effects.

7 We found a positive correlation between study sample size and the reported p-value (r = .21, p = .0053), indicating that p-values closer to zero are associated with smaller sample sizes.


So where did all these underpowered, statistically insignificant studies go? If QRP’s have not been used in the literature, then the asymmetric funnel plot at least suggests that a large number of underpowered studies with null effects should be present in file drawers. We therefore request researchers to contact us if their currency priming study has not been included in our analysis8.

We limited our simulation of QRP’s to three highly prevalent QRP’s according to the survey by John and colleagues (2012). Future research could focus on other QRP’s and investigate their influence on effect size bias and the number of false positives. Another limitation of our QRP simulation is that we did not assess the relative impact of each QRP on the bias in effect size and false positives. Therefore, future researchers could try to compare different QRP’s and rank them according to their biasing effects. Another possibility would be to use different operationalizations of the same QRP, such as adding N = 1 at a time instead of N = 10. This would enable us to determine the severity of each QRP and subsequently design adequate measures to reprimand researchers who have committed a particular QRP.
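For example, the relative severity of two operationalizations of optional stopping could be compared as in the following sketch, in which only the step size of the sequential tests differs; the starting sample size, maximum sample size, and number of simulations are arbitrary choices for illustration.

```r
# Sketch: comparing two operationalizations of the same QRP
# (optional stopping in steps of 1 vs. 10 participants per group),
# again under a true null effect.
set.seed(2)

fp_rate <- function(step, n_start = 20, n_max = 60, alpha = .05, n_sim = 1000) {
  hits <- replicate(n_sim, {
    x <- rnorm(n_start); y <- rnorm(n_start)   # true effect = 0
    sig <- FALSE
    while (!sig && length(x) <= n_max) {
      sig <- t.test(x, y)$p.value < alpha      # peek at the data
      x <- c(x, rnorm(step)); y <- c(y, rnorm(step))
    }
    sig
  })
  mean(hits)
}

# More frequent peeking yields a higher false positive rate
c(step_10 = fp_rate(10), step_1 = fp_rate(1))
```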

In the present study we found a biasing effect of QRP’s and publication bias on effect sizes and false positives. Our findings corroborate the results of simulation studies showing that QRP’s can lead to substantial bias in effect size estimates and to an inflation of false positive findings (Ioannidis, 2005; Simmons et al., 2011; Bakker et al., 2012). Moreover, our study showed that this biasing effect can be produced in real data, in which we could inflate effect size estimates and turn statistically insignificant effects significant. The fact that we could create extreme bias in effect size after simulating publication bias and only one QRP is especially worrisome given the present scientific climate, in which publication bias is common (Bakker et al., 2012; Ferguson & Brannick, 2012) and the majority of researchers admit to having used QRP’s (John et al., 2012).

To minimize the future risk of publication bias and of researchers using QRP’s, we propose that researchers preregister their experiments online before gathering the data, including a detailed analysis plan for each tested hypothesis (Chambers, 2013; Wagenmakers et al, 2012). In this plan, researchers should also clearly distinguish between the exploratory and confirmatory hypotheses that they will investigate. After collecting the data, researchers should summarize their results at this preregistration website, even when the study does not get published. An interesting new development is the recently launched journal Comprehensive Results in Social Psychology, the first social psychology journal to publish only pre-registered research.

Another solution to prevent different types of experimenter bias is to encourage researchers to collaborate with other research teams who have competing hypotheses. Such adversarial collaboration encourages research teams to reach consensus on an optimal research design before gathering the data. The teams can then register their research plan and agree to submit the findings regardless of the results (see for instance Bateman, Kahneman, Munro, Starmer & Sugden, 2005; Matzke et al, submitted).

We further propose that journals should raise their standards for reporting statistical results. Recent estimates suggest that approximately 18% of the statistics in psychology journals contain reporting errors (Bakker & Wicherts, 2011). To raise these standards, journals could make sure that the review panel for each experimental study includes at least one reviewer with methodological and statistical expertise. This process can be facilitated by new software that can automatically extract statistics from each article and recompute the p-values (Epskamp & Nuijten, in preparation). We advise journals to consider publishing non-significant results and replication studies (more often) and to consider publishing the raw data alongside the research article (Wicherts & Bakker, 2012). A good step in the right direction is the recently launched Journal of Open Psychology Data, which publishes data papers of studies that have typically been reported in substantive journals (Wicherts, 2013).

8 We contacted researchers on currency priming by e-mailing the corresponding authors of published articles and asking for unpublished studies. Furthermore, Kathleen Vohs e-mailed her coauthors on currency priming studies to send us any unpublished material. At the moment, some data is still coming in and therefore the results reported in this article are preliminary.
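The automatic check of reported statistics mentioned above essentially amounts to recomputing a p-value from the reported test statistic and degrees of freedom. A minimal sketch with a made-up reported result is shown below; this is not the software referred to above.

```r
# Sketch: recompute a two-sided p-value from a reported test statistic,
# e.g. a result reported as "t(28) = 2.20, p < .05".
reported_t  <- 2.20
reported_df <- 28
2 * pt(abs(reported_t), df = reported_df, lower.tail = FALSE)   # ~ .036

# The same logic applies to an F-test reported as "F(1, 28) = 4.84":
pf(4.84, df1 = 1, df2 = 28, lower.tail = FALSE)                 # ~ .036
```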

Furthermore, we encourage researchers who are designing a replication study to invest in collecting a large sample. Recent evidence suggests that mere replication is not always beneficial (Nuijten, van Assen & Wicherts, in preparation). When the statistical power of a replication study is smaller than the statistical power of the original study, the effect size estimate can become biased if publication bias operates on the replications. This finding highlights the importance of aiming for high-powered replication studies. High-powered studies lead to more precise effect size estimates and have a higher chance of being published, which makes publication bias and QRP’s less of an issue. Therefore, we recommend aiming for high power not only in replication projects, but also in new research directions. One problem, however, is that power calculations could be based on studies that contain biased effect size estimates and hence result in required sample sizes that are too small.

Finally, we propose to minimize the use of QRP’s by educating researchers about the potential biases associated with using them. In this regard, we hope that the present article will aid in paving the way towards a future with fewer questionable research practices, less publication bias, a more transparent scientific climate and hence a higher quality of scientific research in general.

References
