Academic year: 2021

Bayesian Multinomial Test for Inequality Constraints

Alexandra Sarafoglou, Eric-Jan Wagenmakers, Maarten Marsman

Department of Psychology, Psychological Methods, University of Amsterdam, The Netherlands

Correspondence concerning this article should be addressed to: Alexandra Sarafoglou, Department of Psychology, PO Box 15906, 1001 NK Amsterdam, The Netherlands, E-mail: alexandra.sarafoglou@gmail.com.

Abstract

The analysis of categorical data represents a standard procedure across empirical sciences and involves a wide variety of statistical approaches, among others the multinomial test and the χ2 goodness-of-fit test. The current research project promotes the usage of the Bayesian equivalents to these tests and specifically focuses on their extension to inequality-constrained hypotheses. In this paper we propose a novel method to evaluate inequality constraints using bridge sampling. Through simulations and the application to a concrete example we illustrate the performance of the bridge sampling method and compare it to existing methods in the literature, such as the encompassing prior approach and the conditioning method. The bridge sampling method offers several advantages over existing methods, yet our simulations indicate that it does not yet yield stable estimates of the Bayes factor. Possible ways to improve the bridge sampling method are discussed.

Keywords: Bayes factor, Bayesian inference, multinomial test, chi-square test, informative hypotheses, invariance testing, order constraints


Contents

Notation
The Multinomial and Dirichlet Distribution
Bayesian Multinomial Test for Exact Equality Constraints
    Bayesian Reanalysis of Haberman (1978)
    Relation to the Savage-Dickey Density Ratio
    Relation to the χ2 Goodness-of-Fit Test
Bayesian Multinomial Test for Inequality Constraints
    Encompassing Prior Approach
    Conditioning Method
    Bridge Sampling Method
    Sampling from a Truncated Dirichlet Distribution
        (1) Transforming the Dirichlet random variable
        (2) Implementing the Gibbs sampler
        (3) The inverse CDF technique
Simulation Study
    Methods
        Simulation conditions
        Generating the data sets
        Sampling details
    Results
    Discussion
        Autocorrelation and Convergence
        Missing overlap between prior and posterior density
Analyzing Haberman Using Inequality Constrained Hypotheses
General Discussion

Notation

• Data
– N: Total sample size (population)
– x: Vector of observed data
• Parameters and parameter spaces
– θ: Vector of cell probabilities
– α: Vector of Dirichlet parameters, with A = Σ_{k=1}^K α_k
– β: Rate parameter of the Gamma distribution
– ρ: Correlation coefficient
– φ: Angle between two vectors
– µ: Mean
– σ: Standard deviation
• Probability distributions and random variables
– x ∼ Multinomial(N, θ): Multinomial probability distribution
– θ ∼ Dir(α): Dirichlet probability distribution
– Z ∼ Gamma(α, β): Gamma probability distribution
– Y ∼ Uniform(0, 1): Uniform probability distribution
– x1 ∼ Normal(µ, σ): Normal distribution function
• Functions and densities
– p(·): Probability density function
– P(·): Cumulative distribution function
– c: Normalizing constant of density p
– B(·): Beta function
– Γ(·): Gamma function
– b(·): Bridge function
– I(·): Indicator function
• Indices
– j = 1, · · · , J: Index denoting samples from a posterior distribution
– k = 1, · · · , K: Index denoting categories
– m = 1, · · · , M: Index denoting order constraints
– t = 1, · · · , T: Index denoting iterations
• Parameter spaces
– R: Restricted parameter space of a given density
– Ω: Parameter space of a random variable
• Other
– H: Hypothesis
– BF: Bayes factor
– E: Expected value
– U: Upper bound
– L: Lower bound

The analysis of categorical data has a long history and can be traced back to some of the most influential statisticians: Karl Pearson and Sir Ronald Fisher. In 1900, Pearson first introduced the χ2-statistic and thus the initial versions of what are now known as the multinomial test and the χ2 goodness-of-fit test. The multinomial test checks whether the cell probabilities are equally distributed, and the χ2 goodness-of-fit test examines whether observed and expected cell probabilities deviate from each other.

However, it turned out that Pearson's calculation of the degrees of freedom for the family of χ2 distributions was incorrect; the adjusted calculation was given 22 years later by Fisher (1922), which led to a series of heated debates between the two statisticians.1 In addition to the standard usage of the χ2 goodness-of-fit test, both Pearson and Fisher used the statistic to reveal forged data. It is alleged that one of the motivating forces in developing the multinomial test was Pearson's drive to check the fairness of roulette spins in Monte Carlo (Agresti, 2003). In 1936, Fisher demonstrated that data from Gregor Mendel's famous experiments on biological inheritance were probably forged, since the data met the corresponding expectations too well.

Due to its influence on modern statistics, Cochran (1952) commented that “[Pearson's] remarkable paper is one of the foundations of modern statistics” (p. 316). Today, the multinomial test and the χ2 goodness-of-fit test have long become standard procedures and are of interest to all empirical disciplines. For instance, in biology and ecology researchers investigate the observed versus expected occurrence of certain species, e.g., leopards, in a specific habitat (Tan et al., 2017; Dice, 1945). In marketing and business research the purchasing behaviors of consumers are investigated (Mittelman & Andrade, 2017), as are the emotions of colleagues towards each other (Ford, Agosta, Huang, & Shannon, 2017). In the political sciences, researchers investigate the perceptions and opinions of gender equality among men and women (Allen & Savigny, 2016). Public health research is interested in the oral health of elderly people (Hoeksema et al., 2017). Clinical psychologists are interested in the question whether medication encourages the abuse of certain drugs in teenagers with ADHD (Hammerness, Petty, Faraone, & Biederman, 2017).

For an in-depth description, let us consider the study conducted by Uhlenhuth, Lipman, Balter, and Stern (1974). To investigate the unreliability of retrospective surveys, Uhlenhuth et al. (1974) asked 735 participants in the context of an urban health survey to indicate, from a list of negative life events, life stresses, and illnesses, which events they had experienced and in which of the 18 months before the interview they had occurred. A subset of this data

1For a detailed historical overview of Pearson's and Fisher's contributions and debates the interested reader is referred to Fienberg and Hinkley (1980) and Agresti (2003).

set was reanalyzed by Haberman (1978, p. 3). Haberman investigated the 147 participants who reported only one negative life event over this time span and tested whether the frequency of the reported events was equally distributed over the 18-month period.

Figure 1 illustrates the proportions of reported negative life events for Haberman's sample. To test this hypothesis, Haberman (1978) conducted a multinomial test and rejected the null hypothesis that the proportion of reports of negative life events was equally distributed over the 18 months. Moreover, the data indicate that participants reported fewer negative life events the further the corresponding month lay in the past. The study represents a critical reflection on retrospective surveys, since it suggests that people do not reliably remember or selectively report negative life events.

The frequentist approach to testing categorical data is well developed, yet the use of p-values in null hypothesis significance testing (NHST) is subject to numerous critiques (for an overview, see for example Wagenmakers, 2007, and Wagenmakers et al., in press). Here, we outline a concrete and practical alternative to frequentist hypothesis testing: hypothesis testing using Bayes factors (Jeffreys, 1935, 1961; Kass & Raftery, 1995).

To illustrate the Bayesian methods on a concrete example, Haberman's (1978) research question will remain the working example throughout this paper. We will conduct a Bayesian reanalysis to answer the research question whether the proportion of reported negative life events changed over the 18-month period. The main focus of this research project is the extension of the Bayesian multinomial test to inequality-constrained hypotheses. To that end, we will test the order-constrained hypothesis that the report of negative life events decreases over time. The R code to conduct the analyses as well as the data are available online on the Open Science Framework (OSF) at: https://osf.io/59tce/.

This paper gives a brief introduction to the Bayesian multinomial test, in which the focal point lies on the introduction of a novel approach to evaluate inequality-constrained hypotheses. The outline of this paper is as follows. The first section briefly describes the multinomial and the Dirichlet distribution, which are fundamental for the statistical approaches dealing with categorical data. The second section defines the Bayesian equivalent to the multinomial test for exact equality constraints and explains its relationship to the χ2 goodness-of-fit test. The third section extends the multinomial test to inequality-constrained hypotheses and gives an overview of the existing methods that have been proposed in the literature, i.e., the encompassing prior approach (Klugkist, Kato, & Hoijtink, 2005) and the conditioning method (Mulder et al., 2009). Moreover, a novel approach to test inequality constraints using the bridge sampling method is proposed (Bennett, 1976; Meng & Wong, 1996). The fourth section provides a simulation study to explore the performance of the three methods for inequality-constrained hypotheses. The fifth section demonstrates the application of these methods to our working example and is followed by a general discussion.

Figure 1. This figure displays the proportion of reported negative life events over the course of the 18 months prior to the interview for Haberman's (1978) sample. (Axes: Months Before Interview; Participants reporting a negative life event, in %.)

The Multinomial and Dirichlet Distribution

This section introduces the multinomial and Dirichlet distribution, which are of importance for constructing the Bayesian equivalent to the multinomial test. In Haberman's (1978) example, the question in which month a participant had experienced a negative life event has 18 possible answer options, one for each month, and results in count data. As such it can be described in terms of a multinomial distribution. The multinomial distribution allows us to model the proportions of counts for reporting a negative life event within the 18-month period.

Let X = (X1, X2, · · · , XK) be the multinomial random variable “Month in which a life stress occurred”–described as X ∼ Multinomial(N, θ)–with K = 18 categories. Let N = Σ_{k=1}^K x_k = 147 denote the total number of reports and θ_k denote the probability of reporting a negative life event in month k, where Σ_{k=1}^K θ_k = 1 with θ_k ∈ (0, 1). The probability mass function (PMF) of the multinomial distribution is given by:

p(x | θ) = [Γ(Σ_{k=1}^K x_k + 1) / ∏_{k=1}^K Γ(x_k + 1)] × ∏_{k=1}^K θ_k^{x_k}. (1)

Note that the PMF of the multinomial distribution, viewed as a function of θ, is the likelihood function of the data vector x given the parameter vector θ.

To estimate the parameters of interest–i.e., the cell probabilities of the observed categories–the Bayesian framework incorporates the researcher's existing prior beliefs about the parameter estimates, and the uncertainty therein, by means of a prior probability distribution (Wetzels, van Ravenzwaaij, & Wagenmakers, 2015; Wagenmakers et al., in press). Here we will use the Dirichlet distribution to capture our prior beliefs about the proportions of counts θ for reporting a negative life event. The choice of the Dirichlet distribution is justified by its beneficial attributes. First, the Dirichlet distribution is conjugate to the multinomial distribution, which means that the prior and posterior come from the same distributional family. Second, the Dirichlet distribution allows for an intuitive interpretation of the parameter vector α; it can be seen as the vector of prior counts or prior observations in each category. Therefore, the Dirichlet distribution is often considered a suitable prior distribution for the parameters θ_k (e.g., Albert & Denis, 2011). The prior

density of the Dirichlet distribution is:

p(θ) = [1 / B(α)] × ∏_{k=1}^K θ_k^{α_k − 1}, (2)

where α = (α1, α2, · · · , αK) denotes the vector of prior counts, and all α_k are assumed to be non-negative.2 The normalizing constant B(α) is the multivariate Beta function, which is expressed in terms of gamma functions (the factorial form below holds for integer-valued α_k):

B(α) = ∏_{k=1}^K Γ(α_k) / Γ(Σ_{k=1}^K α_k) = [(α_1 − 1)! × (α_2 − 1)! × · · · × (α_K − 1)!] / (α_1 + α_2 + · · · + α_K − 1)!

With Bayes’ rule, we can update the prior probability density function with the data to obtain the posterior density:

p(θ | x) = p(x | θ) p(θ) / p(x). (3)

2Note that when K = 2, the Dirichlet distribution corresponds to the Beta distribution: θ_1 ∼ Beta(α_1, α_2).

In the case of the Dirichlet prior for the multinomial parameter, we obtain a closed-form solution for the posterior distribution. In general, if X ∼ Multinomial(N, θ) and θ ∼ Dir(α), then θ | x ∼ Dir(x + α), which is given by:

p(θ | x) = [1 / B(x + α)] × ∏_{k=1}^K θ_k^{x_k + α_k − 1}. (4)

Lastly, we define the marginal likelihood of the data vector x. The marginal likelihood is the weighted average of the likelihood over the prior:

p(x) = ∫ p(x | θ) p(θ) dθ = [Γ(Σ_{k=1}^K x_k + 1) / ∏_{k=1}^K Γ(x_k + 1)] × B(x + α) / B(α), (5)

with θ being the vector containing the model parameters. This distribution is also known as the Dirichlet-multinomial distribution, which was first introduced by Pólya (1930) in the context of an urn problem; for this reason the Dirichlet-multinomial distribution is also known as the multivariate Pólya distribution. Pólya also showed that the marginal distribution of an individual parameter θ_k is a Beta distribution (Pólya, 1930, p. 150):

θ_k ∼ Beta(α_k, A − α_k),

where A is defined as the sum over the parameter vector α: A = Σ_{k=1}^K α_k.
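Pólya's marginal result can be checked by simulation. The sketch below (Python with the standard library only; the paper's own analysis code is in R on the OSF) samples from an illustrative Dir(α) with hypothetical parameters α = (2, 3, 5) and compares the empirical moments of θ_1 with those implied by Beta(α_1, A − α_1):

```python
import random

random.seed(1)

def rdirichlet(alpha):
    """One draw from Dir(alpha) via normalized Gamma(alpha_k, 1) variates."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]

alpha = [2.0, 3.0, 5.0]   # hypothetical prior counts, for illustration only
A = sum(alpha)            # A = sum of the Dirichlet parameters

draws = [rdirichlet(alpha)[0] for _ in range(100_000)]
emp_mean = sum(draws) / len(draws)
emp_var = sum((d - emp_mean) ** 2 for d in draws) / len(draws)

# Moments implied by theta_1 ~ Beta(alpha_1, A - alpha_1)
beta_mean = alpha[0] / A                                    # 0.2
beta_var = alpha[0] * (A - alpha[0]) / (A ** 2 * (A + 1))   # ~0.0145
```

The empirical mean and variance agree with the Beta moments up to Monte Carlo error, which is what the conjugacy argument predicts.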

Bayesian Multinomial Test for Exact Equality Constraints

Consider any two hypotheses H1 and H2 and a data vector x. The Bayes factor can be seen as the change from prior model odds–that is, the ratio of the prior probabilities of both hypotheses–to posterior model odds (Kass & Raftery, 1995), and as such quantifies the evidence provided by the data x for one hypothesis versus the other:

p(H1 | x) / p(H2 | x) = p(H1) / p(H2) × p(x | H1) / p(x | H2), (6)

where the left-hand side is the posterior odds, the first factor on the right the prior odds, and the second factor the Bayes factor.

The Bayes factor describes the predictive adequacy of two competing hypotheses, i.e., competing statistical models, on a continuous scale. In Bayesian inference, it is considered “the standard Bayesian solution to the hypothesis testing and model selection problems” (Lewis & Raftery, 1997, p. 648). Table 1 summarizes common interpretations of the Bayes factor. As apparent from Equation 6, the Bayes factor, denoted as BF, for H1 versus H2 corresponds to the ratio of the marginal likelihoods of the competing hypotheses of interest:

BF12 = p(x | H1) / p(x | H2). (7)

Table 1
Commonly used interpretation categories of the Bayes factor BF12.1

Bayes Factor BF12    Interpretation
> 100                Extreme evidence for H1
30 – 100             Very strong evidence for H1
10 – 30              Strong evidence for H1
3 – 10               Moderate evidence for H1
1 – 3                Anecdotal evidence for H1
1                    No evidence
1/3 – 1              Anecdotal evidence for H2
1/10 – 1/3           Moderate evidence for H2
1/30 – 1/10          Strong evidence for H2
1/100 – 1/30         Very strong evidence for H2
< 1/100              Extreme evidence for H2

Note. BF12 = 1/BF21. 1 Obtained from Lee and Wagenmakers (2014, p. 105).
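The categories of Table 1 can be encoded as a small helper function (a Python sketch; the table leaves the boundary convention at the exact cutpoints unspecified, so the strict inequalities below are a choice made here, not fixed by the source):

```python
def interpret_bf(bf12):
    """Map a Bayes factor BF12 to the evidence label of Table 1
    (Lee & Wagenmakers, 2014). Boundary handling at the exact
    cutpoints is a convention chosen here."""
    if bf12 == 1:
        return "No evidence"
    if bf12 < 1:
        # Evidence for H2 mirrors evidence for H1 at the reciprocal (BF12 = 1/BF21)
        return interpret_bf(1 / bf12).replace("H1", "H2")
    for cutoff, label in [(100, "Extreme"), (30, "Very strong"),
                          (10, "Strong"), (3, "Moderate"), (1, "Anecdotal")]:
        if bf12 > cutoff:
            return f"{label} evidence for H1"

print(interpret_bf(27.1))   # Strong evidence for H1
```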

Bayesian Reanalysis of Haberman (1978)

For concreteness, we will demonstrate the Bayesian multinomial test for exact equality constraints by reanalyzing the research question of Haberman (1978). For comparison, we will also report the original p-value obtained from null hypothesis significance testing.

The null hypothesis in Haberman's (1978) research question entails that the probability of reporting a negative life event is equally distributed over the 18 months prior to the interview. In particular, we are interested in the hypothesis

H0 : θ1 = θ2 = · · · = θK = 1/K.

This restriction is referred to as an exact equality constraint, which means that we assign exact probabilities to each of the categories. H0 will be tested against an encompassing hypothesis

He, which states that all parameters are free to vary: He : θ ∼ Dir(α).

The frequentist multinomial test yields χ2(17) = 45.367, p < .001, which means that we can reject the null hypothesis that the cell probabilities are equally distributed over the individual months. However, the frequentist test cannot quantify how much we have shifted our beliefs based on the data.
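The frequentist statistic can be reproduced directly from the observed counts (a Python sketch; the paper's analyses themselves are implemented in R on the OSF):

```python
# Observed counts of reported negative life events per month (Haberman, 1978)
x = [15, 11, 14, 17, 5, 11, 10, 4, 8, 10, 7, 9, 11, 3, 6, 1, 1, 4]
N = sum(x)          # 147 reports in total
K = len(x)          # 18 months
expected = N / K    # 147/18 ~ 8.17 reports per month under H0

# Pearson's chi-square statistic, df = K - 1 = 17
chi2 = sum((obs - expected) ** 2 / expected for obs in x)
print(round(chi2, 3))   # 45.367
```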

When testing exact equality constraints, the Bayes factor for He versus H0 can be obtained analytically. By assuming that every parameter value is equally likely before we see any data, we can capture this information using a uniform prior distribution across the parameter vector θ, such that θ | He ∼ Dir(α) with all parameter values in α set to 1. Using the observed frequencies from Haberman (1978),

x = (15, 11, 14, 17, 5, 11, 10, 4, 8, 10, 7, 9, 11, 3, 6, 1, 1, 4)ᵀ,

we can obtain the marginal likelihood of He via Equation 5:

p(x | He) = ∫ p(x | θ) p(θ | He) dθ (8)
= [Γ(Σ_{k=1}^K x_k + 1) / ∏_{k=1}^K Γ(x_k + 1)] × B(x + α) / B(α)
= [Γ(147 + 1) / (Γ(15 + 1) × Γ(11 + 1) × · · · × Γ(4 + 1))] × [Γ(15 + 1) × Γ(11 + 1) × · · · × Γ(4 + 1)] × Γ(18) / Γ(147 + 18)
= Γ(147 + 1) × Γ(18) / Γ(147 + 18)
= 1.87 × 10⁻²³.

Similarly, we obtain the marginal likelihood of H0 by:

p(x | H0) = p(x | θ = θ0) p(θ = θ0 | H0) (9)
= [Γ(147 + 1) / (Γ(15 + 1) × · · · × Γ(4 + 1))] × (1/18)^15 × (1/18)^11 × · · · × (1/18)^4 × 1
= 6.89 × 10⁻²⁵.

By calculating the ratio of marginal likelihoods, we obtain the Bayes factor in favor of the encompassing hypothesis:

BFe0 = p(x | He) / p(x | H0) = (1.87 × 10⁻²³) / (6.89 × 10⁻²⁵) = 27.1.

A Bayes factor of 27.1 indicates that the data are about 27 times more likely to have occurred under He than under H0. Considering Table 1, we have collected strong evidence in favor of the encompassing hypothesis He, which means that we can confirm Haberman's (1978) results.
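Both marginal likelihoods and the resulting Bayes factor can be computed on the log scale to avoid numerical underflow (a Python sketch of Equations 5, 8, and 9; the paper's own code is in R on the OSF):

```python
from math import exp, lgamma, log

x = [15, 11, 14, 17, 5, 11, 10, 4, 8, 10, 7, 9, 11, 3, 6, 1, 1, 4]
K = len(x)        # 18 months
N = sum(x)        # 147 reports
alpha = [1] * K   # uniform Dir(1, ..., 1) prior under He

def log_multivariate_beta(a):
    """log B(a), the log of the multivariate Beta function."""
    return sum(lgamma(v) for v in a) - lgamma(sum(a))

# Equation 8: Dirichlet-multinomial marginal likelihood under He
log_m_e = (lgamma(N + 1) - sum(lgamma(v + 1) for v in x)
           + log_multivariate_beta([xi + ai for xi, ai in zip(x, alpha)])
           - log_multivariate_beta(alpha))

# Equation 9: multinomial likelihood at theta_k = 1/K under H0
log_m_0 = lgamma(N + 1) - sum(lgamma(v + 1) for v in x) + N * log(1 / K)

bf_e0 = exp(log_m_e - log_m_0)
print(round(bf_e0, 1))   # 27.1
```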

Relation to the Savage-Dickey Density Ratio

For exact equality constraints, the Bayes factor of He versus H0 can alternatively be obtained by dividing the height of the prior density of He at the point of interest θ0 by the height of its posterior density at θ0. The resulting ratio is generally known as the Savage-Dickey density ratio (Dickey, 1971; Dickey & Lientz, 1970; Wagenmakers, Lodewyckx, Kuriyal, & Grasman, 2010).

We can derive the Savage-Dickey density ratio as follows (see also O’Hagan & Forster, 2004, pp. 174–177, and Wagenmakers et al., 2010, pp. 183–184). The Bayes factor is defined as the ratio of marginal likelihoods:

BFe0 = p(x | He) / p(x | H0) = ∫ p(x | θ) p(θ | He) dθ / ∫ p(x | θ) p(θ | H0) dθ.

Under H0, all prior mass for θ is concentrated on the value of specific interest θ0, so the marginal likelihood under H0 reduces to the likelihood evaluated at that point:

p(x | H0) = ∫ p(x | θ) p(θ | H0) dθ = p(x | θ = θ0).

We are thus left with the likelihood function at the point of interest, p(x | θ = θ0), which we can reformulate by applying Bayes’ rule under He:

p(θ = θ0 | x, He) = p(x | θ = θ0) p(θ = θ0 | He) / p(x | He),

so that, solving for the likelihood,

p(x | θ = θ0) = p(x | He) p(θ = θ0 | x, He) / p(θ = θ0 | He).

Hence, we can express p(x | H0) as:

p(x | H0) = p(x | θ = θ0) = p(x | He) × p(θ = θ0 | x, He) / p(θ = θ0 | He),

where p(θ = θ0 | He) and p(θ = θ0 | x, He) are the prior and posterior density under He evaluated at the point of interest θ0. The Savage-Dickey density ratio is then obtained by:

BFe0 = p(x | He) / p(x | H0) = p(x | He) / [p(x | He) × p(θ = θ0 | x, He) / p(θ = θ0 | He)] = p(θ = θ0 | He) / p(θ = θ0 | x, He). (10)

To demonstrate the correspondence between the Savage-Dickey density ratio and the calculation of the ratio of marginal likelihoods, we will apply the method to our working example. In the present situation our null hypothesis specifies exact values for the probabilities of reporting a negative life event. These values represent our specific point of interest:

θ0 : θ1 = θ2 = · · · = θK = 1/K.

The Bayes factor BFe0 is given by the ratio of the heights of the prior and the posterior distribution of θ under He at θ = θ0:

BFe0 = p(θ = θ0 | He) / p(θ = θ0 | x, He).

Substituting the Dirichlet prior and posterior densities from Equations 2 and 4 and evaluating them at θ0 yields

BFe0 = [ (1/B(α)) ∏_{k=1}^K θ_{0,k}^{α_k − 1} ] / [ (1/B(x + α)) ∏_{k=1}^K θ_{0,k}^{x_k + α_k − 1} ] = 27.1,

which illustrates the equivalence of the Savage-Dickey density ratio and the calculation of the ratio of marginal likelihoods.
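The same number follows numerically by evaluating the Dirichlet prior and posterior log densities at θ0 (a Python sketch of Equation 10):

```python
from math import exp, lgamma, log

x = [15, 11, 14, 17, 5, 11, 10, 4, 8, 10, 7, 9, 11, 3, 6, 1, 1, 4]
K = len(x)
theta0 = [1 / K] * K               # point of interest: all probabilities 1/K

def log_dirichlet_pdf(theta, alpha):
    """log density of Dir(alpha) evaluated at theta."""
    log_b = sum(lgamma(a) for a in alpha) - lgamma(sum(alpha))
    return sum((a - 1) * log(t) for a, t in zip(alpha, theta)) - log_b

prior_alpha = [1] * K              # uniform prior under He
post_alpha = [xi + 1 for xi in x]  # conjugate posterior Dir(x + alpha)

# Equation 10: prior height over posterior height, both at theta0
bf_e0 = exp(log_dirichlet_pdf(theta0, prior_alpha)
            - log_dirichlet_pdf(theta0, post_alpha))
print(round(bf_e0, 1))   # 27.1, matching the marginal-likelihood ratio
```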

Relation to the χ2 Goodness-of-Fit Test

It should be noted that the multinomial test can be seen as a special case of the χ2 goodness-of-fit test. The χ2 goodness-of-fit test investigates whether an observed distribution of a categorical variable corresponds to expected proportions. Since H0 can be seen as testing the parameter vector θ against specific values of interest–or, in other words, expected proportions–these two tests are equivalent. In particular, the vector containing the expected proportions θ0 can take any values as long as the restriction Σ_{k=1}^K θ_k = 1 holds. The Bayes factor can then be obtained by calculating the Savage-Dickey density ratio from Equation 10.

Bayesian Multinomial Test for Inequality Constraints

In practice, the expectations of researchers are often formulated as so-called inequality constraints, in which we stipulate a certain order or trend on the parameters θ_k (Hoijtink, 2011). Expectations for a certain ordering in the data become most apparent in the context of replication research, where we want to test whether the data of the replication study show the same trend as the original study (e.g., Verhagen & Wagenmakers, 2014). For instance, a replication hypothesis that could arise from the Haberman (1978) data is that the proportion of reports of a negative life event decreases the further back in time the month lies. A Bayesian approach to testing order-constrained hypotheses for the multinomial test was first applied by Sedransk, Monahan, and Chiu (1985) and has since inspired researchers to use this approach in different areas of application (Klugkist, Laudy, & Hoijtink, 2010). One such area is the evaluation of axioms in measurement theory. Axiomatic theories, for instance in decision-making theory or item response theory (IRT), aim to express potential empirical laws. However, these axioms are formulated deterministically,


which means that they do not take into account the actual behavior of the participants or measurement error (Falmagne, 1976). By using probabilistic frameworks like the Bayesian framework we can account for the uncertainties in the measurements. For instance, one of the key assumptions of IRT for dichotomous items is latent monotonicity (Grayson, 1988). This assumption states that the probability of solving an item increases with the ability of the person. By translating this assumption into an order-constrained hypothesis, researchers were able to test these theories within the Bayesian framework (see e.g., Karabatsos, 2001; Karabatsos & Sheu, 2004; Myung, Karabatsos, & Iverson, 2005; Tijmstra, Hoijtink, & Sijtsma, 2015).3

Unlike the multinomial test for exact equality constraints, the prior and posterior density of an order-restricted hypothesis Hr are not available in closed form. However, we can define these densities with respect to He. Following this idea, the prior density of the order-restricted hypothesis Hr can be obtained by restricting the parameter space of θ under He. This can be done by incorporating an indicator function for Hr that takes the value 1 if the parameter values are in accordance with the constraint and 0 otherwise. Consequently, the prior distribution for Hr takes the following form:

p(θ | Hr) = p(θ | He) I_R(θ) / ∫_{θ∈R} p(θ | He) dθ = p(θ | He) I_R(θ) / p(θ ∈ R | He), (11)

where I_R(θ) denotes the indicator function of Hr and R denotes the restricted parameter space. The posterior distribution of θ under Hr can be set up in a similar way:

p(θ | x, Hr) = p(θ | x, He) I_R(θ) / ∫_{θ∈R} p(θ | x, He) dθ = p(θ | x, He) I_R(θ) / p(θ ∈ R | x, He). (12)

By restricting the parameter space of He to the values that obey the order constraints we can express the Bayes factor as:

BFer = p(x | He) / p(x | Hr).

3Not only for categorical data: statistical procedures have also been developed to evaluate inequality-constrained hypotheses within the Bayesian framework for other designs, for instance repeated measurement designs (Mulder et al., 2009), contingency tables (Klugkist et al., 2010), and t-tests (Morey & Wagenmakers, 2014).

For any fixed θ, Equation 3 allows us to write the marginal likelihood as p(x | H) = p(x | θ) p(θ | H) / p(θ | x, H). Applying this to both hypotheses yields

BFer = p(x | He) / p(x | Hr) = [p(x | θ) p(θ | He) / p(θ | x, He)] / [p(x | θ) p(θ | Hr) / p(θ | x, Hr)] = [p(θ | He) p(θ | x, Hr)] / [p(θ | x, He) p(θ | Hr)].

Klugkist et al. (2005) have shown that, by formulating the prior and posterior density under Hr in terms of He in this way, the Bayes factor BFer is obtained by integrating over the restricted parameter space of the prior and posterior density under He. Substituting Equations 11 and 12 gives:

BFer = [p(θ | He) p(θ | x, Hr)] / [p(θ | x, He) p(θ | Hr)]
= [p(θ | He) × p(θ | x, He) I_R(θ) / p(θ ∈ R | x, He)] / [p(θ | x, He) × p(θ | He) I_R(θ) / p(θ ∈ R | He)]
= p(θ ∈ R | He) / p(θ ∈ R | x, He). (13)

Note that the Bayes factor again represents a ratio of marginal likelihoods, namely the marginal likelihoods of the restricted prior and posterior densities displayed in Equations 11 and 12. However, the ratio in Equation 13 cannot be evaluated analytically. The following section introduces sampling-based methods that have been developed to approximate the Bayes factor in Equation 13 for categorical variables. These methods are the encompassing prior (EP) approach proposed by Klugkist et al. (2005) and the conditioning method by Mulder and colleagues (Mulder et al., 2009; Mulder, Hoijtink, & Klugkist, 2010). Afterwards, we will propose an alternative to the existing methods to estimate the Bayes factor: the bridge sampling method proposed by Bennett (1976) and Meng and Wong (1996).

Encompassing Prior Approach

The EP approach by Klugkist et al. (2005) is intuitive and easy to implement; Klugkist et al. (2005) argue that we can approximate the ratio in Equation 13 by drawing MCMC samples from the prior and posterior distribution of θ under He, and only considering the samples that fall into the restricted area θ ∈ R. The Bayes factor is then defined as the ratio of the prior and posterior mass of θ under He that is consistent with the order constraint (Hoijtink, 2011):

BF̂er = [ (1/I) Σ_{i=1}^I I_R(θ_i | He) ] / [ (1/J) Σ_{j=1}^J I_R(θ_j | x, He) ], (14)

where I denotes the total number of MCMC samples drawn from the prior distribution of θ under He and J denotes the total number of MCMC samples drawn from the posterior distribution of θ under He. In addition to inequality constraints, the EP method also allows for exact equality constraints by letting the constrained area converge to the point of interest, θ ∈ R → θ0. The EP approach for exact equality constraints then represents a generalization of the Savage-Dickey density ratio (see e.g., Hoijtink, 2011; Wetzels, Grasman, & Wagenmakers, 2010).

The EP approach delivers robust estimates if the observed data contain a small number of categories. In addition, the EP approach only requires samples from the encompassing hypothesis, which are in general easy to obtain. However, for a medium or large number of categories the restricted parameter space becomes very small. As a result, it becomes difficult to obtain enough samples that are in accordance with the order constraint to effectively estimate the prior and posterior distribution under Hr (Mulder et al., 2009). This results in unstable estimates and makes the encompassing prior approach computationally very costly for problems including a large number of categories.
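The estimator of Equation 14 can be sketched for a small illustrative case: K = 3 categories, the order constraint θ1 < θ2 < θ3, and hypothetical counts x = (5, 10, 20) chosen so the data roughly follow the constraint (a Python sketch; direct i.i.d. sampling replaces MCMC here because both Dirichlet distributions can be sampled exactly). Under the uniform Dir(1, 1, 1) prior, the prior mass of the constraint is exactly 1/3! = 1/6:

```python
import random

random.seed(2)

def rdirichlet(alpha):
    """One draw from Dir(alpha) via normalized Gamma variates."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]

def in_R(theta):
    """Indicator I_R(theta) for the order constraint theta_1 < theta_2 < theta_3."""
    return theta[0] < theta[1] < theta[2]

x = [5, 10, 20]       # hypothetical counts, for illustration only
alpha = [1, 1, 1]     # uniform encompassing prior
n = 100_000

# Equation 14: prior and posterior mass of theta that obeys the constraint
prior_mass = sum(in_R(rdirichlet(alpha)) for _ in range(n)) / n
post_mass = sum(in_R(rdirichlet([xi + ai for xi, ai in zip(x, alpha)]))
                for _ in range(n)) / n

bf_er = prior_mass / post_mass   # BF_er = P(R | He) / P(R | x, He)
bf_re = 1 / bf_er                # evidence in favor of the constrained hypothesis
```

For K = 18 with a full ordering, the prior mass would be 1/18! ≈ 1.6 × 10⁻¹⁶, which illustrates why the EP approach breaks down for many categories.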

Conditioning Method

Mulder et al. (2009, 2010) proposed an alternative method which yields stable estimates of the Bayes factor even for models with a higher number of categories. Let M denote the total number of order constraints under the restricted hypothesis (with m = 1, 2, · · · , M) and let Rm denote the mth order constraint. Instead of evaluating all order constraints at once, the authors propose to successively evaluate every order constraint Rm individually and condition on all previously evaluated constraints R1, · · · , Rm−1.

For instance, the Bayes factor for the encompassing hypothesis versus the hypothesis that imposes only the first order constraint, BFe,R1, is estimated by evaluating the ratio of the prior and posterior mass of θ under He which is in accordance with the first order constraint R1. Here we require only samples from the prior and posterior distribution under He. To compute this ratio we use a slightly adjusted version of Equation 14:

BF̂e,R1 = [ (1/I) Σ_{i=1}^I I_{R1}(θ_i | He) ] / [ (1/J) Σ_{j=1}^J I_{R1}(θ_j | x, He) ].

The second Bayes factor conditions on the previously evaluated restriction R1 and thus evaluates the hypothesis that the first order constraint holds versus the hypothesis that the first and second order constraints hold, BFR1,R2|R1. This means we require samples from a truncated distribution which imposes the first order constraint R1 but leaves the other parameters free to vary. Then we again evaluate the prior and posterior mass of θ that is in accordance with the second order constraint R2.4 This procedure is repeated until we have calculated a Bayes factor for every order constraint. Note that when assuming an uninformative prior distribution, the prior mass of θ that obeys the mth order restriction simplifies to 1/(m + 1), which substantially reduces the computational cost of drawing prior samples for each imposed restriction. By evaluating only one order constraint at a time the conditioning method ensures a sufficient number of samples that obey the investigated constraint and therefore yields stable estimates of the individual Bayes factors. Because Bayes factors are transitive (see e.g., Morey & Rouder, 2011), the Bayes factor of the encompassing versus the restricted hypothesis, BFer, is then obtained by simply multiplying all individually obtained Bayes factors:

BF̂er = BF̂e,R1 × BF̂R1,R2|R1 × · · · × BF̂RM−1,RM|R1,···,RM−1.  (15)

Although the conditioning method yields estimates that are more stable than the EP approach, it comes with a considerable increase in computational cost, since it requires posterior samples for each order constraint. Especially when evaluating a large number of order constraints, the sampling procedure is very time-consuming.
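As a hedged sketch of the conditioning logic, the following Python fragment (illustrative counts only; for brevity it conditions by rejecting samples, whereas the method described above samples the truncated distributions directly) estimates two stepwise Bayes factors for a small Dirichlet model and combines them via transitivity:

```python
import numpy as np

rng = np.random.default_rng(1)
alpha = np.ones(3)          # uniform Dir(1, 1, 1) prior over K = 3 categories
x = np.array([10, 20, 30])  # hypothetical counts

# Step 1: BF_e,R1 = (prior mass obeying theta1 < theta2) / (posterior mass obeying it)
prior = rng.dirichlet(alpha, size=100_000)
post = rng.dirichlet(alpha + x, size=100_000)
bf_e_r1 = np.mean(prior[:, 0] < prior[:, 1]) / np.mean(post[:, 0] < post[:, 1])

# Step 2: BF_R1,R2|R1 conditions on R1: keep only samples that already obey theta1 < theta2
prior_r1 = prior[prior[:, 0] < prior[:, 1]]
post_r1 = post[post[:, 0] < post[:, 1]]
bf_r1_r2 = np.mean(prior_r1[:, 1] < prior_r1[:, 2]) / np.mean(post_r1[:, 1] < post_r1[:, 2])

# Transitivity: the overall Bayes factor is the product of the stepwise factors
bf_er = bf_e_r1 * bf_r1_r2
bf_re = 1.0 / bf_er  # evidence in favor of the full ordering theta1 < theta2 < theta3
```

With ordered counts such as these, each stepwise posterior proportion exceeds its prior counterpart, so the product favors the restricted hypothesis.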

Bridge Sampling Method

We introduce bridge sampling as an alternative method to compute the ratio of marginal likelihoods (Bennett, 1976; Meng & Wong, 1996). The bridge sampling method promises robust estimates of the Bayes factor while being computationally efficient (Gronau et al., 2017). This property represents a clear advantage over the EP approach and the conditioning method.

4A method to sample from a truncated Dirichlet distribution is outlined after introducing the bridge sampling method.


As we determined before, the Bayes factor in Equation 13 can be formulated as a problem of estimating normalizing constants, namely the normalizing constants of the truncated prior and posterior distribution. Consider two densities of interest p1 = q1/c1 and p2 = q2/c2 with overlapping support Ω1 ∩ Ω2 in their parameter space. Bennett (1976) showed that for any bridge function b(θ) defined on Ω1 ∩ Ω2 with the property

0 < ∫_{Ω1∩Ω2} b(θ) p1(θ) p2(θ) dθ < ∞,

we obtain the following identity:

1 = [ ∫_{Ω1∩Ω2} b(θ) p1(θ) p2(θ) dθ ] / [ ∫_{Ω1∩Ω2} b(θ) p1(θ) p2(θ) dθ ].  (16)

To test inequality constraints we define p1(θ) and p2(θ) as the prior and posterior densities under the encompassing hypothesis, with Ω1 ∩ Ω2 corresponding to the restricted parameter space R. By multiplying both sides of the identity with the ratio of normalizing constants (Equation 13) we get:

BFer = p(θ ∈ R | He) / p(θ ∈ R | x, He)
     = [ p(θ ∈ R | He) / p(θ ∈ R | x, He) ] × [ ∫_{θ∈R} b(θ) p(θ | He) p(θ | x, He) dθ / ∫_{θ∈R} b(θ) p(θ | He) p(θ | x, He) dθ ]
     = [ ∫_{θ∈R} b(θ) p(θ | He) p(θ | x, Hr) dθ ] / [ ∫_{θ∈R} b(θ) p(θ | Hr) p(θ | x, He) dθ ]
     = E[ b(θ) p(θ | He) | Hr, x ] / E[ b(θ) p(θ | x, He) | Hr ].  (17)

Note that with the bridge sampling method we need to be able to sample from the order-constrained prior and posterior distributions in order to compute the Bayes factor. Under the assumption that the prior and posterior samples are independent, Bennett (1976) presented an optimal choice for the bridge function b(θ) which reduces the asymptotic variance of log(BF̂er). The optimal bridge function is given by:

b(θ) ∝ [ (I / (I + J)) × p(θ | He) + (J / (I + J)) × BFer × p(θ | x, He) ]^(−1),  for θ ∈ R,

where I denotes the number of samples from the prior distribution of the restricted model and J denotes the number of samples from the posterior distribution of the restricted model. Since the optimal choice of the bridge function depends on the Bayes factor itself, Meng and Wong (1996) suggested an iterative procedure, which uses the optimal bridge function based on the previous estimate BF̂er^(t):

BF̂er^(t+1) = [ (1/J) ∑_{j=1}^{J} 1 / ( I/(I+J) + J/(I+J) × BF̂er^(t) × l1,j ) ] / [ (1/I) ∑_{i=1}^{I} l2,i / ( I/(I+J) + J/(I+J) × BF̂er^(t) × l2,i ) ],  (18)

where

l1,j = p(θj | x, He) / p(θj | He) = [ B(α) / B(α + x) ] ∏_{k=1}^{K} θjk^xk,  with θj ∼ p(θ | x, Hr)  (samples from the posterior distribution),

l2,i = p(θi | x, He) / p(θi | He) = [ B(α) / B(α + x) ] ∏_{k=1}^{K} θik^xk,  with θi ∼ p(θ | Hr)  (samples from the prior distribution).

Meng and Schilling (2002) demonstrated rapid convergence (e.g., five iterations or less) of the bridge sampling estimator.
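The iterative scheme of Equation 18 can be illustrated on a toy problem where the true ratio of normalizing constants is known; the Gaussian densities below are stand-ins for the truncated prior and posterior, not the Dirichlet case itself:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two unnormalized densities with known constants, so the answer can be checked:
# q1 ∝ exp(-t^2/2), c1 = sqrt(2*pi); q2 ∝ exp(-t^2/8), c2 = sqrt(8*pi); true c1/c2 = 0.5
q1 = lambda t: np.exp(-t**2 / 2.0)
q2 = lambda t: np.exp(-t**2 / 8.0)

I, J = 50_000, 50_000
th1 = rng.normal(0.0, 1.0, I)   # draws from p1 = q1/c1 (plays the role of the truncated prior)
th2 = rng.normal(0.0, 2.0, J)   # draws from p2 = q2/c2 (plays the role of the truncated posterior)
s1, s2 = I / (I + J), J / (I + J)

l1 = q2(th1) / q1(th1)  # density ratios evaluated at the p1 draws
l2 = q2(th2) / q1(th2)  # density ratios evaluated at the p2 draws

r = 1.0  # starting estimate of c1/c2 (the analogue of BF_er)
for _ in range(20):  # Meng & Schilling (2002): convergence is typically rapid
    num = np.mean(1.0 / (s1 + s2 * r * l2))   # averages q1 * b(θ) over the p2 draws
    den = np.mean(l1 / (s1 + s2 * r * l1))    # averages q2 * b(θ) over the p1 draws
    r = num / den
print(r)  # ≈ 0.5
```

The update inserts the previous estimate into the optimal bridge function and re-evaluates both expectations, exactly as in Equation 18.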

Sampling from a Truncated Dirichlet Distribution

In the case of order constrained hypotheses the conditioning method and the bridge sampling method require samples from the prior and posterior distribution of a truncated Dirichlet random variable. The easiest approach would be to sample directly from the truncated posterior distribution, but given the dependent parameter space of the Dirichlet distribution this is a challenging task. Since each parameter of a truncated Dirichlet distribution has as its upper and lower bounds the parameter values of its neighbors, the prior and posterior distributions are not available in closed form and direct sampling is impossible.


Alternatively, we could draw samples from the encompassing hypothesis and reject every sample that disobeys the order constraint. Ultimately these samples would converge to the truncated distributions, but this method is computationally very costly and time-consuming. For instance, in our simulation study even drawing 1,000,000 samples from the encompassing hypothesis yielded no sample that obeyed the order constraint. Therefore, we turn to guidelines proposed by Damien and Walker (2001) in order to sample from truncated Gamma distributions. These guidelines are:

1. Dissolve the dependent parameter space of the Dirichlet random variable by transforming it into independent Gamma random variables.

2. Implement a Gibbs sampler to bypass the difficulty of sampling directly from a truncated distribution and instead sample from the easier full conditional distributions.

3. Use the inverse CDF technique, which reduces sampling from a truncated Gamma distribution to sampling from a uniform distribution.

4. Transform the sampled independent Gamma variables back into Dirichlet variables.

Below we describe each step in turn.

(1) Transforming the Dirichlet random variable. Let Z be a Gamma random variable, Z ∼ Gamma(α, β). Devroye (1986) proved that the parameter vector θ of the Dirichlet distribution can be transformed into a series of K independently distributed Gamma random variables:

(θ1, · · · , θK) = (1 / ∑_{k=1}^{K} Zk) × (Z1, · · · , ZK) ∼ Dir(α1, · · · , αK).  (19)

By setting the rate parameter of the Gamma distribution to β = 1, the PDF of the individual Gamma random variables is given by:

p(Zk) = Zk^(αk−1) exp(−Zk) / Γ(αk).

By using this transformation we translate order constraints imposed on the parameter vector θ, e.g., θ1 < θ2 < · · · < θK, into constraints on the individual Gamma random variables: Z1 < Z2 < · · · < ZK. Hence, we move away from sampling from the constrained parameter space R of the Dirichlet distribution and instead sample from the constrained parameter space RZ of the individual Gamma distributions. The prior mass p(θ ∈ R | He) can then be expressed as p(Z ∈ RZ | He), which is defined as:

p(Z ∈ RZ | He) = ∫ · · · ∫_{RZ} ∏_{k=1}^{K} [ Zk^(αk−1) exp(−Zk) / Γ(αk) ] dZK · · · dZ1.

Similarly, p(Z ∈ RZ | x, He) is obtained by substituting αk with αk + xk.
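A minimal sketch of this transformation (the α values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(42)
alpha = np.array([2.0, 3.0, 5.0])   # hypothetical Dirichlet parameters

# Draw K independent Gamma(alpha_k, 1) variables and normalize them (Equation 19)
Z = rng.gamma(shape=alpha, scale=1.0, size=(100_000, 3))
theta = Z / Z.sum(axis=1, keepdims=True)   # rows are draws from Dir(2, 3, 5)

print(theta.mean(axis=0))   # close to the Dirichlet mean (0.2, 0.3, 0.5)

# Order constraints transfer: theta1 < theta2 < theta3 exactly when Z1 < Z2 < Z3
agree = bool(np.all((np.diff(Z, axis=1) > 0) == (np.diff(theta, axis=1) > 0)))
print(agree)  # True
```

Because the normalization divides every component by the same positive sum, an ordering on the Gamma scale holds if and only if it holds on the Dirichlet scale.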

(2) Implementing the Gibbs sampler. Next we will embed the problem in the Gibbs sampling framework (Geman & Geman, 1984; Gelfand & Smith, 1990). Damien and Walker (2001) showed that Gibbs sampling–by introducing an additional random variable Y –simplifies the difficulty of sampling from a truncated Gamma distribution.

Gibbs sampling aims to draw samples from at least two random variables by iteratively sampling from their full conditional distributions, since these are usually easier to sample from than their joint distribution (Casella & George, 1992). The full conditional distribution is the conditional distribution of one random variable given the current values of all other random variables in the Gibbs sampler.

A Gibbs sampling cycle with two variables involves the following steps: First, the parameter values of the first distribution are initialized. Second, samples from the second distribution are drawn by conditioning on the initialized parameter values of the first density. Third, samples from the first distribution are drawn by conditioning on the drawn values of the second density. Finally, the cycle starts again. Geman and Geman (1984) show that as the number of iterations increases, the samples converge to the true joint distribution.

In our case, we extend our framework with an additional random variable Yk, because sampling from the full conditional distribution is easier than sampling directly from the truncated Gamma distribution. The density of the kth truncated Gamma distribution is given by:

p(Zk) ∝ Zk^(αk−1) exp(−Zk) I( Zk ∈ (a, b) ),

with lower and upper bound of Zk given by 0 ≤ a < b ≤ ∞. We introduce a random

variable Yk such that the joint density of the two variables p(Zk , Yk ) is defined as

p(Zk, Yk) ∝ Zk^(αk−1) I( Yk < exp(−Zk), Zk ∈ (a, b) ).

We see that the truncated Gamma density presents itself as the marginal distribution obtained by integrating over the parameter space of Yk:

p(Zk) = Zk^(αk−1) ∫_0^{exp(−Zk)} 1 dYk × I( Zk ∈ (a, b) ) = Zk^(αk−1) exp(−Zk) I( Zk ∈ (a, b) ).

The joint distribution has properties that make it easy to sample from the full conditional distributions. For instance, the full conditional density p(Yk | Zk) is a simple uniform distribution:

p(Yk | Zk) ∼ Uniform(0, exp(−Zk)).

The full conditional density p(Zk | Yk) is derived as follows. Based on p(Yk | Zk) we know that

p(Zk, Yk) ∝ Zk^(αk−1) I( Yk < exp(−Zk), Zk ∈ (a, b) ) = Zk^(αk−1) I( Yk < exp(−Zk) ) I( Zk ∈ (a, b) ).

We can rewrite I( Yk < exp(−Zk) ) as I( Zk < −log(Yk) ) by taking logarithms and reversing the inequality. This means that the upper bound of p(Zk | Yk), henceforth denoted UYk, is the smaller of b and −log(Yk): UYk = min(b, −log(Yk)). The lower bound of p(Zk | Yk), henceforth denoted Lk, is Lk = a. Therefore, the full conditional density p(Zk | Yk) is defined as

p(Zk | Yk) ∝ Zk^(αk−1) I( Zk ∈ (Lk, UYk) ).

Since integrating Zk^(αk−1) over (Lk, UYk) yields

∫_{Lk}^{UYk} Zk^(αk−1) dZk = (1/αk) ( UYk^αk − Lk^αk ),

the proper probability density function of p(Zk | Yk) is given by:

p(Zk | Yk) = αk Zk^(αk−1) I( Zk ∈ (Lk, UYk) ) / ( UYk^αk − Lk^αk ).

(3) The inverse CDF technique. Drawing samples from p(Yk | Zk) reduces to sampling from a uniform distribution. With the inverse CDF technique, Damien and Walker (2001) show that sampling from p(Zk | Yk) is also reduced to sampling from a uniform distribution. The starting point of this technique is the cumulative distribution function (CDF) of p(Zk | Yk):

P(Zk | Yk) = ∫_{Lk}^{Zk} αk z^(αk−1) / ( UYk^αk − Lk^αk ) dz = ( Zk^αk − Lk^αk ) / ( UYk^αk − Lk^αk ),  Zk ∈ (Lk, UYk).

The inverse CDF technique is based on the property that a CDF assigns every parameter value a unique value in the range from 0 to 1. Thus, we can treat the value of P(Zk | Yk) as a draw from a uniform distribution:

P(Zk | Yk) ∼ Uniform(0, 1).

By solving P(Zk | Yk) for Zk we retrieve the parameter value Zk given a sample from a uniform distribution. The value Zk is then given by:

Zk = ( P(Zk | Yk) ( UYk^αk − Lk^αk ) + Lk^αk )^(1/αk).  (20)
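Steps (2) and (3) for a single component can be sketched as follows (the function name and truncation bounds are ours, for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)

def draw_z_given_y(alpha_k, a, b, y, rng):
    """One draw from p(Z_k | Y_k) ∝ Z^(alpha_k - 1) on (L_k, U_Yk), via Equation 20."""
    lower = a                             # L_k = a
    upper = min(b, -np.log(y))            # U_Yk = min(b, -log(Y_k))
    u = rng.uniform()                     # P(Z_k | Y_k) ~ Uniform(0, 1)
    return (u * (upper**alpha_k - lower**alpha_k) + lower**alpha_k) ** (1.0 / alpha_k)

# One Gibbs step for a truncated Gamma(2.5, 1) restricted to (a, b) = (0.5, 2.0)
z = 1.0                                   # current value of Z_k
y = rng.uniform(0.0, np.exp(-z))          # step (2): Y_k | Z_k ~ Uniform(0, exp(-Z_k))
z_new = draw_z_given_y(2.5, 0.5, 2.0, y, rng)  # step (3): inverse-CDF draw
print(0.5 <= z_new <= 2.0)  # True: the new draw respects the truncation bounds
```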

Now the Gibbs sampler can be implemented. One Gibbs sampling cycle proceeds as follows. (1) The parameter values for Z are initialized. (2) A value for Y1 is obtained by sampling from p(Y1 | Z1) ∼ Uniform(0, exp(−Z1)). (3) A value for Z1 is obtained by first sampling from a Uniform(0, 1) distribution and then applying Equation 20. Steps (2) and (3) are repeated K times. As a final step the obtained Gamma random variables Z are back-transformed and saved as a Dirichlet variable by applying Equation 19. Then the Gibbs cycle starts again.
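The complete cycle can be sketched as follows, here for the strict ordering θ1 < θ2 < · · · < θK (the function name is ours; a sketch, not the authors' implementation):

```python
import numpy as np

def truncated_dirichlet_gibbs(alpha, n_samples, rng):
    """Gibbs sampler for Dir(alpha) truncated to theta_1 < ... < theta_K,
    using the Gamma representation Z_1 < ... < Z_K (Damien & Walker, 2001)."""
    K = len(alpha)
    Z = np.arange(1.0, K + 1.0)            # (1) initialize so the constraint holds
    draws = np.empty((n_samples, K))
    for t in range(n_samples):
        for k in range(K):
            y = rng.uniform(0.0, np.exp(-Z[k]))   # (2) Y_k | Z_k ~ Uniform(0, exp(-Z_k))
            lower = Z[k - 1] if k > 0 else 0.0    # neighbours bound Z_k from below ...
            upper = min(Z[k + 1] if k < K - 1 else np.inf, -np.log(y))  # ... and above
            u = rng.uniform()                     # (3) inverse-CDF draw, Equation 20
            Z[k] = (u * (upper**alpha[k] - lower**alpha[k]) + lower**alpha[k]) ** (1.0 / alpha[k])
        draws[t] = Z / Z.sum()                    # (4) back-transform, Equation 19
    return draws

rng = np.random.default_rng(3)
theta = truncated_dirichlet_gibbs(np.array([1.0, 1.0, 1.0]), 2_000, rng)
print(bool(np.all(np.diff(theta, axis=1) > 0)))  # True: every draw obeys the ordering
```

Every update keeps Z inside the constrained region, so no draw is ever rejected.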

Simulation Study

To investigate the behavior of the described methods for calculating the Bayes factor for inequality constrained hypotheses we conducted a simulation study with synthetic data. The simulation study investigates the Bayes factors of the three methods under increasing model complexity as well as increasing evidence for the restricted hypothesis. The restricted hypothesis under consideration entails that the model parameters θk are increasing:

H+ : θ1 < θ2 < · · · < θK.

Methods

Simulation conditions. We conducted 4 sets of simulations, each with differing model complexity by means of an increasing number of categories (5, 10, 15 and 20 categories). For each set of simulations we generated 11 synthetic data sets with N = 170, 240, 360 and 480 observations, respectively, and increasing evidence for H+. We increased the evidence for the restricted hypothesis by means of the correlation coefficient ρ; this means that a strong correlation between the data and the category indexes indicates strong evidence for H+. We then computed the Bayes factor BF+e = 1/BFe+ in favor of the restricted hypothesis with the EP approach, the conditioning method and the bridge sampling method for each data set.

Generating the data sets. It should be noted that we aimed for exact correlation coefficients for every simulated data set while keeping the number of observations reasonably small. This is why we allowed for non-integer counts in our data sets, which means that, strictly speaking, they do not originate from a multinomial distribution. However, the calculation of the Bayes factors depends on the parameter vector α of the Dirichlet distribution; this vector can take integer as well as non-integer values, which means that the calculation of the Bayes factor is not affected by our data choice.

To generate the data sets we exploited the fact that two vectors can be transformed in such a way that the angle between them, and thus their correlation, equals a predefined value ρ = ρ0 (see MacFarlane, 2013, for details). Given two vectors x1 and x2 with length 1 and mean 0, we know that the angle φ between those vectors equals the inverse cosine of their correlation,

φ = cos−1(ρ).

If we set the correlation to a particular value we can compute the corresponding φ0 by φ0 = cos−1(ρ0).

We then shift x2 in such a way that the correlation between x1 and the shifted vector x′2 corresponds to ρ0:

x′2 = x2 + (1 / tan(φ0)) x1.

To create the synthetic data sets we specified x1 as a vector containing the index values of the categories,

x1 = (1, 2, · · · , K)⊤,

and x2 as a vector with the same number of entries, randomly generated from a normal distribution:

x2 ∼ Normal(2, 0.5).

Next, both vectors were scaled in such a way that they were (1) orthogonal to each other and (2) of length 1 with a mean of 0. Then we transformed x2 so that the correlation between x′2 and x1 corresponded to the value ρ0. For every data set, we subsequently increased the correlation from 0 to 1 in steps of 0.1, where the last value was set to 0.99 instead of 1. Lastly, we applied a linear transformation to x′2 to increase the standard deviation of the values and to match the total number of observations per model condition to N = 170, 240, 360 and 480, respectively. Note that correlations between vectors are invariant with respect to linear transformations. The full tables including the simulated data sets can be found in Appendix B.
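A sketch of this construction (seed and starting values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(11)
K, rho0 = 5, 0.8
phi0 = np.arccos(rho0)                      # target angle between the vectors

def center_unit(v):
    """Center a vector to mean 0 and rescale it to length 1."""
    v = v - v.mean()
    return v / np.linalg.norm(v)

x1 = center_unit(np.arange(1.0, K + 1.0))   # category indexes (1, ..., K)
x2 = rng.normal(2.0, 0.5, K)                # random companion vector
x2 = center_unit(x2 - (x2 @ x1) * x1)       # orthogonal to x1, mean 0, length 1

x2_shifted = x2 + x1 / np.tan(phi0)         # shift so that corr(x1, x2') = rho0
print(round(np.corrcoef(x1, x2_shifted)[0, 1], 6))  # 0.8
```

Since the shifted vector equals the orthogonal part plus cot(φ0) times x1, its correlation with x1 is cos(φ0) = ρ0 by construction, and a subsequent positive linear rescaling leaves it unchanged.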

Sampling details. For the simulation sets containing models with 5 and 10 categories we drew a total of 1,000,000 prior and posterior samples under the unrestricted and restricted hypotheses. We excluded the first 4,999 values as burn-in; to reduce autocorrelation of the chains we kept every 100th value of the chain, yielding 9,951 prior and posterior samples. Then, we computed the Bayes factors for the EP approach, the conditioning method and the bridge sampling method.

Since the models containing 15 and 20 categories are considerably larger, we decided to draw a total of 3,000,000 samples from the prior and posterior distribution under the unrestricted and restricted hypotheses. From these samples we excluded the first 4,999 values as burn-in. In addition we retained every 200th value of the chain to reduce autocorrelation, yielding 14,976 prior and posterior samples.

To obtain an indication of the sampling error we repeated the Bayes factor estimation 100 times in the simulation set including models with 5 categories. From these repetitions we calculated the mean Bayes factor and the corresponding 95% credible interval. This method to quantify uncertainty was only applied to the model condition with 5 parameters due to its high computational cost.
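The burn-in, thinning, and repetition scheme can be sketched as follows (the chain and the repeated estimates are placeholder noise; only the sample counts come from the text):

```python
import numpy as np

rng = np.random.default_rng(5)

chain = rng.normal(size=1_000_000)    # stand-in for one raw MCMC chain
kept = chain[4_999:][::100]           # drop burn-in, keep every 100th draw
print(len(kept))                      # 9951 retained samples, as in the text

# Sampling error of a repeated estimate: mean and central 95% interval
log_bf = rng.normal(0.9, 0.1, size=100)   # stand-in for 100 repeated log BF estimates
mean_log_bf = log_bf.mean()
lower, upper = np.percentile(log_bf, [2.5, 97.5])
print(bool(lower < mean_log_bf < upper))  # True
```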

Results

The results for the model containing 5 categories are summarized in Figure 2. The results of the simulations containing models with 10, 15 and 20 categories are displayed in Figure 3. For the smallest model, the three methods produce identical results. In the other model conditions the EP approach failed to produce a Bayes factor, since none of the prior and posterior samples under the unrestricted hypothesis obeyed the order constraint. It is evident from the figures that for small and medium-sized models containing up to 15 categories the conditioning method and the bridge sampling method correspond to each other. For the model with a large number of categories, however, the Bayes factors deviate from each other. Under this condition, the bridge sampling method seems to be heavily biased towards the restricted hypothesis, especially if the data do not support the restricted hypothesis.

For instance, if the data show a complete absence of evidence for H+, indicated by a correlation of ρ = 0, the conditioning method estimates a Bayes factor of BF+e = 0.40 in favor of the restricted hypothesis, indicating anecdotal evidence for He according to the common interpretation categories of the Bayes factor (see Table 1). The bridge sampling method, however, estimates a Bayes factor of BF+e = 41.76, indicating very strong evidence for H+. We therefore conclude that the bridge sampling method in its current form fails to produce accurate estimates of the Bayes factor for large models. The complete tables of the log Bayes factors for each model condition are displayed in Tables 2 and 3.

Figure 2. This figure displays the mean log Bayes factors Log(BF+e) in favor of the restricted hypothesis for the three methods (EP approach, conditioning method, bridge sampling method), assuming a model with 5 parameters, under increasing evidence for the restricted model by means of ρ. Under this easy model, all three methods correspond perfectly to each other.

Table 2

Mean and 95% credible intervals for the log Bayes factors in favor of the restricted hypothesis for the encompassing prior (EP) approach, the conditioning method and the bridge sampling method for the model with 5 parameters under increasing evidence for the restricted hypothesis. The total number of observations was N = 170.

Correlation  EP approach              Conditioning method      Bridge sampling method
0            −0.64 [−0.95, −0.29]     −0.67 [−0.73, −0.58]     −0.68 [−0.84, −0.53]
0.1          −0.70 [−1.08, −0.34]     −0.69 [−0.76, −0.60]     −0.67 [−0.81, −0.53]
0.2          −0.46 [−0.78, −0.07]     −0.45 [−0.52, −0.39]     −0.45 [−0.60, −0.32]
0.3          −0.26 [−0.597, 0.081]    −0.27 [−0.33, −0.20]     −0.27 [−0.40, −0.15]
0.4          0.152 [−0.130, 0.432]    0.158 [0.092, 0.253]     0.165 [0.025, 0.283]
0.5          0.359 [0.127, 0.644]     0.372 [0.305, 0.437]     0.383 [0.273, 0.504]
0.6          0.896 [0.672, 1.086]     0.899 [0.841, 0.947]     0.900 [0.781, 1.021]
0.7          1.224 [0.976, 1.503]     1.191 [1.139, 1.241]     1.183 [1.071, 1.296]
0.8          1.932 [1.696, 2.194]     1.918 [1.877, 1.955]     1.922 [1.841, 2.014]
0.9          2.707 [2.505, 2.917]     2.701 [2.665, 2.734]     2.696 [2.619, 2.772]
0.99         4.57 [4.38, 4.84]        4.570 [4.560, 4.580]     4.568 [4.508, 4.624]

Figure 3. This figure displays the log Bayes factors in favor of the restricted hypothesis for the bridge sampling method (x-axis) and the conditioning method (y-axis) for models with 10, 15 and 20 parameters under increasing evidence for the restricted hypothesis. For the medium-sized models with 10 and 15 parameters the methods yield fairly consistent estimates of the Bayes factor. However, under larger models the log Bayes factor of the bridge sampling method becomes inaccurate, especially in cases with low evidence for the restricted hypothesis.


Table 3

Log Bayes factors in favor of the restricted hypothesis for the encompassing prior (EP) approach, the conditioning method and the bridge sampling method for models with 10, 15 and 20 parameters under increasing evidence for the restricted hypothesis.

             10 parameters (N = 240)     15 parameters (N = 360)     20 parameters (N = 480)
Correlation  EP    Cond.    Bridge       EP    Cond.    Bridge       EP    Cond.    Bridge
0            –     0.46     0.05         –     1.54     2.17         –     −0.91    3.732
0.1          –     0.41     0.45         –     1.12     2.11         –     0.53     4.110
0.2          –     0.174    0.098        –     0.138    1.49         –     0.448    4.474
0.3          –     0.592    0.731        –     1.234    0.68         –     0.743    4.458
0.4          –     0.583    0.520        –     0.882    2.40         –     1.406    5.149
0.5          –     0.891    1.297        –     1.703    0.386        –     1.930    5.053
0.6          –     1.414    1.629        –     2.885    0.574        –     3.131    5.238
0.7          –     2.005    2.290        –     3.651    2.676        –     5.124    7.028
0.8          –     2.982    3.358        –     5.055    4.601        –     6.741    6.518
0.9          –     4.542    4.790        –     7.958    7.549        –     10.20    11.56
0.99         –     10.70    10.74        –     19.26    19.29        –     25.01    24.97

Discussion

We conducted a simulation study to investigate the performance of the EP approach, the conditioning method and the bridge sampling method. For that we simulated data sets with increasing evidence for the restricted hypothesis H+, which states that the model parameters θk are increasing, for models with 5, 10, 15 and 20 categories, respectively.

The EP approach works well for models containing a small number of categories but is not applicable to larger models; it failed to estimate Bayes factors for medium-sized and large models starting from 10 categories. As the number of categories, and therefore the number of associated order constraints, increases, the restricted area in the parameter space becomes simply too small, and obtaining a sufficient number of samples is too time consuming. The conditioning method performed well in all simulation conditions, which makes it a reliable method even for complex models. It should be noted, however, that this method is computationally the most costly, since it requires posterior samples for every individual restriction. The bridge sampling method failed to give accurate Bayes factor estimates for large models but performed well for small and medium-sized models. It could serve as a practical choice for such cases since it is computationally much less costly than the conditioning method. Subsequently, we discuss reasons and possible improvements for the bridge sampling method for cases with a large number of categories.

Autocorrelation and Convergence. Meng and Wong (1996) showed that the optimal bridge function b(θ) performs well under the assumption that the draws from the two densities, in our case the truncated prior and posterior densities, are independent. However, we obtained these draws using the Gibbs sampling algorithm, which depends on the previous draws as it successively explores the parameter space. A reason for the poor performance of the bridge sampling method in large models could be that (1) the samples show a large autocorrelation and this autocorrelation is uneven between the two densities, and (2) the chains have not converged to their stationary distribution.

If the autocorrelation in the chains is high then the number of independent draws I and J is overestimated, whereby the bridge function b(θ) is no longer optimal. Meng and Schilling (2002) suggest adjusting I and J to the "effective size" by taking into account a suitable measure of the autocorrelation ρa of the samples: Ĩ = I(1 − ρa,i) and J̃ = J(1 − ρa,j).
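A sketch of this adjustment on a synthetic chain with known lag-1 autocorrelation (the AR(1) stand-in and the function name are ours):

```python
import numpy as np

rng = np.random.default_rng(2)

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1."""
    x = x - x.mean()
    return (x[:-1] @ x[1:]) / (x @ x)

# AR(1) chain with known lag-1 autocorrelation 0.5, a stand-in for Gibbs output
n, phi = 50_000, 0.5
x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

rho_a = lag1_autocorr(x)
n_eff = n * (1 - rho_a)   # adjusted "effective size" in the sense of Meng & Schilling
print(round(rho_a, 1))    # ≈ 0.5
```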

We decided to investigate only one case, namely a case with a low correlation coefficient of ρ = 0.3, since there the differences in the Bayes factor between the conditioning method and the bridge sampling method are particularly high. We calculated the autocorrelation of the restricted prior and posterior distributions in the correlation condition ρ = 0.3. The autocorrelation was calculated for all model conditions for lags ranging from 1 to 200 and ranged from −0.029 to 0.032 with a mean of 0. A detailed overview is given in Table 4. Based on these results we can assume that the samples are independent.

Next, we investigated the convergence of the chains by inspecting traces of the samples. For that we randomly selected the model parameter θ3 and plotted its trace for all model conditions. The trace plots are displayed in Figure 4. Based on the trace plots of the chains, we can conclude that the chains have converged from their initial starting values to their stationary distributions.5 Since our draws from the two densities are in good condition, we will further investigate the bridge sampler itself.

Table 4

Mean and range of autocorrelation for all parameters in the differing simulation conditions with low evidence (ρ = 0.3) for the restricted model and a lag of 1. The autocorrelation is fairly low for all conditions, which indicates the independence of prior and posterior samples.

Model condition  Prior samples           Posterior samples
5                0.001 [−0.027, 0.024]   0 [−0.021, 0.017]
10               0 [−0.020, 0.018]       0 [−0.029, 0.032]
15               0 [−0.021, 0.018]       0 [−0.026, 0.021]
20               0 [−0.023, 0.023]       0 [−0.024, 0.025]

Missing overlap between prior and posterior density. The precision of the bridge sampling estimator also depends on the overlap between the probability distributions p1 and p2, i.e., the truncated prior and posterior distributions. Meng and Schilling (2002) showed that the Hellinger distance, a measure capturing the similarity between two distributions, influences the variance of the logarithmic bridge sampling estimator. The Hellinger distance ranges from 0, which is reached if and only if p1 = p2, to 1, which is reached if the two distributions have 0% overlap (Nikulin, 2011). If the overlap between the distributions becomes too small, for instance, when the posterior is peaked around a specific value and the prior distribution is very wide, then the bridge sampling estimator might become unstable. To investigate the overlap between the prior and posterior distributions, we visualized the marginal prior and posterior distribution of parameter θ3 for every model and calculated the corresponding Hellinger distance. The results are displayed in Figure 5. The missing overlap between the parameter spaces is evident for the model condition with 20 categories; here the posterior and the prior distribution mismatch strongly. This observation is also supported by the Hellinger distance of 0.995 between the marginal prior and posterior

5Figures displaying the traces of all parameters with correlation coefficient of ρ = 0.3 are stored in an online Appendix in the OSF folder.
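The Hellinger distance between two sampled marginal distributions can be estimated from histograms; a sketch with stand-in samples for a diffuse prior and a peaked posterior:

```python
import numpy as np

rng = np.random.default_rng(9)

def hellinger_from_samples(a, b, bins=200):
    """Histogram estimate of the Hellinger distance between two sampled densities."""
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    p, edges = np.histogram(a, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(b, bins=bins, range=(lo, hi), density=True)
    width = edges[1] - edges[0]
    # H^2 = 1 - integral of sqrt(p(x) q(x)) dx, so H lies in [0, 1]
    return np.sqrt(max(0.0, 1.0 - np.sum(np.sqrt(p * q)) * width))

wide = rng.normal(0.0, 1.0, 100_000)      # stand-in for a diffuse prior marginal
peaked = rng.normal(3.0, 0.05, 100_000)   # stand-in for a concentrated posterior marginal
print(hellinger_from_samples(wide, peaked) > 0.9)   # True: almost no overlap
print(hellinger_from_samples(wide, wide) < 0.1)     # True: identical samples
```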
