• No results found

An asymptotic analysis of Gibbs-type priors

N/A
N/A
Protected

Academic year: 2021

Share "An asymptotic analysis of Gibbs-type priors"

Copied!
66
0
0

Bezig met laden.... (Bekijk nu de volledige tekst)

Hele tekst

(1)

An asymptotic analysis of Gibbs-type priors

Carlo Kuiper

February 19, 2016

Master’s thesis

Supervisor: Bas Kleijn

Korteweg-de Vries Institute for Mathematics Faculty of Science

(2)

Abstract

One can use Bayesian inference to make future predictions for unobserved data with discrete nonparametric priors. Gibbs-type priors are a general class of discrete nonparametric priors. A framework is given for applying Bayesian inference from a fre-quentist perspective. Two applications of nonparametric priors are studied: species sampling and mixture modeling. We de-fine Gibbs-type priors by their exchangeable partition probabil-ity function. Then the asymptotic behaviour of these priors is analyzed: a convergence and a consistency theorem is proven. We conclude that under certain conditions Gibbs-type priors are consistent.

Information

Title: An asymptotic analysis of Gibbs-type priors Author: Carlo Kuiper, Carlo.Kuiper@student.uva.nl Supervisor: Bas Kleijn

Second reviewer: Harry van Zanten

Korteweg-de Vries Institute for Mathematics University of Amsterdam

Science Park 904, 1098 XH Amsterdam http://www.kdvi.uva.nl

(3)

Contents

Introduction 2

1 Bayesian Nonparametrics 5

1.1 A Bayesian framework . . . 5

1.2 Bayesian definitions . . . 8

1.3 A frequentist view on Bayesian inference . . . 12

1.4 Applications of Bayesian inference . . . 15

1.4.1 Species sampling . . . 15

1.4.2 Mixture modeling . . . 16

2 Gibbs-type priors 21 2.1 Random Partitions . . . 21

2.2 Species Sampling Sequences . . . 25

2.3 Discrete Nonparametric Priors . . . 27

3 Limit behavior of Gibbs-type priors 30 3.1 Two limit theorems . . . 30

3.2 Convergence of Gibbs-type priors . . . 35

3.3 Consistency of Gibbs-type priors . . . 42

3.4 The support of Gibbs-type priors . . . 50

Conclusion 52

Appendix A Standard Borel spaces 53

Appendix B Regular conditional probability 56

(4)

Introduction

In this thesis we study Gibbs-type priors defined by a species sampling se-quence. In a species sampling sequence a finite sequence of observations is given, in which there are possibly different observations. These distinct obser-vations are commonly referred to as species. An important concern is, given the sequence of observations, to predict properties of future observations. Many statistical problems can be stated in such a context, for example:

• What will be the value of a new observation?

• Let n, m ∈ N and given X1, . . . , Xnobservations, how many new species

will be observed in m new observations?

To answer such questions we assign probabilities to new possible outcomes with a prediction rule based upon a Gibbs-type process. We consider a special case of Gibbs-type process, the Dirichlet process, to explain this procedure. The Dirichlet process can be characterized by a parameter θ > 0 and a diffuse probability measure P∗ on a measurable space (X , F ) (P∗(x) = 0 for all x ∈ X ). We assume there are n observations X(n):= X1, . . . , Xn. Let k ≤ n

denote the distinct observations X1∗, . . . , Xk∗, with frequencies n1, . . . , nk≥ 1

respectively. The probability of finding a new, not yet observed, value for the (n + 1)-th observation is parametrized by θ,

P (Xn+1= “new”|X(n)) =

θ

θ + n. (1)

Each new observation is distributed according to P∗. The probability that the (n + 1)-th observation equals a previous observation i ∈ {1, . . . , k} is

P (Xn+1 = Xi∗|X(n)) =

ni

θ + n. (2)

Consider the following example, where we let θ = 1 and P∗ the uniform dis-tribution on [3, 4]. Suppose we have collected a sequence of 5 observations

(5)

{X1, X2, X3, X4, X5} = {2, −0.11, 7.1, 2, 2} and are to predict the 6-th

obser-vation. Using equation (1) and equation (2) we can derive the probability density function (PDF) of X6, see figure 1 below. In this case the probability

of observing a new species equals 1/6.

1 2 3 4 5 6 7 X6 0.1 0.2 0.3 0.4 0.5 PHX6L

Figure 1: The PDF of X6 using a prediction rule based upon a Dirichlet

process with θ = 1, P∗ the uniform distribution on [3, 4] and observations {X1, X2, X3, X4, X5} = {2, −0.11, 7.1, 2, 2}.

If the data (Xi)i≥1 is independent and identically distributed (i.i.d.) with

measure P0, then we can generalize the prediction rule from (X1, . . . , X5) to

(X1, . . . , Xn). We will look at the limit of n → ∞ for probability measures

generated by a prediction rule similar to the one defined by equation (1) and (2). For a prediction rule specified with a Dirichlet process we will see that for many choices of θ and P∗ the generated probability measure converges to P0, in which case we will say that the Dirichlet process prior is consistent.

In fact, the choice of θ does not play a role for consistency.

The main focus of this thesis is to study a generalization of Dirichlet process priors: Gibbs-type priors. The research question is formulated as follows: In which cases are Gibbs-type priors consistent? The relevance of this question is that these priors have various interesting applications if they are consistent, including species sampling [14], mixture modeling [13], survival analysis [10], and so forth. In order to understand and answer the research question, mathematical properties of Gibbs-type priors are studied in depth.

(6)

chapters. When reading this chapter some general knowledge about measure theory is required, which can be found in many introductory books and syllabi in measure theory, for instance in Kallenberg’s Foundations of Modern Probability [11]. Also Bayesian inference is explained and formalized from a frequentist perspective. Finally, we study two applications of Bayesian inference: species sampling and mixture modeling.

The second chapter starts off by defining partition structures needed to un-derstand the various priors that are to be discussed. In particular we study infinite exchangeable random partitions and the relevant theorems. Then a species sampling sequence is introduced along with a characterization of discrete nonparametric priors by a prediction rule. The chapter is concluded by characterizations of a Dirichlet process prior, a Pitman-Yor process priors and a generalized version of both: Gibbs-type priors.

The third chapter covers the limit theorems for Gibbs-type priors. First we address the convergence theorem and then a consistency theorem. Finally we briefly study the support of Gibbs-type priors.

We finish with a conclusion, in which we briefly summarize the previous chapters, state results and look at open problems for further research. The subject of this thesis is, not surprisingly, similar to the article of De Blasi, Lijoi and Pr¨unster [3]. Upon reading this article, questions arose about the convergence theorem and then a consistency theorem. The starting point of this work was to show these proofs in detail and to fully understand the line of reasoning.

Lastly, I like to express my gratitude to my supervisor dr. Bas Kleijn for his scientific support and guidance in understanding such a complex mathemat-ical subject. I also thank him for many fruitful discussions we had. I must say that this master project challenged my thinking about statistics as this is my first in depth experience with Bayesian statistics. I would like to thank prof. dr. Harry van Zanten as well for being my second reviewer. I hope you will enjoy reading my master’s thesis titled: “An asymptotic analysis of Gibbs-type priors.”

(7)

Chapter 1

Bayesian Nonparametrics

We set up a theoretical framework for Bayesian statistics applied in the following chapters. The ingredients of a Bayesian inference procedure are explained: we study the data, the model, the prior and the posterior. We study Bayesian inference to define the posterior and give a view of the poste-rior from a frequentist perspective. We finish the chapter by looking at two relevant applications of Bayesian inference. The first application is species sampling, which helps to clarify the covered material. The second application is mixture modeling, a useful technique to approximate a probability density.

1.1

A Bayesian framework

Bayesian inference provides a mathematical way of combining new evidence with prior beliefs to understand, describe and estimate randomness of mea-sured data. It relies on four basic ingredients: the data, a model, a prior and a posterior. Before proceeding we give an intuitive explanation of all four.

1. The data is an observation or measurement which we denote by X. We consider X to be a random variable. Often the data consists of a sample of n independent and identically distributed (i.i.d.) random variables, denoted by X(n):= X

1, . . . , Xn.

2. The model is a collection P of all the candidate probability measures P ∈ P representing probability measures for the data X.

(8)

3. A frequentist would assume that the data X has a true probability dis-tribution P0which is contained in the model. However, from a Bayesian

perspective, it is not assumed that the data necessarily has a true prob-ability distribution. In the Bayesian philosophy the data has a random probability measure, that is drawn from the model according to a prior with a probability measure Π on the model. So the data X has a probability measure P , that is first drawn from P according to Π. 4. Moreover, one can look at Π(·|X). This is again a probability measure

on the model P, it is conditioned on a realization of the data. This measure is called the posterior. Using a posterior to draw statistical conclusions is referred to as Bayesian inference.

We begin with formalizing the data. The data consists of either one value X or multiple values X(n), taking values in a measurable space (X , F ) or

the n-fold product of a measurable space (X , F ), called the sample space. Formalizing the other concepts is more involved.

The model we use throughout this thesis is known as a statistical model. Given the sample space (X , F ), we look at all possible measures on this measurable space. A collection of these measures is referred to as a statistical model.

Definition 1.1. Given the data X from a measurable space (X , F ), a (sta-tistical) model is a collection P of probability measures on (X , F ).

In some cases the model can be parametrized with a parameter from a param-eter space, say θ ∈ Θ; then the model can be denoted by P = {Pθ : θ ∈ Θ}.

The mapping θ → Pθ has to be one-to-one, in which case we say that P

is identifiable. In this thesis every model is parametrized, therefore we do need to check identifiability. A distinction is made between parametric and nonparametric models: in a parametric model the indexing parameter θ lies in a finite-dimensional spaces and in a nonparametric model θ lies in an infinite-dimensional parameter spaces.

We look at an example of both a parametric and of a nonparametric model. Note that, given a topological space (X , T ), the Borel σ-algebra is defined as the smallest σ-algebra containing all open sets. We denote the Borel σ-algebra of a space X by B(X ).

(9)

Example 1.2. Let the data X be defined on the measurable space (R, B(R)). Let the model equal the collection of all normal distributions P = {Nµ,σ2 :

θ = (µ, σ), µ ∈ R, σ ∈ R>0}. Then P is a parametric model as θ = (µ, σ) is

a finite dimensional vector.

We have to show that the model is identifiable: f(µ1,σ1)(x) = f(µ2,σ2)(x) has

to imply µ1, σ1 = (µ2, σ2) for each x, µ ∈ R and σ ∈ R>0.

f(µ1,σ1)(x) = f(µ2,σ2)(x) 1 σ1 √ 2πe −(x−µ1)2 σ2 1 = 1 σ2 √ 2πe −(x−µ2)2 σ2 2 1 σ1 √ 2πe −(x−µ1)2 σ2 1 = 1 σ2 √ 2πe −(x−µ2)2 σ2 2 − log σ1− (x−µ1)2 σ2 1 = − log σ2− (x−µ2)2 σ2 2

We recognize a second-order polynomial for x:

x2 1 σ2 1 − 1 σ2 2  + 2x µ2 σ2 2 − µ1 σ2 1  +µ 2 1 σ2 1 −µ 2 2 σ2 2 + log σ1− log σ2 = 0

As this equation has to hold for each x all coefficients have to equal zero. As σ > 0, 1/σ2

1 − 1/σ22 = 0 implies that σ1 = σ2. Combining this with

µ2/σ22− µ1/σ21 = 0, it follows that µ1 = µ2.

Example 1.3. Let the data X be defined on the measurable space (R, B(R)). Let the model P equal the collection of all probability measures on R. Then, the model is identifiable (as θ → Pθ is the identity function) and P is

non-parametric. Note that P contains the model studied in example 1.2.

In the rest of this thesis we do not parametrize by θ, but we let the model P equal the collection of all probability measures on (X , F ), as in the previous example. We continue by studying the prior. In the Bayesian view the data X has a random measure ˜p. The random measure ˜p a random variable on (P, B). The probability measure Π of ˜p, called the prior. We summarize this as follows,

X|˜p ∼ ˜p with ˜p ∼ Π. (1.1)

(10)

mea-1.2

Bayesian definitions

We define the prior and posterior. We then look at related definitions and see that Bayesian inference is a method to obtain the posterior from the prior Π. We also study the prior predictive and the posterior predictive distributions. To define the posterior we first look at a specific probability measure on ((X × Θ), σ(F × B)), which gives rise to the model distributions, the prior and the posterior among others. These definitions are not straightforward, because instead of a conditional probability on just the sample space (X , F ), we consider the product space X × Θ with product σ-algebra G = σ(F × B). Given the product space (X ×Θ), with the data X and ˜p be random variables as in equation (1.1), let a joint probability distribution for (X, ˜p) be given by

Π∗ : G → [0, 1], (1.2)

where Π∗ is not a product measure. The probability measure Π∗ provides a joint distribution for (X, ˜p), where X is the observation and ˜p the random variable associated with the parameter of the model.

The conditional distribution for X|˜p describes the distribution of Y , given the parameter θ. By looking at the conditional distribution of the data X given the parameter ˜p = θ we can identify the elements Pθ of the model P.

Definition 1.4. The distribution of the data X conditional on the parameter ˜

p is a map,

Π∗X| ˜p : F × Θ → [0, 1], called (a version of) the model distributions.

This definition leaves room for problems, described in Appendix B. To avoid these problems, one can make sure that Π∗X| ˜p is a regular conditional proba-bility (Definition B.2). By assuming that (X , F ) is a standard Borel space, the existence of a regular conditional probability Π∗X| ˜p is guaranteed by the-orem B.3 almost surely up to the marginal distribution of ˜p. This marginal distribution is known as the prior.

Definition 1.5. The marginal distribution Π : B → [0, 1] of the joint distri-bution Π∗ in equation (1.2) for ˜p is referred to as the prior (distribution).

(11)

The prior is a probability distribution on the model before any data is known. It represents a “degree of belief,” a measure that assigns probability to sub-sets of Θ. There are roughly two perspectives on assigning this probability: according to an objectivist the prior has to be assigned as objectively as possible, where a subjectivist lets the prior correspond to a personal belief. The other marginal distribution for X is known as the prior predictive dis-tribution, denoted by PΠ. The prior predictive distribution can be written

as a mixture of model and prior.

Definition 1.6. The marginal distribution of the joint distribution Π∗ in equation (1.2) for X is referred to as the prior predictive (distribution). Given A ∈ F , model distributions Pθ and a prior Π the prior predictive

distribution can be expressed as PΠ(A) =

Z

Θ

Pθ(A)dΠ(θ). (1.3)

When the data is observed we incorporate it into our “belief” for the prior, which then transitions from a “prior belief” into a “posterior belief.”

Definition 1.7. The distribution of the random variable ˜p conditional on the data X is a map,

Π∗p|X˜ : B × X → [0, 1]

called (a version of) the posterior (distribution). Given any B ∈ B and a realization of the data X = x, the notation for the posterior Π∗p|X˜ (B, x) is shortened to Π(B|X = x).

This definition leaves room for problems, described in Appendix B. To avoid these we require that the posterior is a regular conditional probability. It is then almost surely well-defined with respect to the prior predictive PΠ.

We will choose the model P = {Pθ : θ ∈ Θ} and Π, after which the measure

Π∗ arises as, for A ∈ F and B ∈ B, Π∗(A × B) =

Z

B

Pθ(A)dΠ(θ). (1.4)

In this case one does not condition to obtain model distributions and regu-larity does not play a role. If P∗(A) > 0 and Π∗(B) > 0, the posterior, prior

(12)

With a property that is referred to as a disintegration of the joint measure Π∗ on model times sample space (the product space), it follows that if PΠ(B) >

0, we have for the posterior distribution RAΠ∗p|X˜ (B, x)dPΠ(x) = Π(A, B).

Recall equation (1.4) so that we obtain for all A ∈ F and B ∈ B, Z A Π(B|X = x)dPΠ(x) = Z B Pθ(A)dΠ(θ). (1.6)

We can instead use the above equation as a way to define the posterior. Assuming that the model P is dominated, there exists an expression for the posterior in terms of model densities that satisfies the equation.

Lemma 1.8. Assume that the model P = {Pθ : θ ∈ Θ} is well-specified and

dominated by a σ-finite measure µ on (X , F ) with densities pθ = dPθ. Then,

for all B ∈ B, the posterior expressed as, Π(B|X = x) = R Bpθ(x)dΠ(θ) R Θpθ(x)dΠ(θ) , (1.7) satisfies equation (1.6).

Proof. Since the model is dominated, the prior predictive has a density with respect to µ. By applying Fubini’s theorem and the Radon-Nikodym theo-rem, we can rewrite the prior predictive for every A ∈ F ,

PΠ(A) = Z Θ Pθ(A)dΠ(θ) = Z Θ Z A pθ(x)dµ(x)dΠ(θ) = Z A Z Θ pθ(x)dΠ(θ)dµ(x).

So we can conclude that the prior predictive has a density function with respect to µ equal to RΘpθ(y)dΠ(θ). Let A ∈ F and B ∈ B be given.

Substituting (1.7) in equation (1.8), using the density of the prior predictive and applying Fubini’s theorem yields the desired result:

Z A Π(B|X = x)dPΠ(x) = Z A R Bpθ(x)dΠ(θ) R Θpθ(x)dΠ(θ) dPΠ(x) = Z A Z B pθ(x)dΠ(θ)dµ(x) = Z B Pθ(A)dΠ(θ).

When the model and prior are specified we can use equation (1.6) as the defining property of the posterior.

(13)

Definition 1.9. Given a model and a prior, any map π : B × X → [0, 1], such that x → π(B, x) is measurable on (X , F ) for all B ∈ B and that the following equation holds for each A ∈ F and B ∈ B,

Z A π(B, x)dPΠ(y) = Z B Pθ(A)dΠ(θ), (1.8)

is called a version of the posterior. The map π : B × X is unique up to null-sets of PΠ.

This definition again leaves room for problems, the posterior has to be defined by a regular conditional probability. However, Definition 1.9 does not imply the second property of B.2. To avoid these problems we let the posterior distribution be defined by a regular conditional probability. It is almost surely defined with respect to the prior predictive distribution PΠ. We will

see the consequence of this when we give a frequentist view on Bayesian inference. Note that one can show that the version of the posterior in lemma 1.8 is regular.

As the posterior distribution is a distribution on Θ given the data X, we can define another marginal distribution for X.

Definition 1.10. Let the posterior be as in Definition 1.9. The posterior predictive (distribution) is a data-dependent map, defined by

ˆ

P (X ∈ A) = Z

Θ

Pθ(A)dΠ(θ|X). (1.9)

The posterior predictive is defined uniquely for any A ∈ F up to null-sets of PΠ, since the posterior Π(·|X) is defined up to null-sets of PΠ.

As we define the posterior predictive distribution, it is not immediately clear that ˆP is actually a probability measure.

Lemma 1.11. The posterior predictive distribution, induced by ˆP : F → [0, 1], is a probability measure PΠ-almost surely.

Proof. As ˆP is defined PΠ-almost surely, there is a set E ∈ F with PΠ(E) = 1

(14)

ˆ P [ i≥1 Bi ! = Z Θ P [ i≥1 Bi ! dΠ(θ|X = x) = Z Θ X i≥1 P (Bi)dΠ(θ|X = x).

Upon applying Fubini’s theorem, since (θ, i) 7→ P (Bi) is a non-negative

mea-surable function, we obtain the desired result:

Z Θ X i≥1 P (Bi)dΠ(θ|X = x) = X i≥1 Z Θ P (Bi)dΠ(θ|X = x) = X i≥1 ˆ P (Bi).

1.3

A frequentist view on Bayesian inference

Now that the concept of Bayesian inference is clear for one variable X, we can look at applying Bayesian inference for n variables, X(n) = X

1, . . . , Xn. This

way we have for each n variables a posterior distribution Π(·|X(n)), hence the convergence of a sequence of posterior distributions can be studied. Contrary to the Bayesian assumption that the data has a random measure, we make the frequentist assumption that the data has a true probability distribution P0. We still apply Bayesian inference as if the data would have

a random measure. This way, we can see if the posterior converges to the true distribution (in a suitable way). It is said that we apply Bayesian inference from a frequentist perspective. In particular we are interested in which cases the posterior converges to the true distribution P0. Then we will say that

the posterior is consistent in chapter 3 we formalize this with definition 3.5. To study the limit behavior of probability distributions we use weak conver-gence.

Definition 1.12. Let Θ denote the set of all possible probability measures on a measurable space (X , F ). A sequence (Pn)n≥1 of probability measures

converges weakly to P , denoted by Pn P , if and only if for all bounded

continuous functions f : X → R the following limit holds, Z

f dPn →

Z

(15)

Recall that a function f : X → Y , where (X, dX) and (Y, dY) are metric

spaces, is called Lipschitz continuous if there exists a constant K ≥ 0 such that for all x, y ∈ X: dY(f (x), f (y)) ≤ KdX(x, y) and that Lipschitz

(con-tinuous) functions are uniformly continuous. An equivalent definition for 1.12 exists for weak convergence with bounded Lipschitz functions instead of bounded continuous functions by the Portmanteau theorem [5] (Theorem 11.3.3): a sequence of probability measures Pn P if and only if for all

bounded Lipschitz functions f : R f dPn→R f dP as n → ∞.

Let the data X be distributed as P0. Recall that if we would define a model

and a prior, the posterior is defined uniquely up to null-sets of the prior pre-dictive distribution PΠ. From our frequentist point of view we want to draw

statistical conclusions for the posterior based upon the data. Statements for the posterior are only P0 measurable, if the posterior is defined P0-a.s.

Suppose PΠ(A) = 0 for a set A ∈ F , while P

0(A) > 0. Then the posterior

can have any value on A. (As a result it would not make sense to talk about convergence of the posterior.) Therefore we require that any null-set of PΠ is a null-set of P0. We say that PΠ dominates P0 (P0  PΠ).

In many settings the data consists of n i.i.d. random variables with mea-sure P0. Then we denote the data by X(n) on the n-fold product space of

(X , F , P0). We will state the corresponding model distributions, prior

predic-tive distribution, posterior and posterior predicpredic-tive distribution. The prior Π from Definition 1.5 remains unchanged. Let the sample space be denoted by (Xn, Fn, Pn

0) and the parameter space by (Θ, B, Π).

The model distributions are for all (A1, . . . , An) ∈ Fn and θ ∈ Θ defined by

the regular conditional probability:

Π∗X| ˜p((X1 ∈ A1, . . . , Xn∈ An)|˜p = θ) = n

Y

i=1

Pθ(Ai). (1.11)

The prior predictive distribution is induced by the following measure for all (A1, . . . , An) ∈ Fn, PnΠ(X1 ∈ A1, . . . , Xn∈ An) = Z Θ n Y i=1 Pθ(Ai)dΠ(θ). (1.12)

(16)

The posterior is defined up to null-sets of PnΠ by a regular conditional prob-ability that is a solution to Bayes’ theorem for all (A1, . . . , An) ∈ Fn and

B ∈ B in the form of: Z (A1,...,An) Π(B|X1 = x1, . . . , Xn= xn)dPnΠ(x1, . . . , xn) = Z B n Y i=1 Pθ(Ai)dΠ(θ). (1.13) The condition P0  PΠ becomes P0n PnΠ. Finally we look at the posterior

predictive distribution, it is induced for any A ∈ F by the measure: P (Xn+1 ∈ A|X(n)) =

Z

Θ

Pθ(A)dΠ(θ|X(n)). (1.14)

The posterior predictive distribution can be used as a prediction for the probability distribution of Xn+1. Note that the random measure ˜p always

has the distribution equal to the prior. So based upon the data X(n), ˜p has

measure Π(·|X(n)) on (Θ, B). Combining this with equation (1.14) yields, E[˜p(A)|X(n)] =

Z

Θ

Pθ(A)dΠ(θ|X(n)) = P (Xn+1 ∈ A|X(n)). (1.15)

Similarly we see that the second moment of ˜p equals the posterior predictive distribution for Xn+2 and Xn+1:

E[˜p(A)2|X(n)] = Z

Θ

Pθ(A)2dΠ(θ|X(n)) = P (Xn+1∈ A, Xn+2∈ A|X(n)).

(1.16) Now that we have set up an adequate Bayesian framework we sketch its use in the following chapters. We will use some concepts that are not yet explained, these will be introduced in the corresponding chapters.

In the next chapter we study the posterior predictive distribution P (Xn+1∈

A|X(n)) for a prior based on a species sampling sequence. Then P (Xn+1 ∈

A|X(n)) can be expressed as formula depending on X(n) and the chosen

species sampling sequence. This formula for the conditional distribution of Xn+1 based upon X(n) is called a prediction rule.

In the third chapter we study the limit behavior of the posterior distribution based upon X(n). If the posterior converges to the true distribution (in a suitable way), we say that the posterior is consistent. Moreover, if the posterior is consistent the variation of ˜p converges to zero. We show this by studying the limit behavior of the prediction rule and applying equalities (1.15) and (1.16).

(17)

1.4

Applications of Bayesian inference

We will study two applications of Bayesian inference, one in species sampling and another in mixture modeling.

1.4.1

Species sampling

In species sampling Gibbs-type priors and other nonparametric priors are used. A series of observations X(n) := X1, . . . , Xn generated by a

nonpara-metric prior is interpreted as a random sample drawn from a large population of various species. Then, the next observation, Xn+1, equals a previously

ob-served species with a positive probability. The method was introduced by Pitman [14].

Statistical applications that use species sampling are based on a sample of species labels X(n) and study the statistical properties of Xn+1, . . . , Xn+m.

These quantities of interest are for example: the number of new distinct species that will be detected in the new sample of size m, the frequency of species detected in a sample size m or the probability that the (n + m + 1)-th draw will consist of a species having frequency l ≥ 0 in ˜Xn+1, . . . , ˜Xn+m. By

studying these problems on different sample sizes, measures are generated for overall and rare species diversity, which are of interest in biological, ecological, linguistic and other studies, as described in [2].

We generate a sample of observations using a Dirichlet process. We have used this Dirichlet process also in the introduction, a formal definitions can be found in the next chapter. Let there be n previous observations, X(n),

with k distinct values, X1∗, . . . , Xk∗. Recall that Xn+1 can be generated with a

measure given by the prediction rule of the Dirichlet process with parameters α and P∗ (denoted by DP (α, P∗)), P (Xn+1∈ ·|X(n)) = α α + nP ∗ (·) + k X i=1 ni α + nδXi∗(·). (1.17)

For example we look at the probability of observing a new species with a sample size of n using any Dirichlet process. We immediately see that, for a sample of size n the probability is given by α/(α + n).

(18)

5 10 15 20 k 0.05

0.10 0.15

PHkL

Figure 1.1: A random sample from a data set with 50 values is used to plot the probability of a new observation, calculated with a Dirichlet process DP (10, P∗). The different observations are numbered k = 1, . . . , 20 and a new observation equals observation k with probability P (k). Note that the probability of observing a new value is P (21) = 1/6 and that P∗ does not play a role, as P∗ is diffuse.

1.4.2

Mixture modeling

Mixture models are used in density estimation, clustering and also in more complex structures. Mixture models do not always use Bayesian inference, since each mixture model has both a frequentist and a Bayesian approach. The goal in a mixture model is to estimate a probability density function (PDF). We assume that the data X(n) is defined on the measurable space

(R, B(R)) equipped with the Lebesgue measure. We assume that each Xi

has a random probability distribution f conditioned on a random variable Yi, defined on a measurable space (Y, G). We model the Yi with a non

parametric prior. The distribution of Xi depends on a discrete value Yi often

the parameter of the PDF of Xi, making this a useful technique to estimate

many cases where each Xi has the same mixture distribution. The above

amounts to the following equations for i = 1, . . . , n: Xi|Yi ∼ f (·|Yi), independently,

(19)

We omit some mathematical details and continue with an intuitive explana-tion how the mixture modeling works. We express the prior predictive by the following integrals for any x ∈ X and A ∈ G:

P (Xi = x|Yi = y) =

Z

A

f (x|Yi = y)dP (Yi = y).

For Yi we recognize the familiar prior predictive distribution,

PΠ(Yi ∈ A) =

Z

Θ

Pθ(A)dΠ(θ).

We continue by letting the Xi be generated by a Dirichlet process (for more

information see 2.12). This model, the Dirichlet process mixture model, was introduced by Lo [13].

To make sure that the procedure works, let f (·|·) denote a kernel on (R, Y) such that R

Xf (x|y)dx = 1, for any y ∈ Y. (So f (x|y) defines a probability

density function on X , for every y ∈ Y.)

Using the Dirichlet process mixture model with Π = DP (α, P∗), we will intuitively derive the formula for the expected value given the data,

E[f (x|Yn+1)|X(n)]. (1.18)

Consider a partition P of the data X(n), where the parts consist of the similar

values Xi (sorted by ties). Let the probability of a partition P be denoted by

W (P) and let Ci denote the i-th set of the partition and let the realizations

in Ci be abbreviated by XCi, Ef (x|Yn+1)|X(n)  = Z Y f (x|y) dP (Yn+1 = y|X(n)) = Z Y f (x|y) d X P α α + nP ∗ (y) + k X i=1 ni α + nE[δYi∗(y)|XCi|P] ! W (P) ! = Z Y f (x|y)X P W (P) α α + n dP ∗ (y) + k X i=1 ni α + n dE[δYi∗(y)|XCi] ! .

(20)

Applying the definition of a conditional probability and the law of total probability yields dEδYi∗(y)|XCi = dP (Y ∗ i = y|XCi) = dP (XCi|Y ∗ i = y)P (Y ∗ i = y) P (XCi) = dP (XCi|Y ∗ i = y)P (Y ∗ i = y) R YdP (XCi|Y ∗ i = y)P (Y ∗ i = y) = Q j∈Cif (Xj, y)dP ∗(y) R Y Q j∈Cif (Xj, y)dP ∗(y).

Let (α)nequal the ascending factorial α(α+1) · · · (α+n−1). The probability

of observing a partition P generated by the Dirichlet process W (P) is given by: W (P) = Pφ(P) Pφ(P) with φ(P) = α kQk i=1(ni− 1)! (α)n . (1.19)

For now we will use it without explanation, an exact formula is explained later at equation (2.8). As is formally proved in Lo [13], we find an expression for (1.18) E[f (x|Yn+1)|X(n)] = α α + n Z Y f (x|y)dP∗(y) +X P W (P) k X i=1 ni α + n R Yf (x|y) Q j∈Cif (Xj, y)dP ∗(y) R Y Q j∈Cif (Xj, y)dP ∗(y) .

As we find an expression for E[f (x|Yn+1)|X(n)], we can make a grid point

estimation of the PDF.

To conclude, we make this theory more concrete by applying it to an example. Let the true distribution P0 be given by a standard normal distribution and

we use four values {1.2110, 0.6352, −0.7695, −1.3972} that are sampled from this distribution. The kernel we use is f (x|y) = √1

2πe

−1/2(x−y)2

, so f (x|y) is a density function of a normal distribution with mean y and standard deviation 1. We use a mesh from -8 to 8 with a step size of -.25 to plot the conditional expectations. First we plot P0.

(21)

-5 5 x 0.1 0.2 0.3 0.4 fHxL

Figure 1.2: The PDF of a standard normal probability distribution that will be estimated using mixture modeling.

Suppose that we only have a small sample but all values are between −4 and 4. Then, a logical choice would be to let the mean vary uniformly between −4 and 4: so we let P∗ = U [−4, 4]. Based on these assumptions we can plot

the prior and the posterior distributions.

-5 5 x 0.02 0.04 0.06 0.08 0.10 0.12 fHxL

Figure 1.3: The prior predictive distribution of X, where f (x) = E[f (x|Y )], f (x|y) = 1/√2πe−1/2(x−y)2 and Y ∼ U [−4, 4].

(22)

-5 5 x 0.05 0.10 0.15 0.20 fHxL -5 5 x 0.05 0.10 0.15 0.20 0.25 fHxL -5 5 x 0.05 0.10 0.15 0.20 0.25 fHxL -5 5 x 0.05 0.10 0.15 0.20 0.25 fHxL

Figure 1.4: Four posteriors predictive distributions of X, with a sample {X1, X2, X3, X4} = {1.2110, 0.6352, −0.7695, −1.3972}. Top left is based

on {X1}, top right on {X1, X2}, bottom left on {X1, X2, X3}, and bottom

right is based on all four samples {X1, X2, X3, X4}. Mixture modeling is

used to obtain the posterior with f (x) = E[f (x|Y )|X1, X2, X3, X4], f (x|y) = 1

√ 2πe

−1/2(x−y)2

and Y ∼ U [−4, 4]

Looking at the figures one sees that the estimates converges to the true probability distribution. Unfortunately this procedure is very computation-ally intensive (calculating the partitions is of factorial time). However, faster procedures do exist [12].

Note that instead of a Dirichlet process one can generalize the method such that the Yi are generated by a Gibbs-type process or any other species

(23)

Chapter 2

Gibbs-type priors

We lay down the mathematical foundation to define and study Gibbs-type priors and other discrete nonparametric priors. We characterize a Dirichlet process prior, a Pitman-Yor process prior and a Gibbs-type prior by a pre-diction rule. A species sampling sequence is defined by such a prepre-diction rule and a species sampling sequence is a stochastic process that characterizes the stochastic process priors.

To define a species sampling sequence, we first introduce random partitions. On these partitions we define an exchangeable partition probability function. We use this function to define the prediction probability function and with this prediction probability function we obtain a prediction rule.

2.1

Random Partitions

Given the finite set Nn:= {1, . . . , n}, consider any series of disjoint nonempty

subsets Ai ⊂ Nn with i = 1, . . . , k and ∪iAi = Nn. We call {A1, . . . , Ak}

a partition of Nn. Let the cardinality of Ai be denoted by ni. The set

{n1, . . . , nk} of natural numbers is said to be a partition of n. We order this

partition (n1, . . . , nk) to obtain a composition of n. We write Cn for the set

of all compositions of n and define a random partition in the following way. Definition 2.1. Given a finite set (Ω, F , P ), a random partition is a random variable Pn: Ω → Cn.

(24)

Definition 2.2. A random partition Pn, as in Definition 2.1, is called

ex-changeable if there exists a symmetric function p : Cn → [0, 1], such that for

every partition {A1, . . . , Ak} of Nn with |Ai| = ni for i = 1, . . . , n:

P (Pn= {A1, . . . , Ak}) = p(n1, . . . , nk).

The function p is referred to as the exchangeable partition probability func-tion (EPPF) for Pn. We abbreviate p(n1, . . . , nk) to p(n).

We proceed by studying a partition structure, which links exchangeable ran-dom partitions across n. A partition structure consists of an exchangeable random partition for each n.

By adding the n+1-th element to Pn, we embed Pn in Pn+1. Several ways

of doing this are possible. Consider the EPPF for the exchangeable random partition Pn = {A1, . . . , Ak}. If the n+1-th element is added to the Ai-th

set, we denote the EPPF for Pn+1 by p(n1, . . . , ni + 1, . . . , nk) = p(ni+).

If the n+1-th element forms a new subset Ak+1 we denote the EPPF by

p(n1, . . . , nk, 1) = p(nk+1+). We abuse the notation by using the same letter

p for both the EPPF for Pnand Pn+1, however the argument reveals to which

EPPF p refers.

We require these partition structures to satisfy a specific relation. Since an EPPF of a random partition Pnequals a probability, leaving out element n+1

in Pn+1 (and removing possible empty sets) should equal the probability of

Pn. This translates to the following relation for the EPPF of Pn and Pn+1,

p(n) =

k

X

i=1

p(ni+) + p(nk+1+). (2.1)

If a partition structure satisfies this relation for each EPPF, then the partition structure is called consistent.

Definition 2.3. Let Pn = {A1, . . . , Ak} be an exchangeable random

parti-tion of Nn. If leaving out elements {m + 1, . . . , n} of the sets {A1, . . . , Ak}

and removing possible empty sets results in an exchangeable random parti-tion Pm of Nm, Pn is called consistent.

We call such a consistent partition structure an infinite exchangeable random partition.

Definition 2.4. An infinite exchangeable random partition is a consistent sequence of random partitions (Pn: n ∈ N).

(25)

We will consider for a stochastic process the probability of observing a new value or a different value. For infinite exchangeable random partitions these probabilities are given by the the prediction probability function, defined below.

Definition 2.5. Given an EPPF p : ∪∞n=1Cn → [0, 1] of a consistent infinite

exchangeable random partition, the prediction probability function (PPF) is defined as

pi(n) =

p(ni+)

p(n) . (2.2)

By rewriting we see that equation (2.2) inductively specifies the EPPF of an infinite exchangeable random partition: p(ni+) = p

i(n)p(n). In fact, we can

specify an infinite exchangeable random partition by a collection of functions that satisfy two conditions.

Lemma 2.6. A collection of functions pi is a PPF of an infinite exchangeable

partition if and only if (p1(n), . . . , pk+1(n)) is a probability vector and for all

i, j = 1, 2, . . . , k + 1,

pi(n)pj(ni+) = pj(n)pi(nj+). (2.3)

Proof. First we show that for a collection of functions (pi)i≥1as in Definition

2.5, (p1(n), . . . , pk+1(n)) is a probability vector and equation (2.3) holds.

By equation (2.1), 1 = Pk+1

i=1 p(ni+)

p(n) = 1, combined with the fact that each

pi(n) ∈ [0, 1], it follows that (p1(n), . . . , pk+1(n)) is a probability vector.

Applying equation (2.2) twice yields p((nj+)i+) = p

i(nj+)pj(n)p(n) and

p((ni+)j+) = pj(ni+)pi(n)p(n). As p((nj+)i+) = p((ni+)j+), it also holds

that pi(n)pj(ni+) = pj(n)pi(nj+).

Conversely, we can define the EPPF given the PPF. Let p(1) := 1, p(2) = p1(1) and p(1, 1) := p2(1), it is clear that this p is an EPPF for P1 and

for P2. We define the EPPF for Pn with n ≥ 3 by equation (2.2), letting

p(ni+) = pi(n)p(n). We show by induction that this is in fact yields an EPPF

for Pn+1. i+

(26)

n+1-th element is removed. Let the n+1-th element of {A1, . . . , Ak} be in

Ai: P (Pn({A1, . . . , Ai\ {n + 1}, . . . , Ak})) = p(n1, . . . , nl) and let the n+1-th

element of {A∗1, . . . , A∗k} be in A∗ j: P (Pn= ({A∗1, . . . , A ∗ j\ {n + 1}, . . . , A ∗ k}) = p(n1, . . . , nq).

We remove a specific element from both random partitions Pn, in order to

use the symmetry of the EPPF for Pn−1. Consider aj ∈ Aσ−1(j) such that

|Aσ−1(j)\ {aj}| = |A∗j\ {n + 1}| and a∗i ∈ A∗

σ(i) such that |A ∗ σ(i)\ {a ∗ i}| = |Ai\ {n + 1}|, then P (Pn= ({A1, . . . , Ai\ {n + 1}, . . . , Aσ−1(j)\ {aj}, . . . , Ak})) = P (Pn = ({A∗1, . . . , A ∗ j \ {n + 1}, . . . , A ∗ σ(i)\ {a ∗ i}, . . . , A ∗ k})) = p(n1, . . . , nm).

Using equation (2.3) we obtain the result,

P (Pn = {A1, . . . , Ak}) = pi(n1, . . . , nl)p(n1, . . . , nl) = pi((n1, . . . , nm)j+)pj(n1, . . . , nm)p(n1, . . . , nm) = pj((n1, . . . , nm)i+)pi(n1, . . . , nm)p(n1, . . . , nm) = pj(n1, . . . , nq)p(n1, . . . , nq) = P (Pn = {A∗1, . . . , A ∗ k}).

A famous example of an infinite exchangeable random partition is the Chinese restaurant process.

Example 2.7. The Chinese restaurant process owes its name to the following metaphor. Consider a Chinese restaurant in which customers arrive one at a time. The first customer sits at any table, this table is labeled Table 1. The next customer chooses to sit either at table 1 with probability α/(α + 1), where α > 0, or at a new table, which will be labeled Table 2, with probability 1/(α + 1). In the general case, customer n + 1 finds upon arrival k tables occupied with ni customers at each table i, with i = 1, . . . , k. He chooses

to sit at a new table with probability α/(α + 1) and will sit at table i with probability ni/(α + n). In this scheme customers will be more likely to sit

at larger tables, this gravitational effect can be used in some applications to cluster variables.

A PPF is described by the Chinese restaurant process. Let α > 0 and let n = (n1, . . . , nk) be the cardinalities of the Ai from a partition {A1, . . . , Ak},

a previous value i is observed with probability: pi(n) =

ni

(27)

and a new value is observed with probability: pk+1(n) =

α

α + n. (2.5)

Using lemma 2.6, we can show that this collection of functions is in fact a PPF. Clearly (p1(n), . . . , pk+1(n)) is a probability vector: each pi(n) ≥ 0

and Pk+1

i=1 pi(n) = α+nα +

Pn

i=1 ni

α+n = 1. Also, for all i, j = 1, 2, . . . , k + 1,

the equality pi(n)pj(ni+) = pj(n)pi(nj+) has to hold as well. We distinguish

3 cases. If both i and j are equal to a previous value, both sides equal ninj/((α + n)(α + n + 1)). If either i or j is a new value, without loss of

generality say i is the new value, both sides equal αni/((α + n)(α + n + 1)). If

both i and j are new values, both sides are equal to α2/((α + n)(α + n + 1)).

2.2

Species Sampling Sequences

In this section we define a species sampling sequence by a prediction rule and study the random measure ˜p of a species sampling sequence. We will use theorems in which a finite or infinite sequence of the data X1, X2. . . is

exchangeable, we define such a sequence below.

Definition 2.8. A finite or infinite sequence of random variables X1, X2, . . .

is called exchangeable, if for any finite permutation of n elements, the joint distribution of Xσ(1), . . . , Xσ(n) is the same as the distribution of X1, . . . , Xn.

We assumed the data (Xn)n≥1 to be i.i.d.-P0. Clearly this implies that the

data is exchangeable.

Lemma 2.9. Let the data X1, X2, . . . be i.i.d.-P0 on (X , F ), then the data

is exchangeable.

Proof. Let σ be any permutation on Nn, for any (x1, . . . , xn) we have to

check if P (Xσ(1) ≤ x1, . . . , Xσ(n) ≤ xn) = P (X1 ≤ x1, . . . , Xn ≤ xn). By

independence of the Xi,

P (Xσ(1)≤ x1, . . . , Xσ(n) ≤ xn) = P (Xσ(1) ≤ x1) · · · P (Xσ(n) ≤ xn)

= P (X1 ≤ x1) · · · P (Xn ≤ xn)

(28)

It is interesting to note that for any exchangeable sequence of random vari-ables X1, X2, . . . the de Finetti representation theorem [9] (Theorem 12.3)

implies the existence of a random measure ˜q with distribution Q(θ) on (Θ, B) such that for X1, . . . , Xn and Ai ∈ F for i = 1, . . . , n:

P (X1 ∈ A1, . . . , Xn ∈ An) = Z Θ n Y i=1 Pθ(Ai)dQ(θ).

We define a species sampling sequence. Let the variable κn denote the

num-ber of different values given n observations. Recall that δx denotes a Dirac

measure.

Definition 2.10. A species sampling sequence is a sequence (Xi)i≥1 subject

to a prediction rule of the form

P (Xn+1 ∈ A|X(n)) = 1 − κn X i=1 pi(n) ! P∗(A) + κn X i=1 pi(n)δX∗ i(A), (2.6)

where P∗ is a diffuse probability measure on (X , F ) (P∗(x) = 0 for all x ∈ X ) called the prior guess and (pi(n))i≥1 is a PPF.

A species sampling sequence as above characterizes a stochastic process, as described by Hansen and Pitman [8] (Theorem 2).

Theorem 2.11. Let P∗ be a prior guess on a measurable space (X , F ). A prediction rule as in definition 2.10 defines an exchangeable sequence of random variables (Xn)n≥1. Such a sequence can be constructed by generating

new values i.i.d.-P∗ and letting Xn+1 have distribution P (Xn+1 ∈ ·|X(n)),

described by the prediction rule.

The stochastic process also characterizes a stochastic process prior ˜p with a distribution Π(·|X(n)) on (Θ, B). It is common practice to use without a

proof that the process prior ˜p is well defined, see [3] and [6], although it is suggested that Kolmogorov’s consistency theorem can be used to show such a result. However showing this is beyond the scope of this thesis and a possible subject for further research. It seems that only if every possible outcome of P0 is a possible outcome of the prior guess P∗ (the support of P0 is a subset

(29)

2.3

Discrete Nonparametric Priors

We define the Dirichlet process, Pitman-Yor process and the Gibbs-type (pro-cess) priors by the PPF and thus by a prediction rule as described in theorem 2.11.

Definition 2.12. The distribution Π(·|X(n)) on (Θ, B) of the random

mea-sure ˜p induced by the PPF of example 2.7, described by equation (2.4) and equation (2.5) with an α > 0, and a diffuse probability measure P∗on (X , F ) is called a Dirichlet process prior, denoted by DP (α, P∗).

The posterior predictive of a Dirichlet process prior is for any A ∈ F as follows, P (Xn+1 ∈ A|X(n)) = α α + nP ∗ (A) + κn X i=1 ni α + nδXi∗(A). (2.7)

Note that a generated sequence X1, X2, . . . generated by a species sampling

sequence with a random measure ˜p as in definition 2.12 is a Dirichlet process, see [14].

In the first chapter we used an expression for observing a partition P gener-ated by a Dirichlet process that is sorted by ties. The φ(P) in 1.19 can be calculated with the PPF. Consider the data X(n)with k distinct observations

X1∗, . . . , Xk∗, we sort by ties and translate this to the probability of observing a partition P with the PPF of example 2.7. A new value is k − 1 times observed, X1∗ is n1− 1 times observed, X2∗ is n2− 1 times observed etc. This

is a series of probabilities that we denote by φ(P),

φ(P) = α α + 1· · · α α + k − 1 × 1 α + k· · · n1− 1 α + k + n1− 1 · · · nk− 1 α + n − 1 = α kQk i=1(ni− 1)! α(n) .

Normalizing yields the probability of observing a specific partition W (P): W (P) = P P (Pn= P)

P∈AP (Pn = P)

(30)

Consider the PPF defined for α > 0 and σ ∈ [0, 1) and θ > −σ or σ < 0 with θ = m|σ| for some positive integer m. Let n = (n1, . . . , nk) be a composition

of n, a previous value i is observed with probability: pi(n) =

ni− σ

α + n and a new value is observed with probability,

pk+1(n) =

α + σk α + n .

Note that (p1(n), . . . , pk+1(n)) is a probability vector: clearly each pi(n) ≥ 0

and for i = 1, . . . , k + 1, Pn i=1pi(n) = α+σkα+n + Pn i=1 ni−σ α+n = 1. Also, for

all i, j = 1, 2, . . . , k + 1, pi(n)pj(ni+) = pj(n)pi(nj+) has to hold as well.

We distinguish 3 cases. If both i and j are equal to a previous value, both sides equal (ni − σ)(nj − σ)/((α + n)(α + n + 1)). If either i or j is a new

value, without loss of generality say i is the new value, both sides equal (ni− σ)(α + σk)/((α + n)(α + n + 1)). If both i and j are new values, both

sides are equal to (α + σk)(α + σ(k + 1))/((α + n)(α + n + 1)). So by lemma 2.6 this is in fact a PPF.

Definition 2.13. The measure Π on (Θ, B) of the random measure ˜p induced by the above PPF and a diffuse probability measure P∗ on (X , F ) is called a Pitman-Yor process prior.

The posterior predictive of a Pitman-Yor type prior is for any A ∈ F as follows, P (Xn+1∈ A|X(n)) = α + σκn α + n P (A) + 1 α + n κn X i=1 (ni− σ)δXi∗(A). (2.9)

By generalizing the Pitman-Yor process prior we arrive at Gibbs-type priors. Definition 2.14. The random measure obtained with theorem 2.11 by P∗ and a PPF, that is derived from the following EPPF as in theorem 2.11, is called a Gibbs-type prior. The EPPF is defined as follows, where (x)n equals

the ascending factorial x(x + 1) · · · (x + n − 1):

p(n1, . . . , nκn) := Vn,κn

κn

Y

i=1

(1 − σ)ni−1, (2.10)

and σ < 1 and Vn,κn satisfy the backwards recurrence relation:

(31)

The posterior predictive for a Gibbs-type prior is for any A ∈ F as follows: P (Xn+1 ∈ A|X(n)) = Vn+1,κn+1 Vn,κn P∗(A) + Vn+1,κn Vn,κn κn X i=1 (ni− σ)δXi∗(A). (2.12)

In [7] a complete characterization of Gibbs-type processes is given based upon σ. We focus in the next chapter on the case of σ < 0, then

Vn,κn = X x≥κn Qκn−1 i=1 (x|σ| + iσ) (x|σ| + 1)n−1 π(x). (2.13)

where π is any probability measure on N and the sum runs over all integers x ≥ κn and x < κncorresponds to π(x) = 0. We will show that a Gibbs-type

prior with σ < 0 converges weakly in some cases to the true measure δP0 on

(Θ, B).

Similar to [3], we describe a Gibbs-type prior with σ < 0 induced by the prediction rule (2.13) as a mixture model. For such a prior the number of species κn has distribution π(x) and the species X1∗, . . . X

κn have weights

˜

p1, . . . , ˜pκn that are distributed as an x-variate symmetric Dirichlet

distribu-tion Dir(|σ|, . . . , |σ|),

(˜p1, . . . , ˜pκn) ∼ Dir(|σ|, . . . , |σ|)

κn ∼ π

We conclude this section with an example of a Gibbs-type prior with Poisson mixing.

Example 2.15. The Gibbs-type prior with Poisson mixing is characterized by a prediction rule with any prior guess P∗, a σ = −1 and a Poisson mixing distribution π with parameter λ > 0 restricted to the positive integers:

π(x) = e

−λ

1 − e−λ

λx

(32)

Chapter 3

Limit behavior of Gibbs-type

priors

We study the limit behavior of the posterior obtained from a Gibbs-type prior. We assume that the data has a true distribution P0. It is possible

to show that this posterior converges weakly to a unique distribution almost surely. Moreover, in some cases this unique distribution equals the true distribution δP0 almost surely.

First we consider two limit theorems. We proceed with a convergence the-orem for semi-norms. Using these thethe-orems we show a convergence and a consistency theorem for Gibbs-type priors. We also look at the support of Gibbs-type priors.

3.1

Two limit theorems

We will show that the number of different values in a sample X(n), with Xi

i.i.d.-P0, converges. We express the number of different values in a sample

X(n) with κ n, κn:= 1 + n X i=1

1Di−1(Xi), with Di−1= {X1, . . . , Xi−1}

c

(3.1) For the limiting behavior of κn we make a distinction between a discrete

and diffuse P0. Note that for any probability measure P on (X , F ) there

exists a decomposition into a discrete measure P1 and a diffuse measure P2

(because F is a Borel σ-algebra it contains all singletons), such that for any A ∈ F : P (A) = P1(A) + P2(A). The discrete part consists of atoms xi with

(33)

diffuse part it holds that P2(x) = 0 for all x ∈ X . Note that for a discrete

probability measure P1(X ) = 1 and that for a diffuse probability measure

P2(X = 1).

Theorem 3.1. Let κn be as in (3.1) and Xi be i.i.d. random variables on

a measurable space (X , F ) with measure P0. If P0 is a discrete probability

measure

κn

n → 0 a.s.-P

0 , as n → ∞ (3.2)

and when P0 is a diffuse probability measure

κn = n a.s.-P0n. (3.3)

Proof. We will use the following measure theoretic result: if for all  > 0 the series P∞

n=1P (|Xn− X| > ) is convergent, then Xn → X P -a.s. So in the

case where P0 is a discrete measure, if for any  > 0 there exists an N such

that for n > N : P0∞(κn/n > ) = 0, then κn/n → 0 a.s.-P0∞as n → ∞.

As P0 is discrete, we can find m atoms such thatPmi=1Wi > 1−1. Let Xi∗ be

the random values associated with the atomic masses Wi and define ˜κm(x) :=

1{X∗

1,...,Xm∗}c(x). Since any sample X

(n) contains less different observations

than the sample (X1∗, . . . , Xm∗, X(n)),

κn≤ m + n X i=1 1{X∗ 1,...,Xm∗,X1,...,Xi−1}c(Xi) ≤ m + n X i=1 ˜ κm(Xi) a.s.-P0n. (3.4)

Note that E[˜κm(Xi)] =

P

i>mWi ∈ [0, 1] is finite so that by the strong law

of large numbers (LLN), lim n→∞ ∞ X i=1 ˜ κi n = X i>m Wi a.s.-P0∞.

So for each 2 > 0, there exists an integer N (2) > 0, such that for each

n > N2: |

Pn

i=1κ˜i(Xi)/n −

P

i>mWi| < 2 a.s.-P0n. This implies, using the

triangle inequality (|a| = |a − b + b| ≤ |a − b| + |b|),

n Xκ˜m(Xi) ≤  + XW <  +  a.s.-Pn.

(34)

let n > N with N large enough such that N > N (2), m/N < /3 and

|P∞

i=1κ˜i(Xi)/N −

P

i>mWi| < 2. Since the Xi are i.i.d.-P0 and by the

definition of the product measure, P0nκn n >   = P0∞κn n >   . By the previous inequalities {κn/n > } ⊆ {m/n +

Pn i=1κ˜m(Xi)/n > }, P0∞κn n >   ≤ P0∞ m n + Pn i=1κ˜m(Xi) n >   . From the previous choice of n it follows that,

m n + Pn i=1κ˜m(Xi) n <  a.s.-P n 0. So the event {Pn

i=1κ˜m(Xi)/n > } ⊆ { > } and clearly P ∞

0 ( > ) = 0. So

P∞

n=1P (|Xn− X| > ) ≤ N and thus κn/n → 0 a.s.-P0∞.

For any diffuse probability measure P0 the probability P0n(κn = n) = 1.

Suppose not, then for a certain n it holds that P (Xn ∈ {X1, . . . , Xn−1}) > 0

and a contradiction arises as for a certain i ∈ {1, . . . , n − 1}, Pn

0(Xn =

Xi) > 0, but by the assumption of diffuseness P0(x) = 0 should hold for any

x ∈ X .

In the following proofs a weighted empirical measure Pn,κn is used. We show

that it converges weakly to the empirical measure. Pn,κn is defined in the

fol-lowing way. Let there be n observations with κndistinct value: X1∗, . . . , Xκ∗n,

where nidenotes the number of observation Xi∗. Then the weighted empirical

measure Pn,κn is defined as,

˜ Pn,κn := 1 n − κnσ κn X i=1 (ni− σ)δX∗ i. (3.5)

Theorem 3.2. Let X1, X2, . . . be i.i.d. random variables with an either

dis-crete or diffuse probability measure P0 on a standard Borel space (X, F ) and

let ˜Pn,κn be as in (3.5). Then for any A ∈ F

˜

Pn,κn(A) → P0(A) a.s.-P

0 when n → ∞. (3.6)

Proof. Let the empirical measure be denoted by Pn(A) := 1/nPni=1δ{Xi∈A}

and let for any A ∈ F , ¯Pn(A) := 1/(n − κnσ)

Pn

i=1niδXi∗(A). Note that both

(35)

In the case where P0 is a diffuse probability measure, we show for any A ∈ F :

˜

Pn,κn(A) → Pn(A) a.s.-P

0 and Pn(A) → P0(A) a.s.-P0∞. When P0 is a

discrete measure, we show for any A ∈ F : P˜n,κn(A) → ¯Pn(A) a.s.-P

∞ 0 ,

¯

Pn(A) → Pn(A) a.s.-P0∞ and Pn(A) → P0(A) a.s.-P0∞.

Regardless of which P0 is used, we can apply the LLN to show that for any

A ∈ F : Pn(A) → P0(A) a.s.-P0∞. The Xi are i.i.d.-P0, so each δXi is an i.i.d.

random variable and E[δXi(A)] = P0(A). The LLN implies the following

convergence, Pn(A) = n X i=1 δ{Xi∈A} n → P0(A) a.s.-P ∞ 0 .

It remains to show the diffuse case that for any A ∈ F : ˜Pn,κn(A) → Pn(A)

a.s.-P0∞. So for any  > 0 it has to hold that there exists an integer N such that for each n > N :

P0∞(| ˜Pn,κn(A) − Pn(A)| > ) = 0.

In the case where P0 is a diffuse probability measure, for any n ∈ N, κn= n

a.s.-P0n by equation (3.3) and therefore also ni = 1 a.s.-P0n for each i. Hence

by the definitions of Pn,κn, Pn and the product measure P

∞ 0 , for any n ∈ N and A ∈ F , P0∞(| ˜Pn,κn(A) − Pn(A)| > ) = P n 0 1 − σ n − nσ n X i=1 δX∗ i(x) − Pn(x) >  ! = 0. Thus P∞ n=1P ∞

0 (| ˜Pn,κn(A) − Pn(A)| > ) = 0 and ˜Pn,κn(A) → Pn(A) a.s.-P

∞ 0

as n → ∞.

In remains to show for any A ∈ F in the discrete case: ˜Pn,κn(A) → ¯Pn(A)

a.s.-P0∞ and ¯Pn(A) → Pn(A) a.s.-P0∞. The result ˜Pn,κn(A) → ¯Pn(A) a.s.-P

∞ 0

follows if we show that for any  > 0 there exists an N > 0 such that for each n > N ,

(36)

Rewriting yields the following equations, | ˜Pn,κn(A) − ¯Pn(A)| = 1 n − κnσ κn X i=1 (ni− σ)δXi∗(A) − 1 n − κnσ κn X i=1 niδXi∗(A) = −σ n − κnσ κn X i=1 δXi∗(A).

As σ < 1 and κn ≤ n, n − κnσ > 0. Therefore, {| ˜Pn,κn(A) − ¯Pn(A)| > } ⊆

{| − σ|κn/(n − κnσ) > } and it remains to prove for large n, applying the

definition of the product measure P0∞, P0∞ | − σ|κn n − κnσ >   = P0n  |σ|κn n − κnσ >   = 0.

Note that from equation (3.2) it follows that there exists an N (1) such that

for each n > N (1), κn/n < 1. In the case of σ ≤ 0, let 1 < /|σ|. As a

result:  |σ|κn n − κnσ >   ⊆ |σ|κn n >   ⊆ {|σ|1 > } = ∅. So for n > N (1), P0n(| − σ|κn/(n − κnσ) > ) ≤ P0n(∅) = 0 as desired. In

the case of 1 > σ > 0, let 1 = /2σ ∧ 1/(σ). As a result:

 σκn n − κnσ >   ⊆  σκn n − nσ1 >   ⊆ σκ1 n 2n >   ⊆ {2σ1 > } = ∅. So for n > N (1), P0n(| − σ|κn/(n − κnσ) > ) ≤ P0n(∅) = 0 as desired. ¯

$\bar P_n(A) \to P_n(A)$ a.s.-$P_0^\infty$ follows if we show that for any $\epsilon > 0$ there exists an $N > 0$ such that for each $n > N$,
\[
P_0^\infty\big(|\bar P_n(A) - P_n(A)| > \epsilon\big) = 0.
\]
Rewriting gives rise to the following equalities,
\[
|\bar P_n(A) - P_n(A)| = \left|\frac{1}{n - \kappa_n\sigma}\sum_{i=1}^{\kappa_n} n_i\,\delta_{X_i^*}(A) - \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i}(A)\right| = \left|\frac{1}{n - \kappa_n\sigma}\sum_{i=1}^{n} \delta_{X_i}(A) - \frac{1}{n}\sum_{i=1}^{n} \delta_{X_i}(A)\right| = \left|\frac{\kappa_n\sigma}{n(n - \kappa_n\sigma)}\sum_{i=1}^{n} \delta_{X_i}(A)\right|.
\]
Recall that $n - \kappa_n\sigma > 0$, so $\{|\bar P_n(A) - P_n(A)| > \epsilon\} \subseteq \{\kappa_n n|\sigma|/(n(n - \kappa_n\sigma)) > \epsilon\}$ and it remains to prove that for large $n$, applying the definition of the product measure $P_0^\infty$,
\[
P_0^\infty\left(\frac{\kappa_n n|\sigma|}{n(n - \kappa_n\sigma)} > \epsilon\right) = P_0^n\left(\frac{\kappa_n n|\sigma|}{n(n - \kappa_n\sigma)} > \epsilon\right) = 0.
\]
Note that from equation (3.2) it follows that there exists an $N(\epsilon_1)$ such that for each $n > N(\epsilon_1)$, $\kappa_n/n < \epsilon_1$. In the case $\sigma \le 0$, let $\epsilon_1 < \epsilon/|\sigma|$ (again, the case $\sigma = 0$ is trivial). As a result:
\[
\left\{\frac{\kappa_n n|\sigma|}{n(n - \kappa_n\sigma)} > \epsilon\right\} \subseteq \left\{\frac{\kappa_n n|\sigma|}{n^2} > \epsilon\right\} \subseteq \left\{\frac{\kappa_n|\sigma|}{n} > \epsilon\right\} \subseteq \{|\sigma|\epsilon_1 > \epsilon\} = \emptyset.
\]
So for $n > N(\epsilon_1)$, $P_0^n\big(\kappa_n n|\sigma|/(n(n - \kappa_n\sigma)) > \epsilon\big) \le P_0^n(\emptyset) = 0$ as desired. In the case $0 < \sigma < 1$, let $\epsilon_1 < \epsilon/(2\sigma) \wedge 1/(2\sigma)$. As a result:
\[
\left\{\frac{\kappa_n n\sigma}{n(n - \kappa_n\sigma)} > \epsilon\right\} \subseteq \left\{\frac{\kappa_n\sigma}{n - n\sigma\epsilon_1} > \epsilon\right\} \subseteq \left\{\frac{2\kappa_n\sigma}{n} > \epsilon\right\} \subseteq \{2\sigma\epsilon_1 > \epsilon\} = \emptyset.
\]
So for $n > N(\epsilon_1)$, $P_0^n\big(\kappa_n n\sigma/(n(n - \kappa_n\sigma)) > \epsilon\big) \le P_0^n(\emptyset) = 0$ as desired, and this completes the proof.
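To make the measures appearing in this proof concrete, here is a minimal Python sketch (an illustration only; the choices of $P_0$, the set $A$ and the parameter $\sigma$ are arbitrary assumptions, not taken from the text). It computes $P_n(A)$, $\bar P_n(A)$ and $\tilde P_{n,\kappa_n}(A)$ for a diffuse and for a discrete $P_0$ and compares them with $P_0(A)$.

import math
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)
sigma = 0.5                     # Gibbs-type parameter sigma < 1 (illustrative choice)
n = 20000
A = (0.0, 2.0)                  # the set A = (0, 2)

def measures(sample, sigma, A):
    """Return P_n(A), bar P_n(A) and tilde P_{n,kappa_n}(A) for a given sample."""
    lo, hi = A
    n = len(sample)
    counts = Counter(sample)                  # distinct values X_i^* with frequencies n_i
    kappa = len(counts)
    P_n = sum(lo < v < hi for v in sample) / n
    bar_P = sum(n_i for v, n_i in counts.items() if lo < v < hi) / (n - kappa * sigma)
    tilde_P = sum(n_i - sigma for v, n_i in counts.items() if lo < v < hi) / (n - kappa * sigma)
    return P_n, bar_P, tilde_P

# Diffuse P_0 (standard normal): kappa_n = n almost surely, P_0(A) = Phi(2) - Phi(0).
x_diffuse = rng.standard_normal(n).tolist()
print(measures(x_diffuse, sigma, A), 0.5 * math.erf(2 / math.sqrt(2)))

# Discrete P_0 (Poisson(3)): kappa_n stays small, P_0(A) = P(X = 1) = 3 exp(-3).
x_discrete = rng.poisson(3, size=n).tolist()
print(measures(x_discrete, sigma, A), 3 * math.exp(-3))

In the diffuse case only $P_n(A)$ and $\tilde P_{n,\kappa_n}(A)$ track $P_0(A)$; $\bar P_n(A)$ enters the argument only when $P_0$ is discrete, exactly as in the proof above.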


Theorem 3.3. Convergence of a sequence $(P_n)_{n\ge1}$ to $P_0$ under the class of semi-norms defined by $\|P_1 - P_2\|^2_A$, with $A = \{A_i\}_{i=1}^N$ any finite sequence of measurable disjoint intervals, implies weak convergence on a standard Borel space $(\mathcal{X}, \mathcal{F})$.

Proof. Since $(\mathcal{X}, \mathcal{F})$ is assumed to be a standard Borel space, there exists a countable dense subset $\{x_i : x_i \in \mathcal{X},\, i \in \mathbb{N}\}$. Therefore a countable covering by arbitrarily small $\delta$-balls $U_\delta = \bigcup_i \{x : |x - x_i| < \delta\}$, with $\delta > 0$, can be constructed. Suppose that $U_\delta$ is not a covering of $\mathcal{X}$; then there exist a point $x^*$ outside of $U_\delta$ and an $\epsilon > 0$ such that $\{x : |x - x^*| < \epsilon\} \cap \mathcal{X}$ is an open set containing no point of $\{x_i\}$. But $\{x_i\}$ was dense, so we arrive at a contradiction. We construct a measurable partition $A' = \{A'_i\}_{i=1}^\infty$ by letting
\[
A'_i = \{x : |x - x_i| < \delta\} \setminus \bigcup_{j=1}^{i-1} A'_j, \qquad i = 1, 2, \ldots.
\]

Since we work on a standard Borel space, $P_0$ is a tight measure by theorem A.11. Therefore a compact measurable subset $K$ of $\mathcal{X}$ exists such that $P_0(K) > 1 - \epsilon_1$. By compactness finitely many of the $\delta$-balls, and hence finitely many elements of $A'$, cover $K$; denote this finite subfamily by $A = \{A_i\}_{i=1}^{N(\delta)}$. By the convergence under the semi-norm there exists an integer $M' > 0$ such that for $n > M'$, $\|P_n - P_0\|^2_A < \epsilon_2$ holds $P_0^n$-a.s.

We can show weak convergence, as described below Definition 1.12, by letting $\epsilon > 0$ and looking at an arbitrary bounded Lipschitz function $f$. Let $f$ be bounded by $M$; from the uniform continuity of $f$ it follows that $|x - y| < \delta$ implies $|f(x) - f(y)| < \epsilon_3$ for $\delta$ small enough. So we can construct a simple function $f^*$ with
\[
f^*(x) = f(x_i), \qquad \text{for each } x \in A_i \text{ and } i = 1, \ldots, N(\delta),
\]
such that for each $x \in A_i$: $f(x) = f(x_i) + \epsilon_x$ with $|\epsilon_x| < \epsilon_3$. Thus, for any probability measure $P^*$,
\[
\left|\sum_{i=1}^{N} \int_{A_i} f \, dP^*\right| \le \left|\sum_{i=1}^{N} \int_{A_i} f^*(x) \, dP^*\right| + \left|\sum_{i=1}^{N} \int_{A_i} \epsilon_x \, dP^*\right| \le \sum_{i=1}^{\infty} |f(x_i)|\,P^*(A_i) + \epsilon_3.
\]


For any $\epsilon > 0$, let $\epsilon_1 < \epsilon/(6M)$, $\epsilon_2 < (\epsilon/(3M))^2$ and $\epsilon_3 < \epsilon/6$ ($\delta$ depends on $\epsilon_3$). We finish the proof using the above equation,
\[
\left|\int f \, dP_n - \int f \, dP_0\right| < \sum_{i=1}^{N(\delta)} \left|\int_{A_i} f \, dP_n - \int_{A_i} f \, dP_0\right| + 2M\epsilon_1 < \sum_{i=1}^{N(\delta)} M\,|P_n(A_i) - P_0(A_i)| + 2\epsilon_3 + 2M\epsilon_1 \le M\sqrt{\epsilon_2} + 2\epsilon_3 + 2M\epsilon_1 < \epsilon.
\]
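The key device in this proof is the simple function $f^*$ built from the partition. The following minimal Python sketch is purely illustrative (the function $f$, the partition of $[-5,5]$ and the sampling distribution are arbitrary choices, not taken from the text); it checks numerically that replacing $f$ by $f^*$ changes the integral by at most roughly the Lipschitz constant times half the mesh of the partition.

import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.minimum(1.0, np.abs(x))        # a bounded Lipschitz function, |f| <= M = 1
grid = np.linspace(-5.0, 5.0, 201)              # delta-partition of [-5, 5], mesh delta = 0.05
centers = 0.5 * (grid[:-1] + grid[1:])          # the points x_i defining f*

x = rng.standard_normal(5000)                   # sample defining an empirical measure P
cell_probs, _ = np.histogram(x, bins=grid)
cell_probs = cell_probs / len(x)                # P(A_i) for the partition cells A_i

integral_f = np.mean(f(x))                      # integral of f under P
integral_f_star = np.sum(f(centers) * cell_probs)   # integral of the simple function f*
print(abs(integral_f - integral_f_star))        # small: about Lip(f) * delta / 2 at most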

Recall that by the EPPF of a Gibbs-type prior (2.10), the probability of observing a new value after $n$ observations equals $V_{n+1,\kappa_n+1}/V_{n,\kappa_n}$. In the next theorem we assume that the following condition holds:
\[
\frac{V_{n+1,\kappa_n+1}}{V_{n,\kappa_n}} \text{ converges a.s.-}P_0^\infty \text{ to a limit } \alpha \in [0,1] \text{ as } n \to \infty. \tag{3.7}
\]
One can interpret this fraction as the probability of a new observation at step $n+1$.
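For intuition, consider the two-parameter (Pitman-Yor) subclass, for which the ratio takes the form $V_{n+1,\kappa_n+1}/V_{n,\kappa_n} = (\theta + \kappa_n\sigma)/(\theta + n)$; the Dirichlet process is the case $\sigma = 0$ with ratio $\theta/(\theta+n)$. The Python sketch below is an illustration only (the Pitman-Yor form of the ratio and the choices of $P_0$, $\theta$ and $\sigma$ are assumptions, not taken from the text); it tracks the ratio along simulated samples, suggesting $\alpha = \sigma$ for a diffuse $P_0$ (where $\kappa_n = n$) and $\alpha = 0$ for a $P_0$ with finitely many atoms (where $\kappa_n$ stays bounded).

import numpy as np

rng = np.random.default_rng(2)
theta, sigma = 1.0, 0.5                          # illustrative Pitman-Yor parameters

def new_value_prob(sample, theta, sigma):
    """(theta + kappa_n * sigma) / (theta + n): the ratio V_{n+1,kappa_n+1}/V_{n,kappa_n}."""
    n, kappa = len(sample), len(set(sample))
    return (theta + kappa * sigma) / (theta + n)

draws = [("diffuse P0, N(0,1)", lambda m: rng.standard_normal(m).tolist()),
         ("discrete P0, uniform on 5 atoms", lambda m: rng.integers(0, 5, size=m).tolist())]
for label, draw in draws:
    for n in (10, 100, 1000, 10000):
        print(label, n, round(new_value_prob(draw(n), theta, sigma), 4))
# The diffuse case approaches alpha = sigma = 0.5; the discrete case approaches alpha = 0.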

We proceed by showing in detail the weak convergence of the posterior to a Dirac measure almost surely for a Gibbs-type prior. The requirement of full support for the prior will be explained in section 3.4.

Theorem 3.4. Let $\tilde p$ be the random measure of a Gibbs-type prior with prior guess $P^* = E[\tilde p]$ and full support. Let $\mathcal{X}$ be a separable complete metric space and assume condition (3.7) holds true. Moreover, let $(X_i)_{i\ge1}$ be a sequence of i.i.d. random variables from some probability distribution $P_0$ which is either discrete or diffuse. Then the posterior converges weakly, a.s.-$P_0^\infty$, to a Dirac measure at $\alpha P^*(\cdot) + (1-\alpha)P_0(\cdot)$.

Proof. The main idea of this proof is that under (3.7) the variance of $\tilde p$, given a sample $X^{(n)}$ featuring $\kappa_n \le n$ distinct values, converges to 0 a.s.-$P_0^\infty$; together with the convergence of the posterior expectation, this yields the stated limit under the assumed condition. We show the weak convergence of $\tilde p$ to its expectation by deducing an upper bound for its variance $\sum_{i=1}^N \mathrm{Var}[\tilde p(A_i) \mid X^{(n)}]$, where $A = \{A_i\}_{i=1}^N$, with $A_i$ measurable disjoint open sets. To see that this actually implies weak convergence, we rewrite the sum of posterior variances and invoke theorem 3.3. By the linearity of the expectation,

\[
\sum_{i=1}^N \mathrm{Var}[\tilde p(A_i) \mid X^{(n)}] = \sum_{i=1}^N \Big( E[\tilde p(A_i)^2 \mid X^{(n)}] - E[\tilde p(A_i) \mid X^{(n)}]^2 \Big) = E\Big[\|\tilde p - E[\tilde p \mid X^{(n)}]\|^2_A \,\Big|\, X^{(n)}\Big].
\]

Weak convergence of $\tilde p$ to its expectation implies that $Q_n$ converges weakly to a Dirac measure. We use the bounded Lipschitz (BL) distance $d_{BL}$ described in Appendix A.1 (the BL distance induces the topology on $\Theta = \mathcal{P}$, the set of all probability measures on $\mathcal{X}$). Weak convergence implies that $d_{BL}(\tilde p, \alpha P^* + (1-\alpha)P_0) < \epsilon_1$ a.s.-$P_0^n$, see Dudley [5] (Theorem 11.3.3). Let $U := \{P : d_{BL}(P, \alpha P^* + (1-\alpha)P_0) < \epsilon_1\}$; then by definition of the posterior, $Q_n(U) = 1$ a.s.-$P_0^n$. Since we are considering probability distributions on the parameter space, we can also study weak convergence there.

Let $\epsilon > 0$ and let $f$ be a bounded Lipschitz function; then for every $\epsilon_2 > 0$ there exists a $\delta > 0$ such that $d_{BL}(x, y) < \delta$ implies $|f(x) - f(y)| < \epsilon_2$, and there exists an $M > 0$ such that $|f| < M$. Choose $\epsilon_2 < \epsilon/M$ and $\epsilon_1 = \delta$; then, as desired, $Q_n$ converges weakly a.s.-$P_0^n$:
\[
\left|\int f \, dQ_n - \int f \, d\delta_{\alpha P^* + (1-\alpha)P_0}\right| \le \int_U \epsilon_2 M \, dQ_n < \epsilon. \tag{3.8}
\]

By equation (1.15) and the posterior predictive for Gibbs-type priors (2.12),
\[
E[\tilde p(A_i) \mid X^{(n)}] = P(X_{n+1} \in A_i \mid X^{(n)}) = \frac{V_{n+1,\kappa_n+1}}{V_{n,\kappa_n}} P^*(A_i) + \left(1 - \frac{V_{n+1,\kappa_n+1}}{V_{n,\kappa_n}}\right) \tilde P_{n,\kappa_n}(A_i).
\]
It follows from condition (3.7) that $V_{n+1,\kappa_n+1}/V_{n,\kappa_n}$ converges to $\alpha$ a.s.-$P_0^\infty$, and by theorem 3.2, $\tilde P_{n,\kappa_n}(A_i)$ converges to $P_0(A_i)$ a.s.-$P_0^\infty$ for all $A_i \in \mathcal{F}$. Thus the posterior expected value converges a.s.-$P_0^\infty$ to $\alpha P^*(\cdot) + (1-\alpha)P_0(\cdot)$.
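A quick numerical illustration of this step (not part of the proof; it assumes the Pitman-Yor form of the ratio, $(\theta+\kappa_n\sigma)/(\theta+n)$, and arbitrary choices of $A$, $P^*(A)$ and $P_0$): for a diffuse $P_0$ one has $\kappa_n = n$, the ratio tends to $\alpha = \sigma$, and the displayed posterior mean tends to $\sigma P^*(A) + (1-\sigma)P_0(A)$.

import math
import numpy as np

rng = np.random.default_rng(3)
theta, sigma = 1.0, 0.25                      # illustrative Pitman-Yor parameters
P_star_A = 0.3                                # P*(A): an arbitrary prior-guess probability of A
P0_A = 0.5 * math.erf(1 / math.sqrt(2))       # P_0(A) for A = (0, 1) and P_0 = N(0, 1)

for n in (100, 1000, 10000, 100000):
    x = rng.standard_normal(n)
    count_A = int(np.sum((0.0 < x) & (x < 1.0)))
    kappa = n                                 # diffuse P_0: all observations distinct a.s.
    ratio = (theta + kappa * sigma) / (theta + n)             # V_{n+1,kappa+1}/V_{n,kappa}
    tilde_P_A = (1 - sigma) * count_A / (n - kappa * sigma)   # tilde P_{n,kappa_n}(A)
    print(n, round(ratio * P_star_A + (1 - ratio) * tilde_P_A, 4))
print("limit:", round(sigma * P_star_A + (1 - sigma) * P0_A, 4))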

It remains to show for any partition $A$, as $n \to \infty$:
\[
\sum_{i=1}^{N} \mathrm{Var}[\tilde p(A_i) \mid X^{(n)}] \to 0 \quad \text{a.s.-}P_0^\infty.
\]

We simplify some notation: instead of $A_i$ we write $A$, and for $a, b, c, d \in \{0, 1, \ldots\}$,
\[
g^{a,b}_{c,d}(n) = \frac{V_{n+a,\kappa_n+b}}{V_{n+c,\kappa_n+d}}. \tag{3.9}
\]
So that the expected value of $\tilde p$ becomes,
\[
E[\tilde p(A) \mid X^{(n)}] = g^{1,1}_{0,0}(n)P^*(A) + g^{1,0}_{0,0}(n)(n - \kappa_n\sigma)\tilde P_{n,\kappa_n}(A). \tag{3.10}
\]
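As a concrete check of this notation, here is a small Python sketch (illustration only; it assumes the Dirichlet-process weights $V_{n,k} = \theta^k/(\theta)_n$, i.e. the Gibbs-type case $\sigma = 0$). Under that assumption $g^{1,1}_{0,0}(n) = \theta/(\theta+n)$, $g^{1,0}_{0,0}(n) = 1/(\theta+n)$, and (3.10) reduces to the familiar Dirichlet-process posterior mean $(\theta P^*(A) + nP_n(A))/(\theta+n)$.

import math

def log_V(n, k, theta):
    """log V_{n,k} for the Dirichlet process: V_{n,k} = theta^k / (theta)_n (assumption)."""
    return k * math.log(theta) - (math.lgamma(theta + n) - math.lgamma(theta))

def g(a, b, c, d, n, kappa, theta):
    """g^{a,b}_{c,d}(n) = V_{n+a,kappa+b} / V_{n+c,kappa+d}, computed on the log scale."""
    return math.exp(log_V(n + a, kappa + b, theta) - log_V(n + c, kappa + d, theta))

theta, n, kappa = 2.5, 40, 7
print(g(1, 1, 0, 0, n, kappa, theta), theta / (theta + n))      # both equal theta/(theta+n)
print(g(1, 0, 0, 0, n, kappa, theta), 1 / (theta + n))          # both equal 1/(theta+n)

# (3.10) with sigma = 0, so n - kappa*sigma = n and tilde P = P_n:
P_star_A, P_n_A = 0.3, 0.45
print(g(1, 1, 0, 0, n, kappa, theta) * P_star_A + g(1, 0, 0, 0, n, kappa, theta) * n * P_n_A,
      (theta * P_star_A + n * P_n_A) / (theta + n))             # the two values coincide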

For the second moment of $\tilde p$ we apply equation (1.16) and condition on $X_{n+1}$,
\[
E[\tilde p(A)^2 \mid X^{(n)}] = P(X_{n+2} \in A, X_{n+1} \in A \mid X^{(n)}) = \int_A P(X_{n+2} \in A \mid X^{(n)}, X_{n+1} = x)\, P(X_{n+1} = x \mid X^{(n)})\, dx.
\]
We continue rewriting, where we use the posterior predictive of a Gibbs-type prior (2.12). As usual let the distinct observations in $X^{(n)}$ be denoted by $X_1^*, \ldots, X_{\kappa_n}^*$.
\begin{align*}
E[\tilde p(A)^2 \mid X^{(n)}] ={}& \int_A g^{1,1}_{0,0}(n)\, P(X_{n+2} \in A \mid X^{(n)}, X_{n+1} = x)\, dP^*(x) \\
&+ \int_A g^{1,0}_{0,0}(n) \sum_{i=1}^{\kappa_n} (n_i - \sigma)\, P(X_{n+2} \in A \mid X^{(n)}, X_{n+1} = x)\, d\delta_{X_i^*}(x) \\
={}& g^{1,1}_{0,0}(n) \int_A \Big( g^{2,2}_{1,1}(n) P^*(A) + g^{2,1}_{1,1}(n) \sum_{i=1}^{\kappa_n+1} (n_i - \sigma)\,\delta_{X_i^*}(A) \Big)\, dP^*(x) \\
&+ g^{1,0}_{0,0}(n) \sum_{i=1}^{\kappa_n} (n_i - \sigma)\,\delta_{X_i^*}(A) \Big( g^{2,1}_{1,0}(n) P^*(A) + g^{2,0}_{1,0}(n) \sum_{j=1}^{\kappa_n} (n_j + \delta_i(j) - \sigma)\,\delta_{X_j^*}(A) \Big) \\
={}& g^{2,2}_{0,0}(n) P^*(A)^2 + g^{2,1}_{0,0}(n) \sum_{i=1}^{\kappa_n} (n_i - \sigma)\,\delta_{X_i^*}(A)\, P^*(A) + g^{2,1}_{0,0}(n)(1-\sigma) P^*(A) \\
&+ g^{2,1}_{0,0}(n) \sum_{i=1}^{\kappa_n} (n_i - \sigma)\,\delta_{X_i^*}(A)\, P^*(A) + g^{2,0}_{0,0}(n) \sum_{i=1}^{\kappa_n} (n_i - \sigma)\,\delta_{X_i^*}(A) \sum_{j=1}^{\kappa_n} (n_j + \delta_i(j) - \sigma)\,\delta_{X_j^*}(A).
\end{align*}

We rewrite the expression further using equation (3.5),
\begin{align}
E[\tilde p(A)^2 \mid X^{(n)}] ={}& g^{2,2}_{0,0}(n) P^*(A)^2 + 2 g^{2,1}_{0,0}(n)(n - \kappa_n\sigma)\tilde P_{n,\kappa_n}(A) P^*(A) + g^{2,1}_{0,0}(n)(1-\sigma) P^*(A) \nonumber \\
&+ g^{2,0}_{0,0}(n)(n - \kappa_n\sigma)^2 \tilde P_{n,\kappa_n}(A)^2 + g^{2,0}_{0,0}(n)(n - \kappa_n\sigma)\tilde P_{n,\kappa_n}(A). \tag{3.11}
\end{align}
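As a sanity check of (3.11), the following sketch is an illustration only: it again assumes the Dirichlet-process weights $V_{n,k} = \theta^k/(\theta)_n$ ($\sigma = 0$), together with the standard Dirichlet-process facts that the posterior has total mass $\theta+n$ and mean $m = (\theta P^*(A)+nP_n(A))/(\theta+n)$, so that the second moment of $\tilde p(A)$ has the closed form $m(1 + (\theta+n)m)/(\theta+n+1)$. Evaluating the right-hand side of (3.11) numerically reproduces this closed form.

import math

def log_V(n, k, theta):          # Dirichlet-process weights V_{n,k} = theta^k / (theta)_n (assumption)
    return k * math.log(theta) - (math.lgamma(theta + n) - math.lgamma(theta))

def g(a, b, c, d, n, kappa, theta):
    return math.exp(log_V(n + a, kappa + b, theta) - log_V(n + c, kappa + d, theta))

theta, n, kappa = 2.5, 40, 7
P_star_A, P_tilde_A = 0.3, 0.45          # P*(A) and tilde P_{n,kappa_n}(A) = P_n(A) when sigma = 0

# right-hand side of (3.11) with sigma = 0, so (n - kappa*sigma) = n and (1 - sigma) = 1
rhs = (g(2, 2, 0, 0, n, kappa, theta) * P_star_A**2
       + 2 * g(2, 1, 0, 0, n, kappa, theta) * n * P_tilde_A * P_star_A
       + g(2, 1, 0, 0, n, kappa, theta) * P_star_A
       + g(2, 0, 0, 0, n, kappa, theta) * n**2 * P_tilde_A**2
       + g(2, 0, 0, 0, n, kappa, theta) * n * P_tilde_A)

m = (theta * P_star_A + n * P_tilde_A) / (theta + n)      # Dirichlet-process posterior mean of p(A)
print(rhs, m * (1 + (theta + n) * m) / (theta + n + 1))   # the two values coincide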

We introduce two new variables:
\[
I(n, \kappa_n) := 1 - g^{2,1}_{1,1}(n)\, g^{0,0}_{1,0}(n) \tag{3.12}
\]
and
\[
W_{n,\kappa_n}(A) := E[\tilde p(A)^2 \mid X^{(n)}] - g^{2,1}_{1,1}(n)\, g^{0,0}_{1,0}(n)\, E[\tilde p(A) \mid X^{(n)}]^2.
\]
So that we can rewrite the variance into:
\[
\mathrm{Var}[\tilde p(A) \mid X^{(n)}] = -I(n, \kappa_n)\, E[\tilde p(A) \mid X^{(n)}]^2 + W_{n,\kappa_n}(A).
\]

We also rewrite $W_{n,\kappa_n}(A)$, by first substituting the expression derived for $E[\tilde p(A)\mid X^{(n)}]$ (3.10) and then substituting the expression for $E[\tilde p(A)^2\mid X^{(n)}]$ (3.11):
\begin{align*}
W_{n,\kappa_n}(A) ={}& E[\tilde p(A)^2\mid X^{(n)}] - g^{2,1}_{1,0}(n)g^{1,1}_{0,0}(n)P^*(A)^2 - 2g^{2,1}_{0,0}(n)(n-\kappa_n\sigma)\tilde P_{n,\kappa_n}(A)P^*(A) \\
&- g^{2,1}_{1,1}(n)g^{1,0}_{0,0}(n)(n-\kappa_n\sigma)^2\tilde P_{n,\kappa_n}(A)^2 \\
={}& g^{1,0}_{0,0}(n)(n-\kappa_n\sigma)\tilde P_{n,\kappa_n}(A)\Big( \big(g^{2,0}_{1,0}(n)-g^{2,1}_{1,1}(n)\big)(n-\kappa_n\sigma)\tilde P_{n,\kappa_n}(A) + g^{2,0}_{1,0}(n) \Big) \\
&+ g^{1,1}_{0,0}(n)P^*(A)\Big( \big(g^{2,2}_{1,1}(n)-g^{2,1}_{1,0}(n)\big)P^*(A) + g^{2,1}_{1,1}(n)(1-\sigma) \Big).
\end{align*}
By lemma C.1 and lemma C.2:
\begin{align*}
W_{n,\kappa_n}(A) ={}& g^{1,0}_{0,0}(n)(n-\kappa_n\sigma)\tilde P_{n,\kappa_n}(A)\Big( \big(g^{2,0}_{1,0}(n)-g^{2,1}_{1,1}(n)\big)(n-\kappa_n\sigma)\tilde P_{n,\kappa_n}(A) + I(n,\kappa_n) - \big(g^{2,0}_{1,0}(n)-g^{2,1}_{1,1}(n)\big)(n-\kappa_n\sigma) \Big) \\
&+ g^{1,1}_{0,0}(n)P^*(A)\Big( \big(g^{2,2}_{1,1}(n)-g^{2,1}_{1,0}(n)\big)P^*(A) + I(n,\kappa_n) - \big(g^{2,2}_{1,1}(n)-g^{2,1}_{1,0}(n)\big) \Big) \\
={}& I(n,\kappa_n)E[\tilde p(A)\mid X^{(n)}] \\
&+ g^{1,0}_{0,0}(n)(n-\kappa_n\sigma)^2\big(g^{2,0}_{1,0}(n)-g^{2,1}_{1,1}(n)\big)\tilde P_{n,\kappa_n}(A)\big(\tilde P_{n,\kappa_n}(A)-1\big) \\
&+ g^{1,1}_{0,0}(n)\big(g^{2,2}_{1,1}(n)-g^{2,1}_{1,0}(n)\big)P^*(A)\big(P^*(A)-1\big),
\end{align*}
where in the last step the first term was collected using (3.10).

We introduce the variable $Z_{n,\kappa_n}(A)$, where for any $a \in \mathbb{R}$, $a^+ := \max\{a, 0\}$,
\[
Z_{n,\kappa_n}(A) := g^{1,0}_{0,0}(n)(n-\kappa_n\sigma)^2\big(g^{2,0}_{1,0}(n) - g^{2,1}_{1,1}(n)\big)^+\,\tilde P_{n,\kappa_n}(A) + g^{1,1}_{0,0}(n)\big(g^{2,2}_{1,1}(n) - g^{2,1}_{1,0}(n)\big)^+\,P^*(A).
\]
Clearly $W_{n,\kappa_n}(A) \le I(n,\kappa_n)E[\tilde p(A)\mid X^{(n)}] + Z_{n,\kappa_n}(A)$, so that
\[
\mathrm{Var}[\tilde p(A)\mid X^{(n)}] \le I(n,\kappa_n)E[\tilde p(A)\mid X^{(n)}]\big(1 - E[\tilde p(A)\mid X^{(n)}]\big) + Z_{n,\kappa_n}(A).
\]
Again by lemma C.1 and lemma C.2:
\[
Z_{n,\kappa_n}(A) = g^{1,0}_{0,0}(n)(n-\kappa_n\sigma)\big(g^{2,0}_{1,0}(n) - I(n,\kappa_n)\big)^+\,\tilde P_{n,\kappa_n}(A) + g^{1,1}_{0,0}(n)\big(g^{2,1}_{1,1}(n)(1-\sigma) - I(n,\kappa_n)\big)^+\,P^*(A).
\]
For any $b \in \mathbb{R}$, let $b^- := b - b^+$ and let
\[
J(n,\kappa_n) := \big(g^{2,1}_{1,1}(n)(1-\sigma^-) - I(n,\kappa_n)\big)^+.
\]
So that clearly $\big(g^{2,1}_{1,1}(n)(1-\sigma) - I(n,\kappa_n)\big)^+ \le J(n,\kappa_n)$; combined with lemma C.3 this yields:
\[
Z_{n,\kappa_n}(A) \le g^{1,0}_{0,0}(n)(n-\kappa_n\sigma)J(n,\kappa_n)\tilde P_{n,\kappa_n}(A) + g^{1,1}_{0,0}(n)J(n,\kappa_n)P^*(A) = J(n,\kappa_n)E[\tilde p(A)\mid X^{(n)}].
\]
Applying this gives rise to the following expression for the variance:
\[
\mathrm{Var}[\tilde p(A)\mid X^{(n)}] \le I(n,\kappa_n)E[\tilde p(A)\mid X^{(n)}]\big(1 - E[\tilde p(A)\mid X^{(n)}]\big) + J(n,\kappa_n)E[\tilde p(A)\mid X^{(n)}].
\]
Note that $E[\tilde p(A)\mid X^{(n)}] \in [0,1]$ and that for any $x \in [0,1]$ it holds that $x(1-x) \le 1$. We derive the following upper bound,
\[
\mathrm{Var}[\tilde p(A)\mid X^{(n)}] \le I(n,\kappa_n) + J(n,\kappa_n)E[\tilde p(A)\mid X^{(n)}].
\]

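To get a feeling for the size of this bound, here is a small numerical sketch (illustration only; it once more assumes the Dirichlet-process weights $V_{n,k} = \theta^k/(\theta)_n$ with $\sigma = 0$, for which one expects $I(n,\kappa_n) = 1/(\theta+n+1)$ and $J(n,\kappa_n) = 0$, so the bound on the posterior variance vanishes as $n \to \infty$).

import math

def log_V(n, k, theta):          # Dirichlet-process weights V_{n,k} = theta^k / (theta)_n (assumption)
    return k * math.log(theta) - (math.lgamma(theta + n) - math.lgamma(theta))

def g(a, b, c, d, n, kappa, theta):
    return math.exp(log_V(n + a, kappa + b, theta) - log_V(n + c, kappa + d, theta))

theta = 2.5
for n in (10, 100, 1000, 10000):
    kappa = max(1, n // 5)                                   # a nominal number of distinct values
    I = 1 - g(2, 1, 1, 1, n, kappa, theta) * g(0, 0, 1, 0, n, kappa, theta)   # (3.12)
    J = max(g(2, 1, 1, 1, n, kappa, theta) - I, 0.0)         # (1 - sigma^-) = 1 since sigma = 0
    print(n, I, 1 / (theta + n + 1), J)
# I(n, kappa_n) matches 1/(theta+n+1) and J(n, kappa_n) is numerically zero,
# so the upper bound on the posterior variance tends to 0 at rate 1/n.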