
MSc Mathematics

Master Thesis

A Diffusion Approach to Posterior Contraction Rates

Author: Geert Doek
Supervisor: prof. dr. J.H. van Zanten
Examination date: August 17, 2020


Abstract

We study the behavior of the posterior distribution in Bayesian statistics from a frequentist point of view. In particular, we want to know under what conditions and at which speed the posterior concentrates around the true parameter, when the amount of data increases relative to the dimension of the statistical model. We investigate a new method of establishing posterior contraction rates. By approximating the posterior by a diffusion process, it will be shown that the posterior contracts at rate √(d/n), where d is the dimension of the parameter and n is the sample size. The conditions and results will be compared to the classical Ghosal-Ghosh-Van der Vaart approach, which relies on the existence of tests with exponentially small error probabilities. Several examples will be studied in detail.

Title: A Diffusion Approach to Posterior Contraction Rates
Author: Geert Doek, 12418994@student.uva.nl, 12418994
Supervisor: prof. dr. J.H. van Zanten (VU)
Second Examiner: prof. dr. P.J.C. Spreij (UvA)
Examination date: August 17, 2020

Korteweg-de Vries Institute for Mathematics University of Amsterdam

Science Park 105-107, 1098 XG Amsterdam http://kdvi.uva.nl


Acknowledgements

The completion of this thesis concludes two years of study towards the master’s degree in mathematics. The past six months of doing research and writing this thesis have been challenging, but I have learned a lot from the process. I would like to thank my supervisor Harry van Zanten for his support during this project. He helped me choose a topic and provided many useful suggestions for my research. I also thank Peter Spreij, who was willing to be the second examiner.


Contents

Introduction

1. Langevin diffusions
1.1. Feller processes, generators and stationarity
1.2. Diffusions as Feller processes
1.3. Convergence of Markov processes
1.4. The Langevin diffusion
1.5. Convergence of the Langevin diffusion
1.6. The Ornstein-Uhlenbeck process

2. Posterior contraction rates: a testing approach
2.1. Basic statistical notions
2.2. A general contraction rate theorem
2.3. The finite-dimensional case

3. Posterior contraction rates: a diffusion approach
3.1. The main result
3.2. Comparison of the two approaches

4. Examples
4.1. Linear regression
4.2. Infinite-dimensional regression
4.3. Logistic regression

5. Conclusion

A. Statistical distances
B. Probability and expectation bounds


Introduction

Traditionally, there have been two schools of thought in statistics, namely the frequentist view and the Bayesian view. The frequentist assumes the existence of a true underlying distribution for the data. Since this distribution is unknown, one tries to estimate this distribution from the data as accurately as possible. In contrast, a Bayesian does not assume the existence of one true distribution, but instead models the uncertainty about the data distribution with another probability distribution, known as the prior distribution. Conditional on the data, the Bayesian then updates his or her belief about the data distribution. In recent years Bayesian inference has gained popularity because of an increase in available computing power, which has made it possible to sample from complicated posterior distributions.

While a frequentist has to choose a method of estimation, a Bayesian only chooses a prior distribution. In the true Bayesian context, the prior distribution models the prior belief, and can thus be based on an expert opinion, for instance. One can however also use Bayesian methods in the frequentist context. Assuming the data comes from a fixed underlying distribution, one can still choose a prior and thus find a posterior distribution. The posterior can then be used to obtain an estimator of the true distribution, for instance the posterior mean. If the prior is chosen in a suitable way, we can expect the posterior to concentrate around the true data distribution, in which case the estimator will be accurate. This is indeed the viewpoint that we take in this thesis. The goal of this thesis is to quantify the convergence of the posterior as the amount of data increases. It turns out that if the statistical model and prior distribution satisfy certain conditions, then the posterior converges at a rate of order √(d/n), where d is the dimension of the statistical model and n is the sample size. All of this will be made rigorous in the sequel. Some early authors who considered convergence of the posterior are Doob [1] and Schwartz [18]. They formulated conditions under which the posterior is consistent, which means that the posterior will eventually concentrate almost all its mass in arbitrarily small neighborhoods of the true parameter. Posterior contraction rates are more informative than posterior consistency, since they characterize the speed at which the posterior converges. A posterior is said to contract at rate ε_n if

Π_n(P : d(P, P_0) ≥ M_n ε_n | X_1, …, X_n) → 0 in P_0^n-probability,

for all M_n → ∞. Traditionally, posterior contraction rates have been established by

methods that depend on the existence of tests. This is often guaranteed by an entropy condition on the parameter space, see [3] and [7]. These results are thus also applicable in non-parametric models. Another requirement is that sufficient prior mass needs to be assigned to a neighborhood of the truth, where the neighborhood is defined in terms

of the Kullback-Leibler divergence −P_0(log p/p_0) and the second moment P_0(log p/p_0)². In parametric models, we can often take ε_n of the order 1/√n. An even stronger result that characterizes the convergence of the posterior is the Bernstein-von Mises theorem. A version of this theorem can be found in [20]. It states that if, apart from the existence of consistent tests, the model satisfies some smoothness conditions, then the posterior of √n(θ − θ_0) converges in total variation distance to a Gaussian distribution. In particular, this implies that asymptotically the posterior does not depend on the prior.

In [11], Mou, Ho et al. suggest analyzing the convergence of the posterior distribution by studying a stochastic process that has the posterior as its stationary distribution. If π is a prior density on some parameter space Θ ⊆ R^d and, given θ ∈ Θ, the data X_1, …, X_n are i.i.d. with density p_θ, then Bayes' rule asserts that the posterior density satisfies

π(θ | X_1, …, X_n) ∝ ∏_{i=1}^n p_θ(X_i) π(θ).

The posterior distribution will be approximated through a stochastic process, known as the Langevin diffusion, which is the solution of the stochastic differential equation (SDE)

dY_t = −∇U(Y_t) dt + √(2/β) dW_t.

Under suitable conditions on U, the law of Y_t converges to the distribution with density proportional to exp(−βU). Treating the data X_1, …, X_n as fixed, it follows that the solution to

dθ_t = (1/(2n)) Σ_{i=1}^n ∇ log p_{θ_t}(X_i) dt + (1/(2n)) ∇ log π(θ_t) dt + (1/√n) dW_t

has the posterior as stationary distribution. Assume that θ_0 is the true underlying parameter of the statistical model. By controlling the second moment of (θ_t) as t → ∞, we find a constant C > 0, depending on the problem parameters, such that for all δ ∈ (0, 1),

Π( ||θ − θ_0|| ≥ C √(d/(δn)) | X_1, …, X_n ) < δ

with P_{θ_0}^n-probability at least 1 − δ.

The Langevin diffusion process was first studied by physicists, as it models the movement of a particle through space under random forces, which are modeled by Brownian motion. The fact that this process in equilibrium can approximate a given probability distribution attracted the interest of the statistics community. Markov Chain Monte Carlo (MCMC) techniques are used to sample from complicated distributions, for instance posterior distributions in Bayesian statistics. These techniques rely on the construction of chains that quickly converge to a certain distribution. The Langevin diffusion (and its discretizations) can thus be used in MCMC algorithms (see [2], [15]). Often, a numerical scheme for solving SDEs is combined with a Metropolis-Hastings step, where a new proposal from the transition distribution can be rejected. The application of the Langevin diffusion to study posterior convergence seems to be a novel idea, introduced in [11].
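To make the sampling connection concrete, the following minimal Python sketch (my own illustration, not code from the thesis or from [11]) runs the unadjusted Langevin algorithm, i.e. an Euler-Maruyama discretization of a Langevin diffusion targeting the posterior, without a Metropolis-Hastings correction, for a toy normal location model with a standard normal prior.

import numpy as np

# Toy model (assumed for illustration): X_i | theta ~ N(theta, 1), prior theta ~ N(0, 1).
rng = np.random.default_rng(0)
n, theta_true = 200, 1.5
X = rng.normal(theta_true, 1.0, size=n)

def grad_log_post(theta):
    # sum_i grad log p_theta(X_i) + grad log pi(theta)
    return np.sum(X - theta) - theta

# Unadjusted Langevin: theta_{k+1} = theta_k + (h/2) * grad + sqrt(h) * Z_k
h, iters = 1e-3, 20000
theta, samples = 0.0, []
for _ in range(iters):
    theta += 0.5 * h * grad_log_post(theta) + np.sqrt(h) * rng.normal()
    samples.append(theta)

# The exact posterior here is N(sum(X)/(n+1), 1/(n+1)); compare with the chain.
print(np.mean(samples[iters // 2:]), X.sum() / (n + 1))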

In this thesis, we will show in detail how the diffusion and testing approaches to deriving posterior contraction rates work. We will also compare the conditions and final statements of both methods, and illustrate them through a number of examples.

We imagine our reader as a master's student in mathematics, specializing in stochastics. Therefore, we assume that the reader is familiar with measure-theoretic probability and stochastic processes in continuous time. It is also helpful to have taken a few statistics courses, covering some asymptotic theory and Bayesian statistics. In general, I have included proofs of results that I, as a second-year master's student, have not yet encountered in one of my courses. However, some complicated proofs that rely on techniques that are very different from the main story are omitted.

The outline of this thesis is as follows. In Chapter 1, we will carefully study the Langevin diffusion and the convergence to its stationary distribution. Next, in Chapter 2 we use a testing approach to establish a posterior contraction rate, while keeping track of the dimension d. In Chapter 3, we find the posterior contraction rate using the convergence of the Langevin diffusion, and we compare the result with that of Chapter 2. Finally, in Chapter 4 we study a number of examples and illustrate the results.


1. Langevin diffusions

In this chapter, we study the convergence of the Langevin diffusion to its stationary distribution. Although this convergence is a key result in the analysis of the posterior distribution, this chapter does not involve any statistics. We assume that the reader is familiar with the most basic definitions of Markov processes on general state spaces, both in discrete- and continuous time. We will introduce the minimal theory on Feller processes that we need to discuss stationarity. Section 1.1 lays the foundation for this chapter. In Section 1.2 we show that diffusion processes fit in this framework. In Section 1.3, the convergence of a general Markov process is studied, and this can be read independently of Section 1.2. The Langevin diffusion is introduced in Section 1.4. Section 1.5 combines the results of Sections 1.3 and 1.4 into a convergence result for Langevin diffusions. Finally, in Section 1.6 we study a special case, to illustrate the general results.

1.1. Feller processes, generators and stationarity

In this section, we provide the definitions that are necessary to discuss stationarity. Let (Ω, F, P) be a probability space and F = (F_t)_{t≥0} a filtration that satisfies the usual conditions. Let X = (X_t)_{t≥0} be a Markov process (w.r.t. the filtered space (Ω, F, F, P)) on some state space (E, E) with Markov semi-group or transition function (P_t)_{t≥0}. For our purposes, we can assume that E is a subset of Euclidean space R^d and E is the restriction of the Borel σ-algebra.

Let C_b = C_b(E) denote the Banach space of bounded continuous functions on E, endowed with the supremum norm. Throughout this chapter, ||·|| always denotes the supremum norm. Let C_0 = C_0(E) and C_c = C_c(E) denote the subspaces of continuous functions that vanish at infinity, and continuous functions with compact support, respectively. Note that C_0 is the closure of C_c in C_b. For each t ≥ 0, P_t acts on bounded measurable functions through

P_t f(x) := ∫ f(y) P_t(x, dy).

In fact, we can say more.

Proposition 1.1. If P_t f is continuous for every f ∈ C_b, then P_t is a contraction on C_b.

Proposition 1.2. For all f ∈ C_b, P_0 f = f.

The proofs follow directly from the definitions. We need some more regularity on the Markov semi-group.


Definition 1.3. (P_t) is called a Feller transition function if P_t C_0 ⊆ C_0 for all t ≥ 0, and for all f ∈ C_0 it holds that

lim_{t↓0} ||P_t f − f|| = 0.

We will assume throughout this section that (P_t) is a Feller transition function, i.e. X is a Feller process. The following definitions and results consider the generator of (P_t), which gives an alternative characterization of the behaviour of X.

Definition 1.4. Let D(L) be the set of all functions f ∈ C_0(E) for which the limit

Lf := lim_{t↓0} (P_t f − f)/t

exists in C_0. The operator L : D(L) → C_0 is called the (infinitesimal) generator of X.

Thus, the generator describes the instantaneous change of the transition probabilities at time zero. In the definition above, point-wise existence of this limit is already sufficient for a function to be in the domain of the generator, as the next proposition shows.

Proposition 1.5.

D(L) = { f ∈ C_0 : there exists g ∈ C_0 such that for all x ∈ E, lim_{t↓0} (P_t f(x) − f(x))/t = g(x) }.

Proof. A proof can be found in Theorem 7.19 of [17].

We need some control over the domain of the generator.

Proposition 1.6. For every t ≥ 0,

P_t D(L) ⊆ D(L).

Proof. Let f ∈ D(L), then Lf ∈ C_0. We will show that P_t f ∈ D(L) and L P_t f = P_t Lf. Let s > 0, then using that P_s P_t = P_{s+t} = P_t P_s and that P_t is a contraction, we have

|| (P_s P_t f − P_t f)/s − P_t Lf || = || (P_t P_s f − P_t f)/s − P_t Lf || = || P_t( (P_s f − f)/s − Lf ) || ≤ || (P_s f − f)/s − Lf ||.

This quantity can be made arbitrarily small by taking s small enough.

The semi-group property (i.e. P_{t+s} = P_t P_s) also allows one to prove the following theorems, which describe the instantaneous change of the transition probabilities at every time.


Theorem 1.7 (Kolmogorov backward equation). For f ∈ D(L) we have

(d/dt) P_t f(x) = L P_t f(x).

Proof. By Proposition 1.6, the expression on the right-hand side is well-defined, i.e. P_t f ∈ D(L). We have

(d/dt) P_t f(x) = (d/ds) P_{t+s} f(x) |_{s=0} = (d/ds) P_s P_t f(x) |_{s=0} = L P_t f(x).

Theorem 1.8 (Kolmogorov forward equation). For f ∈ D(L) we have

(d/dt) P_t f(x) = P_t Lf(x).

Proof.

lim_{s↓0} | (P_{t+s} f(x) − P_t f(x))/s − P_t Lf(x) | ≤ lim_{s↓0} || P_t( (P_s f − f)/s − Lf ) || ≤ lim_{s↓0} || (P_s f − f)/s − Lf || = 0.

The generator will turn out to be a useful tool in the study of stationary distributions. The kernel P_t acts on the space of probability measures through

(P_t μ)(B) := ∫ P_t(x, B) dμ(x).

For a given probability measure μ and bounded measurable function f, we also define

μ(f) := ∫ f(x) dμ(x).

Note that P_t μ(B) = μ(P_t 1_B). A distribution μ is stationary for X if, when X is distributed according to μ at some time t, X_s ∼ μ for all future times s > t. This is formalized in the next definition.

Definition 1.9. A probability measure μ is called a stationary distribution if P_t μ = μ for all t ≥ 0.


Since every bounded measurable function is the increasing point-wise limit of a sequence of linear combinations of indicator functions, stationarity of μ also implies that

μ(P_t f) = μ(f)

for all f ∈ C_b. It follows that

μ(Lf) = μ( lim_{t↓0} (P_t f − f)/t ) = lim_{t↓0} (1/t) μ(P_t f − f) = 0

for f ∈ D(L), where we apply the Dominated Convergence Theorem (use |Lf| + 1 as the dominating function). A converse statement is also true, in some cases. To prove this, it would be useful if Lf_n → Lf whenever f_n → f. Unfortunately, this is in general not the case, as L is an unbounded operator. In a special case however, a similar result holds, but under a different norm (we will see this later).

For p ≥ 1, let L^p(μ) be the space of measurable functions f : E → R for which

∫ |f|^p dμ < ∞.

Note that we can view L as an (unbounded) operator on L^p(μ), as C_0 ⊆ L^p(μ).

Definition 1.10. A subspace W ⊆ D(L) is a core of L in L^p(μ) (p ≥ 1) if W is dense in D(L) under the graph norm ||f|| := ||f||_p + ||Lf||_p.

We also state an approximation result. Let C_c^2 = C_c^2(E) be the space of twice continuously differentiable functions with compact support.

Lemma 1.11. C_c^2 is dense in C_0.

Proof. This follows from the locally compact version of the Stone-Weierstrass theorem.

Theorem 1.12. Suppose that μ(Lf) = 0 for all f ∈ C_c^2 and that C_c^2 is a core of L in L^p(μ). Then μ is stationary.

Proof. In this proof, all limits should be interpreted in a point-wise sense. Let f ∈ C_c^2 and t ≥ 0, then by Proposition 1.6, P_t f ∈ D(L). Since C_c^2 is a core, there exists a sequence (h_n) ⊂ C_c^2 such that ||h_n − P_t f||_p + ||Lh_n − L P_t f||_p → 0. In particular, we can take a subsequence (again denoted by (h_n)) such that Lh_n → L P_t f almost everywhere. The Kolmogorov backward equation (Theorem 1.7) and two applications of the Dominated Convergence Theorem imply that

(d/dt) μ(P_t f) = μ( (d/dt) P_t f ) = μ(L P_t f) = μ( lim_{n→∞} Lh_n ) = lim_{n→∞} μ(Lh_n) = 0.

Interchanging the derivative and the measure is justified, since (d/dt) P_t f = L P_t f ∈ C_0 is bounded; for the Dominated Convergence Theorem, use |L P_t f| + 1 as dominating function. Since t ≥ 0 is arbitrary, we thus get μ(P_t f) = μ(P_0 f) = μ(f) by a corollary of the Mean Value Theorem. Now let B be a measurable set. Without loss of generality (Dynkin's Lemma), we can assume B is a rectangle. We will approximate 1_B by a sequence of functions (f_n) ⊂ C_0, defined as

f_n(x) = 1 if x ∈ B;  f_n(x) = 1 − n d(x, B) if 0 < d(x, B) < n^{-1};  f_n(x) = 0 otherwise.

It is clear that f_n → 1_B. Also, by Lemma 1.11 we can find g_n ∈ C_c^2 such that ||f_n − g_n|| ≤ 2^{-n}, from which it follows that g_n → 1_B. Applying the Dominated Convergence Theorem several times, we get

P_t μ(B) = μ(P_t 1_B) = μ(P_t lim_{n→∞} g_n) = μ( lim_{n→∞} P_t g_n ) = lim_{n→∞} μ(P_t g_n) = lim_{n→∞} μ(g_n) = μ( lim_{n→∞} g_n ) = μ(B).

1.2. Diffusions as Feller processes

In this section, we assume familiarity with the basics of stochastic calculus, in particular Ito's formula. Let b : R^d → R^d and σ : R^d → R^{d×m} be Lipschitz continuous functions. Also, let W be an m-dimensional Brownian motion on some filtered probability space (Ω, F, F, P). Then for every F_0-measurable random vector ξ with E||ξ||_2^2 < ∞, the SDE

dX_t = b(X_t) dt + σ(X_t) dW_t,   (1.1)

admits a unique strong solution X = (X_t)_{t≥0} with X_0 = ξ. See for instance Theorem 2.9 in [6]. In particular, X is a continuous, progressively measurable, and square-integrable R^d-valued stochastic process.

Proposition 1.13. The unique solution of (1.1) is a time-homogeneous Markov process.

Proof. For the proof, we refer to Theorem 2.9.1 of [10].

Let (P_t) be the Markov semi-group of X. We can now show that a diffusion fits into the framework of the previous section.

Proposition 1.14. (P_t) is Feller.

Proof. We first show that x ↦ P_t f(x) is continuous, for fixed t ≥ 0 and f ∈ C_0(R^d). Let X^x be the solution to (1.1) with initial condition x ∈ R^d. Using that ||Σ_{i=1}^n z_i||² ≤ n Σ_{i=1}^n ||z_i||², we get

E||X_t^x − X_t^y||² = E|| x − y + ∫_0^t (b(X_s^x) − b(X_s^y)) ds + ∫_0^t (σ(X_s^x) − σ(X_s^y)) dW_s ||²
≤ 3||x − y||² + 3E|| ∫_0^t (b(X_s^x) − b(X_s^y)) ds ||² + 3E|| ∫_0^t (σ(X_s^x) − σ(X_s^y)) dW_s ||²
≤ 3||x − y||² + 3t E ∫_0^t ||b(X_s^x) − b(X_s^y)||² ds + 3E ∫_0^t ||σ(X_s^x) − σ(X_s^y)||² ds
≤ 3||x − y||² + 3L²(1 + t) ∫_0^t E||X_s^x − X_s^y||² ds,

where L > 0 is the Lipschitz constant of the coefficients b and σ. Here we applied the Cauchy-Schwarz inequality on L²[0, t] to get the square inside the Lebesgue integral and used the Ito isometry for the stochastic integral. Then Gronwall's inequality (Theorem 1.8.1 in [10]) shows that

E||X_t^x − X_t^y||² ≤ C||x − y||²,

for some constant C > 0 depending only on L and t. Fixing x, it follows that lim_{y→x} E||X_t^x − X_t^y||² = 0, so in particular X_t^y → X_t^x in probability. Since f is continuous, also f(X_t^y) → f(X_t^x) in probability. The collection {f(X_t^y) : ||y − x|| ≤ 1} is uniformly bounded, hence also uniformly integrable. It follows that

lim_{y→x} |P_t f(y) − P_t f(x)| ≤ lim_{y→x} E|f(X_t^y) − f(X_t^x)| = 0.

Next, we show that P_t f vanishes at infinity. By Corollary 18.21 in [17], ||X_t^x|| → ∞ almost surely when |x| → ∞. Since f ∈ C_0, it follows that f(X_t^x) → 0 almost surely. Using the Dominated Convergence Theorem, we thus get

lim_{|x|→∞} P_t f(x) = lim_{|x|→∞} E f(X_t^x) = E lim_{|x|→∞} f(X_t^x) = 0.

We conclude that P_t f ∈ C_0 if f ∈ C_0. Now take f ∈ C_0 and (f_n) ⊂ C_c^2 such that ||f_n − f|| → 0 (use Lemma 1.11). Then, since P_t is a contraction, we have

||P_t f − f|| = ||P_t f − P_t f_n + P_t f_n − f_n + f_n − f|| ≤ ||P_t(f − f_n)|| + ||P_t f_n − f_n|| + ||f_n − f|| ≤ 2||f_n − f|| + ||P_t f_n − f_n||.

Using Ito's formula and taking expectations (as in the proof of Theorem 1.15 below), we get

|P_t f_n(x) − f_n(x)| = | E ∫_0^t ( ∇f_n(X_s^x)^⊤ b(X_s^x) + (1/2) Σ_{i=1}^d Σ_{j=1}^d a_{ij}(X_s^x) D²_{ij} f_n(X_s^x) ) ds | ≤ E ∫_0^t ( ||∇f_n(X_s^x)|| ||b(X_s^x)|| + (1/2) Σ_{i=1}^d Σ_{j=1}^d |a_{ij}(X_s^x)| |D²_{ij} f_n(X_s^x)| ) ds ≤ Ct,

for some large constant C > 0, since f_n and its derivatives are continuous with compact support, hence all terms are bounded. This implies that lim_{t↓0} ||P_t f_n − f_n|| = 0. We conclude that X is a Feller process.

We are interested in the generator of X. Let a(x) = σ(x)σ(x)^⊤ denote the diffusion matrix of X.

Theorem 1.15. C_c^2 ⊆ D(L) and for f ∈ C_c^2 we have

Lf = ∇f^⊤ b + (1/2) Σ_{i=1}^d Σ_{j=1}^d a_{ij} D²_{ij} f.

Proof. Let f ∈ C_c^2, then by Ito's formula, we have

f(X_t^x) − f(x) = ∫_0^t ( ∇f(X_s^x)^⊤ b(X_s^x) + (1/2) Σ_{i=1}^d Σ_{j=1}^d a_{ij}(X_s^x) D²_{ij} f(X_s^x) ) ds + ∫_0^t ∇f(X_s^x)^⊤ σ(X_s^x) dW_s.   (1.2)

Since ∇f has compact support and ∇f and a are continuous, we have

E ∫_0^t ∇f(X_s^x)^⊤ a(X_s^x) ∇f(X_s^x) ds ≤ E ∫_0^t C ds < ∞,

therefore the stochastic integral in (1.2) is a martingale, so its expectation vanishes. Using Fubini's Theorem, we thus get for x ∈ R^d

P_t f(x) − f(x) = ∫_0^t E( ∇f(X_s^x)^⊤ b(X_s^x) + (1/2) Σ_{i=1}^d Σ_{j=1}^d a_{ij}(X_s^x) D²_{ij} f(X_s^x) ) ds.

Note that the integrand is continuous in s, since X^x has continuous paths, and the Dominated Convergence Theorem applies. The Fundamental Theorem of Calculus thus implies

lim_{t↓0} (1/t)(P_t f(x) − f(x)) = ∇f(x)^⊤ b(x) + (1/2) Σ_{i=1}^d Σ_{j=1}^d a_{ij}(x) D²_{ij} f(x).

Also, ∇f^⊤ b + (1/2) Σ Σ a_{ij} D²_{ij} f ∈ C_c ⊂ C_0. Therefore, the result follows from Proposition 1.5.


1.3. Convergence of Markov processes

As said before, a Markov process started in its stationary distribution will remain stationary. In this section, we study conditions under which a (discrete-time) Markov process reaches its stationary distribution. Note that we can view continuous-time Markov processes as discrete-time Markov processes by observing the process at equally spaced points in time. Throughout this section, let X = (X_n)_{n∈N} be a Markov chain on E ⊆ R^d with transition kernel P.

We state a version of Harris’ theorem, as given in [5].

Theorem 1.16 (Harris’ Theorem). Assume there exists V : E → [0, ∞), K ≥ 0 and γ ∈ (0, 1) such that

(P V )(x) ≤ γV (x) + K (1.3) for all x ∈ E. Also assume that there exist α ∈ (0, 1) and a probability measure ν such that

inf

{x∈E:V (x)≤R}P (x, ·) ≥ αν(·), (1.4)

for some R > 2K(1 − γ). Then P admits a unique invariant measure µ. Define the weighted supremum norm on measurable functions f : E → R

||f ||V = sup

x∈E

|f (x)| 1 + V (x). Then there exist constants C > 0 and ρ ∈ (0, 1) such that

||Pnf − µ(f )||V ≤ Cρn||f − µ(f )||V

for every measurable function f with ||f ||V < ∞.

Condition (1.3) is called a geometric drift condition and (1.4) is called a minorization condition. The full proof can be found in [5]. The main step is to show that the operator P is a strict contraction with respect to some norm that is equivalent to ||·||_V. The proof of this fact uses condition (1.3) for the case V(x) + V(y) ≥ R and (1.4) for the case V(x) + V(y) ≤ R. In this thesis, we will need a slightly different mode of convergence. For an introduction to total variation distance, we refer to Appendix A.
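As a toy illustration of the geometric drift condition (1.3) (my own example, not taken from [5]), consider the AR(1) chain X_{n+1} = γX_n + ξ_{n+1} with standard normal innovations and Lyapunov function V(x) = x², for which (PV)(x) = γ²x² + 1, so (1.3) holds with constants γ² and K = 1. The Python sketch below checks this numerically.

import numpy as np

rng = np.random.default_rng(1)
gamma = 0.8            # contraction factor of the AR(1) chain

def V(x):
    # Lyapunov function
    return x ** 2

def PV(x, n_mc=100_000):
    # Monte Carlo estimate of E[V(gamma * x + xi)] with xi ~ N(0, 1)
    return np.mean(V(gamma * x + rng.normal(size=n_mc)))

for x in [0.0, 2.0, 10.0]:
    print(x, PV(x), gamma ** 2 * V(x) + 1.0)   # estimate vs the exact value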

Corollary 1.17. Let the assumptions of Harris' Theorem hold. Suppose X has a stationary distribution Π with Lebesgue density π, and P^n(x, ·) has Lebesgue density p_n(x, ·), for all x ∈ E and n ∈ N. Then for all x_0 ∈ E there exists an increasing sequence (t_k) such that for almost every y ∈ R^d

lim_{k→∞} p_{t_k}(x_0, y) = π(y).

Proof. We apply the conclusion of Harris' theorem to indicator functions f = 1_B. Clearly, ||1_B||_V ≤ ||1_B|| ≤ 1 < ∞, therefore

d_TV(P^n(x_0, ·), Π) = 2 sup_B |P^n(x_0, B) − Π(B)| = 2(1 + V(x_0)) sup_B |P^n(x_0, B) − Π(B)| / (1 + V(x_0)) ≤ 2(1 + V(x_0)) sup_B ||P^n 1_B − Π(B)||_V ≤ 2(1 + V(x_0)) C ρ^n sup_B ||1_B − Π(B)||_V ≤ 2 C ρ^n (1 + V(x_0)) sup_B ||1_B − Π(B)|| ≤ 4 C ρ^n (1 + V(x_0)).

Clearly, the last expression goes to 0 as n → ∞. Since the total variation distance equals the L¹-distance of densities (Proposition A.5),

lim_{n→∞} ∫ |p_n(x_0, y) − π(y)| dy = lim_{n→∞} d_TV(P^n(x_0, ·), Π) = 0.

The result now follows, since L¹-convergence (on a σ-finite measure space) implies almost everywhere convergence of a subsequence.

We note that the almost everywhere convergence of densities is in general stronger than convergence in distribution.

1.4. The Langevin diffusion

The goal of this section is to find the stationary distribution of the Langevin diffusion.

Assumption 1.18. Let U : R^d → R be a smooth function with a Lipschitz continuous gradient. That is, there exists a universal constant L > 0 such that for all x, y ∈ R^d

||∇U(x) − ∇U(y)|| ≤ L||x − y||.   (1.5)

Also, assume that

∫ e^{−U(x)} dx < ∞.   (1.6)

Definition 1.19. The Langevin diffusion with potential function U is the unique strong solution to the SDE

dX_t = −∇U(X_t) dt + √2 dW_t,   (1.7)

where W is a d-dimensional Brownian motion.


Corollary 1.20. C_c^2 ⊆ D(L) and for all f ∈ C_c^2 we have

Lf = −∇f^⊤ ∇U + Δf,

where Δ = ∂²/∂x_1² + … + ∂²/∂x_d² is the Laplacian.

Now Theorem 1.12 gives the following result for the Langevin diffusion.

Theorem 1.21. Let µ be the distribution with density proportional to exp(−U ). Then µ is a stationary distribution of X.

Proof. From Theorem 8.1.26 in [9], it follows that C_c^2 is a core in L^p(μ) of the generator of the Langevin diffusion, for any p ≥ 1. Therefore, by Theorem 1.12 it suffices to check that μ(Lf) = 0 for all f ∈ C_c^2. The result now follows from integration by parts and Fubini's theorem. Up to the normalizing constant,

μ(Lf) = ∫ ( −∇f(x)^⊤ ∇U(x) + Δf(x) ) e^{−U(x)} dx
= ∫ … ∫ Σ_{i=1}^d ( −(∂f(x)/∂x_i)(∂U(x)/∂x_i) e^{−U(x)} + (∂²f(x)/∂x_i²) e^{−U(x)} ) dx_1 … dx_d
= Σ_{i=1}^d ∫ … ∫ ( ∫ (∂/∂x_i)( (∂f(x)/∂x_i) e^{−U(x)} ) dx_i ) dx_1 … dx_d = 0.

The tails of the inner integrals vanish because f and its derivatives have compact support.

1.5. Convergence of the Langevin diffusion

In this section, we study conditions under which the Langevin diffusion (1.7) converges to its stationary distribution. Apart from the Lipschitz condition (1.5), we need the following lower and upper bounds on the growth of U to guarantee convergence.

Assumption 1.22. There exist constants k, c, C > 0 such that for sufficiently large x we have

⟨x, ∇U(x)⟩ ≥ c||x||^{2k}   (1.8)

and

|ΔU(x)| ≤ C||x||^{2k−2}.   (1.9)

Using the Fundamental Theorem of Calculus and (1.8), one shows that U(x) ≥ c||x||^{2k}, for some small constant c > 0 and all large enough x. This implies (1.6), so the target distribution is well-defined. The following result guarantees the existence of transition densities. It is a special case of Hörmander's Theorem, the proof of which is very involved and relies heavily on the Malliavin calculus.

Proposition 1.23. If U ∈ C^∞(R^d), then for every t > 0 and x ∈ R^d, the law of X_t^x admits a smooth density y ↦ p_t(x, y).


Proof. We use the notation of Hörmander's theorem as presented in Theorem 2.3.2 in [12]. The vector fields A_0, A_1, …, A_d : R^d → R^d become A_0(x) = −∇U(x) and A_j(x) = √2 e_j for j = 1, …, d, where e_j is the jth standard unit vector. (Note that the Ito and Stratonovich integrals coincide because the diffusion coefficient is constant.) Therefore, for any x ∈ R^d the vectors A_1(x), …, A_d(x) span R^d, so Hörmander's condition holds. The boundedness of the partial derivatives of A_0, A_1, …, A_d follows since ∇U is Lipschitz. Theorem 2.3.2 in [12] then guarantees the existence of a density. The smoothness of the densities follows from Theorem 2.3.3 in [12].

We are now ready to state the main result of this chapter.

Theorem 1.24. Suppose U ∈ C^∞(R^d) and Assumptions 1.18 and 1.22 hold. Let y ↦ p_t(x, y) be the transition density of X_t^x, and set Z := ∫ exp(−U). Then for all x ∈ R^d there exists an increasing sequence 0 = t_0 < t_1 < … such that p_{t_k}(x, ·) → Z^{−1} exp(−U) almost everywhere as k → ∞.

Proof. By Theorem 1.21, Z^{−1} exp(−U) is a stationary density of the Langevin diffusion, so also of its discretization (X_n)_{n∈N}. Let P be the transition kernel of (X_n). We check the conditions of Harris' Theorem. Let V := exp(βU) for some β > 0, then Corollary 1.20 implies

LV = β(ΔU + (β − 1)||∇U||²)V.

By the Cauchy-Schwarz inequality and (1.8),

||x|| ||∇U(x)|| ≥ ⟨x, ∇U(x)⟩ ≥ c||x||^{2k},

so ||∇U(x)||² ≥ c²||x||^{4k−2} for large x. Furthermore, because of (1.9), we have |ΔU(x)| ≤ C||x||^{2k−2}, which is thus of smaller order as ||x|| → ∞. Therefore, taking β ∈ (0, 1) small enough, we see that LV ≤ −CV for some constant C > 0, for large enough x. It follows that there exists a constant D > 0 such that LV ≤ −CV + D. By Kolmogorov's forward equation (Theorem 1.8),

(d/dt) P_t V(x) = P_t LV(x) ≤ P_t(−CV + D)(x) = −C P_t V(x) + D,

since P_t is monotone. Using this expression, we find that

(d/dt) ( e^{Ct} P_t V(x) ) ≤ D e^{Ct}.

Integrating this expression and rearranging terms, we get

P_t V(x) ≤ D/C + ( V(x) − D/C ) e^{−Ct}.

In particular, assumption (1.3) holds if we take γ = exp(−C) ∈ (0, 1) and K = D/C > 0. Next, we consider the minorization condition (1.4). In [4], lower bounds on the density of more general diffusion processes are proven. We mimic this approach.

Consider the stochastic process Y_t = x + √2 W_t and define Z to be the stochastic exponential of −∇U(Y)/√2, i.e.

Z_t = exp( −(1/√2) ∫_0^t ⟨∇U(Y_s), dW_s⟩ − (1/4) ∫_0^t ||∇U(Y_s)||² ds ).

Since ∇U is Lipschitz, we can use Benes' criterion (Corollary 3.5.16 in [6]) to show that Z is a martingale. On the finite interval [0, 1], the martingale Z = (Z_s)_{0≤s≤1} has a 'last' element Z_1, therefore Z is a uniformly integrable martingale. Define a probability measure Q by dQ/dP = Z_1. Then by Girsanov's theorem the process B defined by

B_t = W_t + (1/√2) ∫_0^t ∇U(Y_s) ds

is a Q-Brownian motion. It follows that

dY_t = −∇U(Y_t) dt + √2 dB_t

with Y_0 = x. In particular, Y is under Q a weak solution to the Langevin SDE. Since ∇U is continuous, the integral

∫_0^1 ||∇U(X_s)||² ds

is finite almost surely, and the same holds if we replace X with Y. Therefore, by Proposition 3.3.10 in [6], X and Y have the same law. Now let f be a bounded measurable function, then

∫ f(y) p(x, y) dy = E_Q f(Y_1) = E f(Y_1)Z_1 = E[ f(Y_1) E[Z_1 | Y_1] ] = ∫ f(y) E[Z_1 | Y_1 = y] (1/√2) φ( (y − x)/√2 ) dy,

where φ is the standard Gaussian density. It follows that the transition density of X satisfies

p(x, y) = E[Z_1 | Y_1 = y] (1/√2) φ( (y − x)/√2 ),

with equality for almost every y. By Jensen's inequality, it follows that

p(x, y) ≥ φ( (y − x)/√2 ) / ( √2 E[Z_1^{−1} | Y_1 = y] ).

It is shown in [4] that E[Z_1 | Y_1 = y] can be bounded from below by some strictly positive quantity that is continuous in x and y. Here the law of a diffusion bridge is needed. Since y ↦ p(x, y) is smooth, we conclude that p(x, y) ≥ f(x, y) for some positive continuous function f. We now construct a minorizing probability measure. Let C = {x ∈ R^d : V(x) ≤ R} be the compact set in condition (1.4), with R chosen big enough. For every y, set ε(y) = min_{x∈C} f(x, y). The continuity and positivity of f then imply ε(y) > 0. Since y ↦ f(x, y) is measurable, y ↦ ε(y) is measurable as well. Set B = ∫ ε(y) dy, then clearly 0 < B ≤ 1. Define the probability measure ν on R^d by setting its Lebesgue density equal to dν/dλ(y) = B^{−1} ε(y) and set α = B; then indeed the minorization condition (1.4) holds.

By Proposition 1.23 the transition kernels admit densities. The claim of the theorem thus follows from Corollary 1.17.

We have now proven the convergence of the Langevin diffusion, but under quite strict conditions. The potential function U should be smooth, have a (globally) Lipschitz gradient, and have polynomial growth. We do not claim that convergence results are impossible without these conditions, but we did explicitly use these conditions in our proofs.

1.6. The Ornstein-Uhlenbeck process

As an illustration, we consider the Ornstein-Uhlenbeck process. This process allows for explicit calculations. Moreover, it is also important in the statistical analysis later on, as its stationary distribution is the multivariate Gaussian distribution. The Ornstein-Uhlenbeck process is defined as the solution of the SDE

dX_t = −A(X_t − μ) dt + √2 dW_t,   (1.10)

where μ ∈ R^d is the long-term mean, and A is a positive definite symmetric matrix that determines the speed of reversion to the mean. Note that there are some slight variations in the literature in the definition of the Ornstein-Uhlenbeck process. Our definition is such that it fits nicely in the framework of the Langevin diffusion. Indeed, if we take U(x) = ½(x − μ)^⊤ A (x − μ), then X is the Langevin diffusion with potential function U.

There are several ways to find the stationary distribution. First of all, we will explicitly solve the SDE (1.10) using matrix exponentials, and find its limiting behaviour. Let Y_t = X_t − μ, then dY_t = −AY_t dt + √2 dW_t. Applying the Ito formula to e^{tA} Y_t, and using that (d/dt) e^{tA} = A e^{tA} = e^{tA} A, we get

d(e^{tA} Y_t) = e^{tA} dY_t + e^{tA} A Y_t dt = −e^{tA} A Y_t dt + √2 e^{tA} dW_t + e^{tA} A Y_t dt = √2 e^{tA} dW_t.

Integrating this expression, we get

X_t = μ + e^{−tA}(X_0 − μ) + √2 e^{−tA} ∫_0^t e^{sA} dW_s.   (1.11)

If X_0 is deterministic, then we know that X_t is normally distributed, since the integrand in the stochastic integral is deterministic. We get

E X_t = μ + e^{−tA}(X_0 − μ)

and, using Ito's isometry,

Cov(X_t) = 2 e^{−2tA} ∫_0^t e^{2sA} ds = e^{−2tA} A^{−1} (e^{2tA} − I) = A^{−1}(I − e^{−2tA}).

Here we used that e^{tA} is symmetric and invertible, and that A^k commutes with e^{tA} for all k ∈ Z. The set of eigenvalues of e^{−2tA} is {e^{−2λt} : λ is an eigenvalue of A}. Let λ_* > 0 be the smallest eigenvalue of A. Since the operator norm of a positive-definite symmetric matrix equals its largest eigenvalue, we have

||e^{−2tA}|| = e^{−2λ_* t} = α^t,

where 0 < α := e^{−2λ_*} < 1. It follows that for every x ∈ R^d we have 0 ≤ ||e^{−2tA} x|| ≤ α^t ||x||, so lim_{t→∞} e^{−2tA} x = 0. In particular, E X_t → μ and Cov(X_t) → A^{−1}. Since the density of the multivariate normal distribution is continuous in its parameters, the density of X_t converges point-wise to the density of N_d(μ, A^{−1}). Alternatively, we can also apply Theorem 1.21 and note that the distribution with density proportional to exp(−U) is N_d(μ, A^{−1}). One can also directly show that the normal distribution is stationary by taking X_0 ∼ N_d(μ, A^{−1}) independent of W. Then we observe from (1.11) that X_t is normally distributed with mean μ and covariance

Cov(X_t) = e^{−2tA} A^{−1} + A^{−1}(I − e^{−2tA}) = A^{−1}.

Again stationarity follows since the normal distribution is characterized by its mean and covariance. From Figure 1.1, we see that convergence indeed takes place.

Figure 1.1.: The left plot shows some paths of the 1-dimensional Ornstein-Uhlenbeck process with parameters µ = 1 and A = 1/4. On the right, a histogram of X(T ) for T = 50 is compared with the density of the N (1, 4)-distribution.
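A figure like Figure 1.1 can be reproduced with a short simulation; the following Python sketch (my own illustration, with step size and horizon chosen for convenience, not the code used for the thesis) applies the Euler-Maruyama scheme to (1.10) with μ = 1 and A = 1/4 and compares the empirical moments of X_T with those of the N(1, 4) stationary distribution.

import numpy as np

rng = np.random.default_rng(2)
mu, A = 1.0, 0.25            # parameters of Figure 1.1
dt, T, n_paths = 0.01, 50.0, 5000

X = np.zeros(n_paths)        # all paths started at 0
for _ in range(int(T / dt)):
    # Euler-Maruyama step for dX = -A (X - mu) dt + sqrt(2) dW
    X += -A * (X - mu) * dt + np.sqrt(2 * dt) * rng.normal(size=n_paths)

# The stationary law is N(mu, 1/A) = N(1, 4); compare empirical moments.
print(X.mean(), X.var(), mu, 1 / A)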


2. Posterior contraction rates: a testing approach

In this chapter we study the classical approach to establishing posterior convergence. The main goal is to formulate a result that can be used in sequences of statistical models on unbounded Euclidean parameter spaces with increasing dimension. In Section 2.1, we discuss some notions from (Bayesian) statistics that we need in this chapter. Next, in Section 2.2 we prove in detail a general contraction rate theorem, due to Ghosal, Ghosh and Van der Vaart [3]. Finally, in Section 2.3 we adapt this result to the statistical setting we are interested in, formulating a result that can be used in unbounded Euclidean parameter spaces where the dimension is allowed to increase with the sample size.

2.1. Basic statistical notions

In this section, we formally introduce the statistical setting. We first briefly explain the elements of Bayesian statistics. Let P = {P_θ : θ ∈ Θ} be a set of probability measures on some measurable space X, indexed by some set Θ. In most applications, we will take Θ ⊆ R^d, but for now Θ can be a more general set. We call X the sample space, Θ the parameter space and P the (statistical) model. We assume P is dominated, so there exists a σ-finite measure λ on X such that P_θ ≪ λ for all θ ∈ Θ. Throughout, we write p_θ = dP_θ/dλ for the Radon-Nikodym derivative or density. Since a density completely describes the distribution, we can also regard P as a set of probability densities. Formally, we identify densities that are equal λ-almost everywhere. We endow Θ with a σ-algebra B (e.g. the Borel σ-algebra). In Bayesian statistics, a prior Π is a probability distribution on Θ. The data X is a random variable on X whose conditional distribution is given by X | θ ∼ P_θ. The posterior distribution Π(· | X) is the conditional distribution of θ given X.

In this thesis, we will mostly be concerned with sequences of independent and identically distributed (i.i.d.) random variables X_1, …, X_n. We will also index the prior Π_n by the sample size n. If p_θ(x) is jointly measurable (which we will always assume to be the case), then Bayes' rule asserts that the posterior satisfies

dΠ_n(θ | X_1, …, X_n) ∝ ∏_{i=1}^n p_θ(X_i) dΠ_n(θ).   (2.1)
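For intuition, Bayes' rule (2.1) can be evaluated numerically on a grid when Θ is one-dimensional; the following Python sketch (an illustrative normal location model of my own choosing) computes the posterior density up to normalization and then normalizes it numerically.

import numpy as np

rng = np.random.default_rng(3)
theta0, n = 0.7, 50
X = rng.normal(theta0, 1.0, size=n)           # X_i | theta ~ N(theta, 1)

grid = np.linspace(-2, 2, 2001)
log_prior = -0.5 * grid ** 2                   # prior N(0, 1), up to constants
log_lik = np.array([np.sum(-0.5 * (X - t) ** 2) for t in grid])

log_post = log_lik + log_prior                 # log of (2.1), up to a constant
post = np.exp(log_post - log_post.max())       # stabilize before normalizing
post /= post.sum() * (grid[1] - grid[0])       # normalized posterior density

print(grid[np.argmax(post)])                   # posterior mode, close to theta0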

In this chapter we will analyze the posterior from a frequentist perspective. We will assume there is some true underlying parameter θ_0, that is, X_1, …, X_n i.i.d. ∼ P_{θ_0}. Note that the posterior probability Π_n(B | X_1, …, X_n) is a random variable measurable in X_1, …, X_n, so its distribution depends only on P_{θ_0}.

We will frequently make use of tests, which are measurable functions φ : X^n → [0, 1]. Given a null hypothesis Θ_0 ⊆ Θ and alternative hypothesis Θ_1 ⊆ Θ, the outcome φ(X_1, …, X_n) of a test is the probability of rejecting the null hypothesis. A Type I error occurs if the null hypothesis is rejected while θ_0 ∈ Θ_0, and its probability under P_{θ_0} equals P_{θ_0}^n φ. A Type II error occurs if the null hypothesis is not rejected when in fact Θ_1 contains the underlying parameter. A good test has low error probabilities. One way to measure the quality of a test is through the expression

sup_{θ∈Θ_0} P_θ^n φ + sup_{θ∈Θ_1} P_θ^n(1 − φ).

2.2. A general contraction rate theorem

In this section, we prove a general contraction rate theorem, due to Ghosal, Ghosh and Van der Vaart. The results in this section are taken from [3] without major modifications. In particular, Lemma 2.2, Lemma 2.4 and Theorem 2.5 in this section correspond to Theorem 7.1, Lemma 8.1 and Theorem 2.4 in [3], respectively. The general outline is as follows. First, consistent tests of the truth against convex alternatives with respect to the Hellinger distance always exist. In fact, this is the only property of the Hellinger distance that we will use, and thus any other distance with this property could be used. We extend this to non-convex alternatives by covering the alternative (which is the complement of some neighborhood of the truth) with balls. An entropy condition ensures that the number of balls that we need is not too large. For an introduction to the Hellinger distance we refer to Appendix A. Since the contents of this section are not restricted to parametric models, we mostly write P instead of P_θ and let P_0 be the true underlying distribution. However, it is essential that the model is dominated. We write p and p_0 for the corresponding densities. Throughout, h(P, Q) will denote the Hellinger distance between probability measures P and Q. The following lemma concerns convex alternative hypotheses.

Lemma 2.1. There exists a universal constant K > 0 such that for every P_0, P_1 ∈ P there are tests φ_n that satisfy

P_0^n φ_n ≤ exp(−K n h²(P_0, P_1))

and

sup_{h(P,P_1) < h(P_0,P_1)/2} P^n(1 − φ_n) ≤ exp(−K n h²(P_0, P_1)).

Proof. Let H_0 and H_1 be two sets of probability measures. By Lemma 4 in Chapter 16.4 of [7], the minimax testing error

min_φ ( max_{P∈H_0} P^n φ + max_{Q∈H_1} Q^n(1 − φ) )

between the null hypothesis H_0 and alternative hypothesis H_1 is bounded by ρ^n, where ρ is the maximum Hellinger affinity between the null hypothesis and the alternative. The Hellinger affinity between two densities is defined as

ρ(p, q) = ∫ √(pq) dμ

and satisfies

ρ(p, q) = 1 − ½ h²(p, q).

In our case H_0 = {P_0} and H_1 = {P : h(P, P_1) < h(P_0, P_1)/2}. Let P ∈ H_1, then h(P_0, P) > h(P_0, P_1)/2, so it follows that

ρ(p_0, p) ≤ 1 − h²(P_0, P_1)/8.

Using the inequality 1 − x ≤ exp(−x), we get

ρ^n ≤ ( 1 − h²(P_0, P_1)/8 )^n ≤ exp( −n h²(P_0, P_1)/8 ).

The result follows by taking K = 1/8.
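As a numerical sanity check of the quantities appearing in this proof (my own illustration, not part of [3] or [7]), the following Python sketch computes the Hellinger affinity and squared Hellinger distance between two normal densities by numerical integration and verifies the identity ρ(p, q) = 1 − h²(p, q)/2.

import numpy as np

x = np.linspace(-20, 20, 200001)
dx = x[1] - x[0]

def normal_pdf(x, m, s):
    return np.exp(-0.5 * ((x - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

p0 = normal_pdf(x, 0.0, 1.0)
p1 = normal_pdf(x, 1.0, 1.0)

rho = np.sum(np.sqrt(p0 * p1)) * dx                   # Hellinger affinity
h2 = np.sum((np.sqrt(p0) - np.sqrt(p1)) ** 2) * dx    # squared Hellinger distance

print(rho, 1 - 0.5 * h2)   # the two numbers should agree up to grid error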

The following (non-asymptotic) lemma involves an entropy condition. See Appendix B for an introduction to this concept. The packing number D(ε, Q, d) of some set Q is the maximum number of points we can fit in Q such that the distance d between each pair of points is at least ε. In the lemma, we will cover the annuli around the truth with balls and use the previous result.

Lemma 2.2. Suppose D : [0, ∞) → [0, ∞) is a non-increasing function, ε_n ≥ 0, and that for every δ > ε_n we have

D( δ/2, {P ∈ P : δ ≤ h(P, P_0) ≤ 2δ}, h ) ≤ D(δ).   (2.2)

Then for every ε > ε_n there exist tests φ_n (depending on ε) and a universal constant K > 0 such that for all j ∈ N

P_0^n φ_n ≤ D(ε) exp(−K n ε²) / (1 − exp(−K n ε²))   (2.3)

and

sup_{h(P,P_0) > jε} P^n(1 − φ_n) ≤ exp(−K n ε² j²).   (2.4)

Proof. Let ε > ε_n be given and, for every j ∈ N, let S_j = {P : jε < h(P, P_0) ≤ (j + 1)ε}. Let S_j' = {P_{j,1}, …, P_{j,N_j}} be a maximal jε/2-packing of S_j. This implies the following three facts. First, for all i,

jε < h(P_{j,i}, P_0) ≤ (j + 1)ε.   (2.5)

Second, for every i ≠ k,

h(P_{j,i}, P_{j,k}) ≥ jε/2.   (2.6)

Third, for every P ∈ S_j there exists i such that

h(P, P_{j,i}) < jε/2.   (2.7)

Since also h(P_{j,i}, P_0) ≤ 2jε (assuming j ≥ 1), we have that |S_j'| = N_j ≤ D(εj), because of the entropy condition (2.2). By Lemma 2.1, for every i = 1, …, N_j there exists a test φ_{n,j,i} such that

P_0^n φ_{n,j,i} ≤ exp(−K n h²(P_0, P_{j,i}))   (2.8)

and

sup_{h(P,P_{j,i}) < h(P_0,P_{j,i})/2} P^n(1 − φ_{n,j,i}) ≤ exp(−K n h²(P_0, P_{j,i})).   (2.9)

Let φ_n = sup_j max_{i=1,…,N_j} φ_{n,j,i}, which means that φ_n rejects P_0 if at least one of the φ_{n,j,i} rejects. Using (2.8), we get

P_0^n φ_n = P_0^n( sup_j max_{i=1,…,N_j} φ_{n,j,i} ) ≤ P_0^n( Σ_{j=1}^∞ Σ_{i=1}^{N_j} φ_{n,j,i} ) = Σ_{j=1}^∞ Σ_{i=1}^{N_j} P_0^n φ_{n,j,i} ≤ Σ_{j=1}^∞ Σ_{i=1}^{N_j} exp(−K n h²(P_0, P_{j,i})) ≤ Σ_{j=1}^∞ Σ_{i=1}^{N_j} exp(−K n j²ε²) ≤ Σ_{j=1}^∞ D(εj) exp(−K n j²ε²) ≤ D(ε) Σ_{j=1}^∞ exp(−K n j²ε²) ≤ D(ε) Σ_{j=1}^∞ exp(−K n jε²) = D(ε) exp(−K n ε²) / (1 − exp(−K n ε²)),

where we use (2.5) and the fact that D is non-increasing. Note that ∪_{i≥j} S_i = {P : h(P, P_0) > jε}. Also, φ_{n,i,k} tests against the ball B(P_{i,k}, iε/2), and for each i ≥ j these balls cover S_i, see (2.7). Therefore, (2.9) implies that

sup_{h(P,P_0)>jε} P^n(1 − φ_n) ≤ sup_{i≥j} max_{k=1,…,N_i} sup_{h(P,P_{i,k}) < iε/2} P^n(1 − φ_{n,i,k}) ≤ sup_{i≥j} max_{k=1,…,N_i} exp(−K n h²(P_0, P_{i,k})) ≤ sup_{i≥j} exp(−K n i²ε²) ≤ exp(−K n j²ε²),

as we wanted.


Note that the previous proof did not make use of the speed of the sequence ε_n at all, as it was non-asymptotic in n. To state the general theorem, we first define a neighborhood in terms of the Kullback-Leibler divergence and a related quantity.

Definition 2.3. For ε > 0, define

B(ε) := { P ∈ P : −P_0( log(p/p_0) ) ≤ ε², P_0( (log(p_0/p))² ) ≤ ε² }.

The next lemma gives a probability bound for probability measures on B(ε).

Lemma 2.4. Let ε > 0 and Π a probability measure on B(ε). Then for every C > 0,

P_0^n( ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(P) ≤ exp(−(1 + C)nε²) ) ≤ 1/(C² n ε²).

Proof. Let G_n be the empirical process operator, that is,

G_n f(X) := √n ( (1/n) Σ_{i=1}^n f(X_i) − ∫ f(x) dP_0(x) ).

Under P_0, G_n f(X) has mean zero and its variance is equal to the variance of f(X_1). We enlarge the event of interest by applying Jensen's inequality to the logarithm. This gives

{ ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(P) ≤ exp(−(1 + C)nε²) }
= { log ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(P) ≤ −(1 + C)nε² }
⊆ { Σ_{i=1}^n ∫ log(p/p_0)(X_i) dΠ(P) ≤ −(1 + C)nε² }
= { G_n ∫ log(p/p_0)(X_i) dΠ(P) ≤ −√n(1 + C)ε² − √n P_0 ∫ log(p/p_0)(X) dΠ(P) }
⊆ { G_n ∫ log(p/p_0)(X_i) dΠ(P) ≤ −√n C ε² },

where the last inclusion uses that −P_0 ∫ log(p/p_0) dΠ(P) ≤ ε², since Π is supported on B(ε). Fubini's theorem and Chebychev's inequality thus give

P_0^n( ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ(P) ≤ exp(−(1 + C)nε²) )
≤ P_0^n( G_n ∫ log(p/p_0)(X_i) dΠ(P) ≤ −√n C ε² )
≤ (1/(nC²ε⁴)) Var_{P_0} ∫ log(p/p_0)(X_i) dΠ(P)
≤ (1/(nC²ε⁴)) P_0( ∫ log(p/p_0)(X_i) dΠ(P) )²
≤ (1/(nC²ε⁴)) P_0 ∫ ( log(p/p_0)(X_i) )² dΠ(P)
≤ 1/(nC²ε²),

where we applied Jensen's inequality to the square, and used that the support of Π lies in B(ε).

Before we state the general theorem, it is important to note that formally, our setting is different from the one in [3]. We want to include the dimension d in the convergence rate. To make sense of asymptotic statements, we need to assume that d depends on n. Therefore, P_0^n will not be the n-fold product of some probability measure P_0 that is independent of n. Instead, P_0 depends on d, so also on n. Even the parameter space indirectly depends on n. All the previous results in this section are statements for fixed n, so no problem occurs. Although the final statement in the next theorem is asymptotic, the original proof still goes through. The majority of the proof is still for fixed n. When we finally take the limit n → ∞, we only make use of condition (2.11), and the other limits follow from explicit bounds that we found for fixed n.

Theorem 2.5. Let ε_n be a sequence such that lim_{n→∞} ε_n = 0 and lim inf_{n→∞} nε_n² > 0. Assume there exist a fixed J > 0 and a sequence (P_n) of subsets P_n ⊆ P such that for every ε ≥ ε_n and for all j ≥ J the following conditions hold:

log D( ε/2, {P ∈ P_n : ε ≤ h(P, P_0) ≤ 2ε}, h ) ≤ nε_n²,   (2.10)

Π_n(P \ P_n) / Π_n(B(ε_n)) = o( exp(−2nε_n²) ),   (2.11)

Π_n(P : jε_n < h(P, P_0) ≤ 2jε_n) / Π_n(B(ε_n)) ≤ exp( K nε_n² j² / 2 ).   (2.12)

Here K is the universal constant that appears in Lemma 2.1. Then for every sequence (M_n) such that lim_{n→∞} M_n = ∞ it holds that

Π_n(P : h(P, P_0) ≥ M_n ε_n | X_1, …, X_n) → 0 in P_0^n-probability.   (2.13)


Proof. Let n ∈ N be given. By (2.10), the assumptions of Lemma 2.2 are satisfied with D(ε) = exp(nε_n²) (so D is a constant function). Let M ≥ 1 be a constant (independent of n), to be chosen later. Taking ε = Mε_n in Lemma 2.2, there exist tests φ_n such that

P_0^n φ_n ≤ exp(nε_n²) exp(−KnM²ε_n²) / (1 − exp(−KnM²ε_n²))   (2.14)

and for all j ∈ N,

sup_{P∈P_n : h(P,P_0) > Mjε_n} P^n(1 − φ_n) ≤ exp(−KnM²ε_n²j²).   (2.15)

Let S_{n,j} = {P ∈ P_n : Mε_n j < h(P, P_0) ≤ Mε_n(j + 1)}, then it follows from (2.15) that

P_0^n[ ∫_{S_{n,j}} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(P) (1 − φ_n) ]
= ∫_{S_{n,j}} P_0^n[ ∏_{i=1}^n (p/p_0)(X_i)(1 − φ_n) ] dΠ_n(P)
= ∫_{S_{n,j}} ∫_{X^n} ∏_{i=1}^n (p/p_0)(X_i)(1 − φ_n) ∏_{i=1}^n p_0(X_i) dλ^n dΠ_n(P)
= ∫_{S_{n,j}} ∫_{X^n} ∏_{i=1}^n p(X_i)(1 − φ_n) dλ^n dΠ_n(P)
= ∫_{S_{n,j}} P^n(1 − φ_n) dΠ_n(P)
≤ exp(−KnM²ε_n²j²) Π_n(S_{n,j}).   (2.16)

Next, fix some C_0 ≥ 1 and consider the probability measure obtained by restricting Π_n to B(ε_n) and then normalizing it. Then by Lemma 2.4 there exists an event A_n with P_0^n-probability at least 1 − (nε_n²C_0²)^{−1} on which

(1/Π_n(B(ε_n))) ∫_{B(ε_n)} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(P) ≥ exp(−(1 + C_0)nε_n²),

which implies

∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(P) ≥ exp(−2C_0 nε_n²) Π_n(B(ε_n)).   (2.17)

Using Bayes' rule, we then get for J ∈ N that

P_0^n[ Π_n(P ∈ P_n : h(P, P_0) > Jε_nM | X_1, …, X_n)(1 − φ_n) 1_{A_n} ]
= P_0^n[ 1_{A_n} (1 − φ_n) ( ∫_{{P∈P_n : h(P,P_0)>Jε_nM}} ∏_{i=1}^n p(X_i) dΠ_n(P) ) / ( ∫ ∏_{i=1}^n p(X_i) dΠ_n(P) ) ]
= P_0^n[ 1_{A_n} (1 − φ_n) ( ∫_{{P∈P_n : h(P,P_0)>Jε_nM}} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(P) ) / ( ∫ ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(P) ) ].   (2.18)


The denominator can be bounded from below by a deterministic term on A_n, as in (2.17). Continuing with the numerator only, we use (2.16) to get

P_0^n[ 1_{A_n}(1 − φ_n) ∫_{{P∈P_n : h(P,P_0)>Jε_nM}} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(P) ]
≤ P_0^n[ (1 − φ_n) Σ_{j=J}^∞ ∫_{S_{n,j}} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(P) ]
= Σ_{j=J}^∞ P_0^n[ (1 − φ_n) ∫_{S_{n,j}} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(P) ]
≤ Σ_{j=J}^∞ exp(−KnM²ε_n²j²) Π_n(S_{n,j}).

If we combine the above with (2.18) and the bound on the denominator (2.17), we thus get

P_0^n[ Π_n(P ∈ P_n : h(P, P_0) > Jε_nM | X_1, …, X_n)(1 − φ_n) 1_{A_n} ] ≤ Σ_{j=J}^∞ exp(−KnM²ε_n²j²) Π_n(S_{n,j}) / ( exp(−2C_0 nε_n²) Π_n(B(ε_n)) ).

Applying (2.12) with j replaced by Mj (since j + 1 ≤ 2j), it follows that

lim_{J→∞} P_0^n[ Π_n(P ∈ P_n : h(P, P_0) > Jε_nM | X_1, …, X_n)(1 − φ_n) 1_{A_n} ]
≤ lim_{J→∞} Σ_{j=J}^∞ exp(−KnM²ε_n²j²) exp(Knε_n²M²j²/2) / exp(−2C_0 nε_n²)
≤ exp(2C_0 nε_n²) lim_{J→∞} Σ_{j=J}^∞ exp(−KnM²ε_n²j/2) = 0,   (2.19)

for M sufficiently large so that the geometric series converges. This is possible since nε_n² > 0. Using the same bound on the denominator, we also get

P_0^n[ Π_n(P ∉ P_n | X_1, …, X_n)(1 − φ_n) 1_{A_n} ]
≤ ( exp(−2C_0 nε_n²) Π_n(B(ε_n)) )^{−1} P_0^n[ ∫_{P\P_n} ∏_{i=1}^n (p/p_0)(X_i) dΠ_n(P) ]
= ( exp(−2C_0 nε_n²) Π_n(B(ε_n)) )^{−1} ∫_{P\P_n} P_0^n[ ∏_{i=1}^n (p/p_0)(X_i) ] dΠ_n(P)
= ( exp(−2C_0 nε_n²) Π_n(B(ε_n)) )^{−1} Π_n(P \ P_n).   (2.20)


We are ready to finish the argument. Let γ, δ > 0. Choose C_0 ≥ 1 large enough such that P_0^n(A_n) ≥ 1 − γδ/4 for all n ≥ N_1. This is possible because nε_n² is bounded away from 0 (for n large enough). By assumption (2.11), (2.20) converges to 0 as n → ∞, i.e. it is smaller than δγ/4 for n ≥ N_2. Since M_n → ∞, we can also take N_3 such that M_n ≥ JM for n ≥ N_3. Also suppose that the expression in the limit in (2.19) is smaller than δγ/4 for n ≥ N_4, and assume that the Type I error (2.14) is smaller than δγ/4 for n ≥ N_5 (if necessary, take M larger). Using Markov's inequality, it follows that for n ≥ N := max(N_1, N_2, N_3, N_4, N_5),

P_0^n( Π_n(h(P, P_0) ≥ M_nε_n | X_1, …, X_n) > γ )
≤ γ^{−1} P_0^n[ Π_n(h(P, P_0) ≥ M_nε_n | X_1, …, X_n) ]
≤ γ^{−1} P_0^n[ Π_n(h(P, P_0) ≥ JMε_n | X_1, …, X_n) ]
≤ γ^{−1} P_0^n[ Π_n(P ∈ P_n : h(P, P_0) ≥ JMε_n | X_1, …, X_n)(1 − φ_n) 1_{A_n} ] + γ^{−1} P_0^n[ Π_n(P \ P_n | X_1, …, X_n)(1 − φ_n) 1_{A_n} ] + γ^{−1} P_0^n[ 1_{X^n \ A_n} ] + γ^{−1} P_0^n[ φ_n ]
≤ δ.

We conclude that Π_n(h(P, P_0) > M_nε_n | X_1, …, X_n) → 0 in P_0^n-probability.

2.3. The finite-dimensional case

In Section 5 of [3], the authors apply Theorem 2.5 to show that, under some conditions, the posterior distribution in finite-dimensional models contracts at rate 1/√n. In their approach, however, it is essential that d is fixed, and they require the parameter space to be bounded. If d increases with n, the parameter space also increases with n, so a uniformly bounded parameter space no longer has a clear meaning. It turns out that for increasing dimension, we need to add a logarithmic factor, so we can only obtain a slightly slower rate ε_n = M√(d log n / n). This allows us to make a suitable sieve construction. In practice, this logarithmic factor does not make a big difference. In this section, we will assume that d is non-decreasing in n and d = o(n / log n). We first need a preliminary result on packing numbers in R^d.

Lemma 2.6. For every ε > 0 and 2l > k > 0 we have

D(kε, {x ∈ R^d : ||x − a|| ≤ lε}, ||·||) ≤ (6l/k)^d.   (2.21)

Proof. This proof is adapted from Lemma 4.1 in [14]. Let x_1, …, x_m be a kε-packing of B(a, lε). Then ||x_i − x_j|| > kε for all i ≠ j, so the balls B(x_i, kε/2) are disjoint. Take x ∈ B(x_i, kε/2) for some i, then

||x − x_1|| ≤ ||x − x_i|| + ||x_i − x_1|| ≤ kε/2 + 2lε ≤ 3lε.

So the balls B(x_i, kε/2) are contained in B(x_1, 3lε). We know that the volume of a ball with radius r equals r^d times the volume V of the unit ball in R^d. Therefore,

m (kε/2)^d V ≤ (3lε)^d V,

and the result follows.

The parameters in Theorem 2.5 give some freedom in the exact result one wants to prove, especially when making use of the sieves P_n. We have formulated the following theorem in such a way that the conditions on the prior will at least be satisfied by a normal prior.

Theorem 2.7. Suppose that ||θ_0|| = O(√d) and that there exists some constant C > 0 such that for all θ with ||θ − θ_0|| ≤ 1 we have

max( −P_{θ_0}( log(p_θ/p_{θ_0}) ), P_{θ_0}( (log(p_θ/p_{θ_0}))² ) ) ≤ C||θ − θ_0||².   (2.22)

Also, assume that there exist constants a, A > 0 such that for all θ_1, θ_2 ∈ R^d with ||θ_1||, ||θ_2|| ≤ K,

(a/K)||θ_1 − θ_2|| ≤ h(p_{θ_1}, p_{θ_2}) ≤ A||θ_1 − θ_2||.   (2.23)

Next, assume that the prior Π_n satisfies the tail bound

Π_n({θ ∈ R^d : ||θ|| ≥ K}) ≤ B e^{−K²/4} 2^{d/2}   (2.24)

for some constant B > 0. Finally, assume that there exist small constants α, β > 0 such that the prior density π satisfies

π(θ) ≥ α^d e^{−βK²}   (2.25)

whenever ||θ|| ≤ K. Then

Π_n( ||θ − θ_0|| ≥ M_n √(d log n / n) | X_1, …, X_n ) → 0 in P_{θ_0}^n-probability

for every M_n → ∞.

Proof. Set ε_n = M√(d log n / n) for some large constant M. Then nε_n² = M² d log n. In particular, ε_n → 0 and nε_n² is bounded away from 0. We check conditions (2.10), (2.11) and (2.12) of Theorem 2.5. We set P_n = {P_θ : ||θ|| ≤ n}. Assume n is large enough such that P_{θ_0} ∈ P_n. For (2.10), we use assumption (2.23) and Lemma 2.6 to find

D( ε/2, {P ∈ P_n : ε ≤ h(P, P_0) ≤ 2ε}, h ) ≤ D( ε/2, {θ ∈ R^d : (a/n)||θ − θ_0|| ≤ 2ε}, A||·|| ) = D( ε/(2A), {θ ∈ R^d : ||θ − θ_0|| ≤ 2nε/a}, ||·|| ) ≤ (24An/a)^d.

It follows that

log D( ε/2, {P ∈ P_n : ε ≤ h(P, P_0) ≤ 2ε}, h ) ≤ d log(24An/a) ≤ M² d log n = nε_n²,

if we take M > 0 large enough. This settles condition (2.10). For the second condition, we first use (2.24) to find

Π_n(P \ P_n) ≤ B e^{−n²/4} 2^{d/2}.   (2.26)

Now suppose n is large enough such that ε_n < 1. Since ||θ_0||² = O(d), there exists some large constant L > 0 such that

{θ ∈ R^d : ||θ − θ_0|| ≤ ε_n} ⊆ {θ ∈ R^d : ||θ|| ≤ L d^{1/2}}.

We then use condition (2.22) to lower bound the denominator:

Π_n(B(ε_n)) ≥ Π_n( ||θ − θ_0||² ≤ C^{−1}ε_n² ) ≥ ∫_{{||θ−θ_0||² ≤ C^{−1}ε_n²}} π(θ) dθ ≥ C^{−d/2} ε_n^d Vol(B_d(0, 1)) α^d e^{−βL²d} = C^{−d/2} ε_n^d π^{d/2} α^d e^{−βL²d} / Γ(d/2 + 1).

Taking the quotient, we thus find

Π_n(P \ P_n) / Π_n(B(ε_n)) ≤ B e^{−n²/4} 2^{d/2} Γ(d/2 + 1) / ( C^{−d/2} ε_n^d π^{d/2} α^d e^{−βL²d} ) ≤ R e^{−n²/8},

for some large constant R > 0. The existence of this constant follows since eventually e^{−n²/4} will dominate the other factors. Also note that the sequence we are left with is o(n^{−2M²d}), hence condition (2.11) follows. For (2.12), we can just bound the numerator by 1. We thus get

1 / Π_n(B(ε_n)) ≤ C^{d/2} Γ(d/2 + 1) e^{βL²d} / ( ε_n^d π^{d/2} α^d ) ≤ R n^{Rd}

for some new large constant R > 0. Since in our case the right-hand side of (2.12) reduces to n^{KM²dj²/2}, the inequality can be seen to hold for j sufficiently large.

We show that a Gaussian prior satisfies the prior mass conditions in Theorem 2.7.

Proposition 2.8. Let Π_n = N_d(0, I_d), then (2.24) and (2.25) hold.

Proof. For (2.24), we use a Chernoff-type bound. Using that ||θ||² ∼ χ²(d), it is straightforward to show that E e^{||θ||²/4} = 2^{d/2}. Using Markov's inequality, we get

Π(||θ|| ≥ K) = Π( e^{||θ||²/4} ≥ e^{K²/4} ) ≤ e^{−K²/4} E e^{||θ||²/4} = e^{−K²/4} 2^{d/2}.

The prior mass condition (2.25) follows by taking α = (2π)^{−1/2} and β = 1/2.
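The tail bound (2.24) for the standard Gaussian prior can also be checked by simulation; the following Python sketch (my own illustration, with B = 1 and an arbitrary choice of d and K) compares the empirical probability Π(||θ|| ≥ K) with the bound e^{−K²/4} 2^{d/2}.

import numpy as np

rng = np.random.default_rng(4)
d, K, n_mc = 10, 5.0, 200_000

theta = rng.normal(size=(n_mc, d))                    # draws from N_d(0, I_d)
emp = np.mean(np.linalg.norm(theta, axis=1) >= K)     # empirical Pi(||theta|| >= K)
bound = np.exp(-K ** 2 / 4) * 2 ** (d / 2)            # bound from (2.24) with B = 1

print(emp, bound)   # the empirical probability should not exceed the bound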


3. Posterior contraction rates: a diffusion approach

In this chapter, we study the behavior of the posterior distribution using a technique based on the Langevin diffusion. In Section 3.1, we use the results of Chapter 1 to prove a result that quantifies the level of concentration of the posterior distribution. In Section 3.2, this result is compared to the result obtained in Chapter 2 through the classical approach.

3.1. The main result

The material in this section is based on the unpublished paper [11]. There are however some significant differences in the conditions and proof. We consider a rescaled version of the Langevin diffusion

dθ_t = −∇U(θ_t) dt + √(2/β) dW_t   (3.1)

with initial value θ_0 and β > 0. We restate the main result of Chapter 1.

Proposition 3.1. Suppose U is smooth with a Lipschitz continuous gradient, and there exist constants k, c, C > 0 such that for θ large enough

⟨θ, ∇U(θ)⟩ ≥ c||θ||^{2k}   (3.2)

and

|ΔU(θ)| ≤ C||θ||^{2k−2}.   (3.3)

Then for each t > 0, θ_t admits a smooth density y ↦ p_t(y), and there exists a sequence 0 = t_0 < t_1 < … such that p_{t_k} → Z^{−1} exp(−βU) almost everywhere as k → ∞, where Z = ∫ exp(−βU).

Proof. Let X_t = √β θ_t, then

dX_t = −∇V(X_t) dt + √2 dW_t,

where V(x) = βU(x/√β). It is clear that V satisfies the assumptions of Theorem 1.24. In particular, X_t has a smooth density q_t such that lim_{n→∞} q_{t_n} = W^{−1} exp(−V) almost everywhere, for some sequence (t_n), where W = ∫ exp(−V). We know by an elementary transformation formula that p_t(θ) = β^{d/2} q_t(√β θ), so clearly p_{t_k} → β^{d/2} W^{−1} exp(−βU) = Z^{−1} exp(−βU) almost everywhere as k → ∞, as we wanted.


We consider a statistical model P = {P_θ : θ ∈ R^d} and assume that P is dominated. As before, we write lower-case letters for densities. Let X_1, …, X_n i.i.d. ∼ P_{θ_0}. Define F_n, F : Θ → R by

F_n(θ) = (1/n) Σ_{i=1}^n log p_θ(X_i)   and   F(θ) = E_{θ_0} log p_θ(X_1).

Also, let Π be a prior distribution on R^d with positive density π. Then by Bayes' rule, the posterior distribution of θ, given X_1, …, X_n | θ i.i.d. ∼ P_θ, has a density that satisfies

π(θ | X_1, …, X_n) ∝ ∏_{i=1}^n p_θ(X_i) π(θ) = e^{nF_n(θ) + log π(θ)}.   (3.4)

Let δ ∈ (0, 1) be given. The main result of this section will be non-asymptotic in n (and d). Since we will keep track of all constants, the constants may in principle depend on n, d, δ, θ_0, the prior, and the statistical model. In fact, the constants may also depend on the data X_1, …, X_n, but this will usually only be the case for L_3. We have the following conditions on the prior and the model.

Assumption 3.2. There are constants L_1, L_2, L_3, c, B, μ, ε_1, ε_2 > 0 such that on a single event with probability at least 1 − δ, the following are true. The functions ∇F, ∇ log π and ∇F_n are Lipschitz, i.e. for all θ_1, θ_2 ∈ Θ,

||∇F(θ_1) − ∇F(θ_2)|| ≤ L_1 ||θ_1 − θ_2||,   (3.5)

||∇ log π(θ_1) − ∇ log π(θ_2)|| ≤ L_2 ||θ_1 − θ_2||,   (3.6)

||∇F_n(θ_1) − ∇F_n(θ_2)|| ≤ L_3 ||θ_1 − θ_2||.   (3.7)

There exist constants k, c, C > 0 such that, for all θ outside some compact set,

⟨θ, ∇ log π(θ)⟩ ≤ −c||θ||^{2k},   (3.8)

|Δ log π(θ)| ≤ C||θ||^{2k−2},   (3.9)

⟨θ, ∇F_n(θ)⟩ ≤ −c||θ||^{2k},   (3.10)

|ΔF_n(θ)| ≤ C||θ||^{2k−2}.   (3.11)

Also, log π and F_n are smooth. For all θ ∈ R^d,

⟨∇ log π(θ), θ − θ_0⟩ ≤ B.   (3.12)

Next, F is strongly concave, that is,

h∇F (θ), θ0− θi ≥ µ||θ − θ0||2. (3.13) Finally, for all θ ∈ Rd


In practical examples, condition (3.14) will be the hardest to check. Especially in non-linear models, one might need heavy machinery from empirical process theory to verify it. Conditions (3.5), (3.7) and (3.13) impose some limitations on the applicability of the main theorem. Since we can always try to choose a prior with desirable properties, condition (3.12) is not our biggest concern for now. Conditions (3.8), (3.9), (3.10) and (3.11) are needed to establish the convergence of the Langevin diffusion. These conditions are not present in [11], but some lower bound on the growth of −Fn and − log π will be necessary. Condition (3.7) is also missing in [11], and it is possible that a global Lipschitz condition is not necessary. Even if a local Lipschitz condition were to suffice, we would in any case need a condition that limits the growth of Fn for large θ.
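As a simple illustration of how (3.14) can be verified (the Gaussian location model Pθ = Nd(θ, Id) below is an assumption made only for this example), one has ∇Fn(θ) − ∇F(θ) = X̄n − θ0 for every θ, so (3.14) holds with ε1 = 0 and ε2 = ||X̄n − θ0||, which is of order √(d/n) with high probability. The following sketch checks this numerically.

import numpy as np

# Numerical illustration of (3.14) in an assumed Gaussian location model
# P_theta = N_d(theta, I_d): grad F_n(theta) - grad F(theta) = Xbar - theta0 for all
# theta, so eps_1 = 0 and eps_2 = ||Xbar - theta0||, which should scale like sqrt(d/n).
rng = np.random.default_rng(3)
d, n, n_rep = 10, 400, 2_000
theta0 = np.zeros(d)

eps2 = np.empty(n_rep)
for r in range(n_rep):
    X = theta0 + rng.standard_normal((n, d))
    eps2[r] = np.linalg.norm(X.mean(axis=0) - theta0)

print("median eps_2 over replications:", np.median(eps2))
print("sqrt(d/n)                     :", np.sqrt(d / n))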

In general, B and µ may very well depend on d, n and θ0. The constants L1, L2, L3 and

c will not appear in the final result, so we are not interested in their values. However, for the next theorem to have any significance, we want ε2 ∼ √(d/n) and ε1 ↓ 0. Without loss of generality, we can take ε1 and ε2 to be decreasing in δ. We need one more preliminary lemma.

Lemma 3.3. Suppose Assumptions 3.2 hold. Define

U(θ) = −(1/2) Fn(θ) − (1/(2n)) log π(θ)   (3.15)

and set β = 2n. Then the requirements of Proposition 3.1 hold. It follows that there exists an increasing sequence (t_k) such that the densities of θ_{t_k} in (3.1) converge to the posterior density, with probability at least 1 − δ.

Proof. It is clear from Bayes' rule (3.4) that the density proportional to exp(−βU) equals the posterior density. Furthermore, since ∇U = −(1/2)∇Fn − (1/(2n))∇ log π, conditions (3.10) and (3.8) imply (3.2), and conditions (3.11) and (3.9) imply (3.3). Similarly, conditions (3.6) and (3.7) imply that ∇U is Lipschitz.
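The convergence statement above suggests a simple approximate sampling scheme: discretize (3.1) with potential (3.15) and β = 2n by Euler–Maruyama, i.e. an unadjusted-Langevin-type algorithm targeting the posterior. The sketch below does this for an assumed conjugate Gaussian location model Pθ = Nd(θ, Id) with prior Nd(0, Id); the model, step size and run length are illustrative choices, not taken from the thesis.

import numpy as np

# Euler-Maruyama for (3.1) with potential (3.15) and beta = 2n, targeting the posterior
# of an assumed Gaussian location model P_theta = N_d(theta, I_d) with prior N_d(0, I_d).
rng = np.random.default_rng(4)
d, n = 2, 100
theta_true = np.array([1.0, -0.5])
X = theta_true + rng.standard_normal((n, d))
Xbar = X.mean(axis=0)

def grad_U(theta):
    # grad F_n(theta) = Xbar - theta and grad log pi(theta) = -theta, so
    # grad U(theta) = -(1/2)(Xbar - theta) + theta/(2n).
    return -0.5 * (Xbar - theta) + theta / (2 * n)

beta = 2 * n
dt, n_steps, burn_in = 1e-3, 100_000, 50_000
theta = np.zeros(d)
samples = []
for step in range(n_steps):
    theta = theta - grad_U(theta) * dt + np.sqrt(2 * dt / beta) * rng.standard_normal(d)
    if step >= burn_in:
        samples.append(theta.copy())
samples = np.array(samples)

# Conjugacy gives the exact posterior N(n*Xbar/(n+1), I_d/(n+1)) for comparison.
print("Langevin mean:", samples.mean(axis=0), " exact mean:", n * Xbar / (n + 1))
print("Langevin var :", samples.var(axis=0), " exact var :", 1 / (n + 1))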

Note that in the next result, δ > 0 is both a confidence- and accuracy parameter. The proof and final result are slightly different from [11], as not all their steps were clear. Among other things, we only consider the second moment of the Langevin diffusion, as opposed to pth moments.

Theorem 3.4. Suppose Assumptions 3.2 hold and assume that 4ε1 ≤ µ. Then

Π( ||θ − θ0|| ≥ (2/√δ) ( √((B + d)/(nµ)) + ε2/µ )  |  X1, . . . , Xn ) ≤ δ

with probability at least 1 − δ.

Proof. Assume that the data X1, . . . , Xn i.i.d. ∼ Pθ0 are such that all probabilistic statements in Assumptions 3.2 hold, which happens with probability at least 1 − δ. In the remainder of the proof we treat the data (X1, . . . , Xn) as given, but formally all statements below should be read conditionally on the data. Let (θt)t≥0 denote the solution of (3.1) with potential function (3.15) and initial condition θ0. The proof is divided into three steps.

Step 1 (decomposition). For given α > 0, we decompose e^{2αt}||θt − θ0||^2 into a local martingale and overhead. An application of Itô's formula gives

e^{2αt}||θt − θ0||^2 = 2α ∫_0^t e^{2αs}||θs − θ0||^2 ds + 2 ∫_0^t e^{2αs} ⟨θs − θ0, dθs⟩ + (d/n) ∫_0^t e^{2αs} ds
 = 2α ∫_0^t e^{2αs}||θs − θ0||^2 ds + ∫_0^t e^{2αs} ⟨θs − θ0, ∇Fn(θs) + (1/n)∇ log π(θs)⟩ ds
  + (2/√n) ∫_0^t e^{2αs} ⟨θs − θ0, dWs⟩ + (d/n) ∫_0^t e^{2αs} ds.

Let M be the local martingale with M0 = 0 and dMs = 2e^{2αs} ⟨θs − θ0, dWs⟩. We add and subtract ∇F(θs) and use the Cauchy–Schwarz inequality, (3.12) and (3.13) to bound the overhead:

e^{2αt}||θt − θ0||^2 − Mt/√n
 = 2α ∫_0^t e^{2αs}||θs − θ0||^2 ds + ∫_0^t e^{2αs} ⟨θs − θ0, ∇F(θs)⟩ ds
  + ∫_0^t e^{2αs} ⟨θs − θ0, ∇Fn(θs) − ∇F(θs)⟩ ds + (1/n) ∫_0^t e^{2αs} ⟨θs − θ0, ∇ log π(θs)⟩ ds + (d/n) ∫_0^t e^{2αs} ds
 ≤ 2α ∫_0^t e^{2αs}||θs − θ0||^2 ds − µ ∫_0^t e^{2αs}||θs − θ0||^2 ds
  + ∫_0^t e^{2αs}||θs − θ0|| ||∇Fn(θs) − ∇F(θs)|| ds + (B/n) ∫_0^t e^{2αs} ds + (d/n) ∫_0^t e^{2αs} ds.

Using (3.14) and the fact that ε2 x ≤ (µ/2) x^2 + ε2^2/(2µ) (since (x − ε2/µ)^2 ≥ 0), we have

∫_0^t e^{2αs}||θs − θ0|| ||∇Fn(θs) − ∇F(θs)|| ds ≤ ∫_0^t e^{2αs}||θs − θ0|| (ε1||θs − θ0|| + ε2) ds
 ≤ (ε1 + µ/2) ∫_0^t e^{2αs}||θs − θ0||^2 ds + (ε2^2/(2µ)) ∫_0^t e^{2αs} ds.

Putting everything together (and computing the last integral), we get

e^{2αt}||θt − θ0||^2 ≤ (2α − µ + ε1 + µ/2) ∫_0^t e^{2αs}||θs − θ0||^2 ds + (B/n + d/n + ε2^2/(2µ)) (e^{2αt} − 1)/(2α) + Mt/√n.


Setting α = µ/4 − ε1/2, the first term vanishes. (This is where the assumption on ε1 is used; it guarantees α ≥ µ/8 > 0.) We define Un = B/n + d/n + ε2^2/(2µ), so that

e^{2αt}||θt − θ0||^2 ≤ Mt/√n + Un (e^{2αt} − 1)/(2α).   (3.16)

Step 2 (expectation). We want to argue that E Mt = 0. This follows if M is a true martingale, for which we need

E ∫_0^t 4e^{4αs}||θs − θ0||^2 ds < ∞.   (3.17)

Clearly, conditions (3.7) and (3.6) imply that the coefficients of the Langevin SDE are Lipschitz. From Theorem 5.2.9 of [6], it follows that there exists a constant C > 0 (depending on t) such that E||θs||^2 ≤ C e^{Cs} for all s ≤ t. It follows that

∫_0^t 4e^{4αs} E||θs − θ0||^2 ds ≤ ∫_0^t 4e^{4αs} (2E||θs||^2 + 2||θ0||^2) ds ≤ 8C ∫_0^t e^{(4α+C)s} ds + 8||θ0||^2 ∫_0^t e^{4αs} ds < ∞.

By Fubini's theorem, we thus have

E ∫_0^t 4e^{4αs}||θs − θ0||^2 ds = ∫_0^t 4e^{4αs} E||θs − θ0||^2 ds < ∞,

so M is indeed a martingale. From (3.16), we get

e^{2αt} E||θt − θ0||^2 ≤ Un (e^{2αt} − 1)/(2α),

which implies

E||θt − θ0||^2 ≤ Un (1 − e^{−2αt})/(2α).

Since α > 0, letting t → ∞ we get

lim sup_{t→∞} E||θt − θ0||^2 ≤ Un/(2α).   (3.18)

Step 3 (final steps). By Chebyshev's inequality, we have

Π(||θ − θ0|| > ρ | X1, . . . , Xn) ≤ ρ^{−2} E[ ||θ − θ0||^2 | X1, . . . , Xn ],   (3.19)

for any ρ > 0. By Lemma 3.3, θt has a density ft : R^d → R such that, along some increasing sequence (t_k), the densities f_{t_k} converge a.e. to the posterior density. Hence, by Fatou's lemma,

E[ ||θ − θ0||^2 | X1, . . . , Xn ] = ∫ ||θ − θ0||^2 π(θ | X1, . . . , Xn) dθ
 = ∫ ||θ − θ0||^2 lim_{k→∞} f_{t_k}(θ) dθ
 ≤ lim inf_{k→∞} ∫ ||θ − θ0||^2 f_{t_k}(θ) dθ
 = lim inf_{k→∞} E[ ||θ_{t_k} − θ0||^2 | X1, . . . , Xn ].

Combining this with (3.18) and (3.19), we get

Π(||θ − θ0|| > ρ | X1, . . . , Xn) ≤ ρ^{−2} Un/(2α).

The right-hand side is at most δ if and only if ρ is at least √(Un/(2αδ)). The final statement follows since

√(Un/(2αδ)) ≤ √((B + d)/(2αδn)) + √(ε2^2/(4µαδ)) ≤ (2/√δ) ( √((B + d)/(nµ)) + ε2/µ ),

using √(a + b) ≤ √a + √b and α ≥ µ/8.
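As a quick sanity check of the final bound (not part of the proof), one can compare it with the exact posterior in an assumed conjugate Gaussian setting Pθ = Nd(θ, Id) with prior Nd(0, Id), where the posterior is Nd(nX̄/(n+1), Id/(n+1)). In that model one may take µ = 1, ε1 = 0, ε2 = ||X̄ − θ0||, and ⟨∇ log π(θ), θ − θ0⟩ = −||θ||^2 + ⟨θ, θ0⟩ ≤ ||θ0||^2/4 =: B. All numerical values below are illustrative.

import numpy as np

# Sanity check of Theorem 3.4 in an assumed conjugate Gaussian setting
# P_theta = N_d(theta, I_d), prior N_d(0, I_d); posterior is N(n*Xbar/(n+1), I_d/(n+1)).
rng = np.random.default_rng(5)
d, n, delta = 3, 200, 0.05
theta0 = np.ones(d)

X = theta0 + rng.standard_normal((n, d))
Xbar = X.mean(axis=0)

mu = 1.0
eps2 = np.linalg.norm(Xbar - theta0)
B = np.linalg.norm(theta0) ** 2 / 4        # sup over theta of <grad log pi(theta), theta - theta0>

radius = 2 / np.sqrt(delta) * (np.sqrt((B + d) / (n * mu)) + eps2 / mu)

# Exact posterior mass outside the ball, by Monte Carlo from the conjugate posterior.
post_mean, post_sd = n * Xbar / (n + 1), 1 / np.sqrt(n + 1)
draws = post_mean + post_sd * rng.standard_normal((100_000, d))
mass_outside = np.mean(np.linalg.norm(draws - theta0, axis=1) >= radius)

print(f"radius = {radius:.3f}, posterior mass outside = {mass_outside:.4f} (target <= {delta})")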

We note that the statement of this theorem is slightly different from that in [11]. Most importantly, they add a term of the order log(1/δ)/n, while we have a factor 1/√δ in front. The reason for this seems to be that these authors expand ||θt − θ0||^p for arbitrarily large p, while we considered the second moment only. In most cases, their result is better for fixed n. However, for fixed δ the asymptotic result will be the same. We will see this in the next section.

3.2. Comparison of the two approaches

In this section, we compare the result of Chapter 2 (based on [3]) with that of the previous section (based on [11]). First, we show that under suitable conditions, the final statement of Chapter 3 is stronger than that of Chapter 2. We implicitly assume that d depends on n and is increasing in n.

Theorem 3.5. Suppose that Assumptions 3.2 are satisfied for every δ > 0. Assume that, as n → ∞,

B = O(d),   1/µ = O(1),   ε1 = o(1),   ε2 = O(√(d/n)).   (3.20)

Then

Π( ||θ − θ0|| ≥ Mn √(d/n) | X1, . . . , Xn ) → 0 in Pθ0^n-probability,   (3.21)

for any sequence Mn → ∞.
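The following sketch (an illustration only; it again assumes the conjugate Gaussian setting Pθ = Nd(θ, Id) with prior Nd(0, Id), for which the posterior is available in closed form) shows the behaviour described by (3.21): the posterior mass outside a ball of radius Mn√(d/n) around θ0 shrinks as n grows.

import numpy as np

# Illustration of the sqrt(d/n) contraction rate (3.21) in an assumed conjugate
# Gaussian setting: posterior mass outside a ball of radius M_n * sqrt(d/n) around
# theta_0, computed by Monte Carlo from the exact posterior N(n*Xbar/(n+1), I_d/(n+1)).
rng = np.random.default_rng(6)
d = 5
theta0 = np.zeros(d)

for n, M_n in [(100, 2.0), (1_000, 3.0), (10_000, 4.0)]:
    X = theta0 + rng.standard_normal((n, d))
    Xbar = X.mean(axis=0)
    post_mean, post_sd = n * Xbar / (n + 1), 1 / np.sqrt(n + 1)
    draws = post_mean + post_sd * rng.standard_normal((50_000, d))
    radius = M_n * np.sqrt(d / n)
    outside = np.mean(np.linalg.norm(draws - theta0, axis=1) >= radius)
    print(f"n = {n:6d}, M_n = {M_n}, posterior mass outside radius = {outside:.4f}")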
