
Cover Page

The handle http://hdl.handle.net/1887/49012 holds various files of this Leiden University dissertation.

Author: Gao, F.

Title: Bayes and networks

Issue Date: 2017-05-23


Part I

NONPARAMETRIC BAYESIAN DIRICHLET-LAPLACE DECONVOLUTION


1 POSTERIOR CONTRACTION RATES FOR DECONVOLUTION OF DIRICHLET-LAPLACE MIXTURES

1.1 introduction

Consider statistical inference using the following nonparametric hierarchical Bayesian model for observations $X_1, \dots, X_n$:

(i) A probability distribution $G$ on $\mathbb{R}$ is generated from the Dirichlet process prior $\mathrm{DP}(\alpha)$ with base measure $\alpha$.

(ii) An iid sample $Z_1, \dots, Z_n$ is generated from $G$.

(iii) An iid sample $e_1, \dots, e_n$ is generated from a known density $f$, independent of the other samples.

(iv) The observations are $X_i = Z_i + e_i$, for $i = 1, \dots, n$.

In this setting the conditional density of the data $X_1, \dots, X_n$ given $G$ is a sample from the convolution

$$p_G = f * G$$

of the density $f$ and the measure $G$. The scheme defines a conditional distribution of $G$ given the data $X_1, \dots, X_n$, the posterior distribution of $G$, and consequently also posterior distributions for quantities that derive from $G$, including the convolution density $p_G$. We are interested in whether this posterior distribution can recover a true mixing distribution $G_0$ if the observations $X_1, \dots, X_n$ are in reality a sample from the mixed distribution $p_{G_0}$, for some given probability distribution $G_0$.
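For concreteness, the sampling scheme (i)-(iv) can be simulated directly; the sketch below uses a truncated stick-breaking representation of the Dirichlet process. The truncation level, the concentration mass of $\alpha$, and the uniform base measure on $[-a, a]$ are illustrative assumptions, not prescribed by the model above.

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_dp_laplace(n, alpha_mass=1.0, a=1.0, trunc=500):
        # (i) G ~ DP(alpha) via truncated stick-breaking: G = sum_k w_k delta_{theta_k}
        betas = rng.beta(1.0, alpha_mass, size=trunc)
        w = betas * np.cumprod(np.concatenate(([1.0], 1.0 - betas[:-1])))
        theta = rng.uniform(-a, a, size=trunc)   # atoms drawn from the base measure
        # (ii) Z_i ~ G
        z = theta[rng.choice(trunc, size=n, p=w / w.sum())]
        # (iii)-(iv) Laplace errors e_i with density e^{-|x|}/2, and X_i = Z_i + e_i
        return z + rng.laplace(loc=0.0, scale=1.0, size=n)

    x = sample_dp_laplace(n=1000)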

The main contribution of this chapter is for the case that $f$ is the Laplace density $f(x) = e^{-|x|}/2$. For distributions on the full line, Laplace mixtures seem the second most popular class next to mixtures of the normal distribution, with applications in for instance speech recognition or astronomy ([42]) and clustering problems in genetics ([7]). For the present theoretical investigation the Laplace kernel is interesting as a test case of a non-supersmooth kernel.

We consider two notions of recovery. The first notion measures the distance between the posterior of $G$ and $G_0$ through the Wasserstein metric

$$W_k(G, G') = \inf_{\gamma \in \Gamma(G, G')} \Big( \int |x - y|^k \, d\gamma(x, y) \Big)^{1/k},$$

where $\Gamma(G, G')$ is the collection of all couplings $\gamma$ of $G$ and $G'$ into a bivariate measure with marginals $G$ and $G'$ (i.e. if $(x, y) \sim \gamma$, then $x \sim G$ and $y \sim G'$), and $k \ge 1$. The Wasserstein metric is a classical metric on probability distributions, which is well suited for use in obtaining rates of estimation of measures. It is weaker than the total variation distance (which is more natural as a distance on densities), can be interpreted through transportation of measure (see [76]), and has also been used in applications such as comparing the color histograms of digital images. Recovery of the posterior distribution relative to the Wasserstein metric was considered by [61], within a general mixing framework. We refer to this paper for further motivation of the Wasserstein metric for mixtures, and to [76] for general background on the Wasserstein metric. In this chapter we improve the upper bound on posterior contraction rates given in [61], at least in the case of Laplace mixtures, obtaining a rate of nearly $n^{-1/8}$ for $W_1$ (and slower rates for $k > 1$). Apparently the minimax rate of contraction for Laplace mixtures relative to the Wasserstein metric is currently unknown. Recent work on recovery of a mixing distribution by non-Bayesian methods is given in [80]. It is not clear from our result whether the upper bound $n^{-1/8}$ is sharp.
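As a small illustration (not part of the chapter's argument), $W_1$ between two discrete measures on the line can be computed with scipy, whose wasserstein_distance implements the $k = 1$ case for weighted samples; the point masses below are arbitrary.

    import numpy as np
    from scipy.stats import wasserstein_distance

    # Two discrete measures G = sum_i p_i delta_{x_i} and G' = sum_j q_j delta_{y_j}.
    x, p = np.array([-0.5, 0.0, 0.5]), np.array([0.2, 0.5, 0.3])
    y, q = np.array([-0.4, 0.1, 0.6]), np.array([0.3, 0.4, 0.3])

    # W_1(G, G'); on the real line the optimal coupling is the monotone one.
    print(wasserstein_distance(x, y, u_weights=p, v_weights=q))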

The second notion of recovery measures the distance of the posterior of $G$ to $G_0$ indirectly, through the Hellinger or $L_q$-distances between the mixed densities $p_G$ and $p_{G_0}$. This is equivalent to studying the estimation of the true density $p_{G_0}$ of the observations through the density $p_G$ under the posterior distribution. As the Laplace kernel $f$ has Fourier transform

$$\tilde f(\lambda) = \frac{1}{1 + \lambda^2},$$

it follows that the mixed densities $p_G$ have Fourier transforms satisfying

$$|\tilde p_G(\lambda)| \le \frac{1}{1 + \lambda^2}.$$

Estimation of a density with a polynomially decaying Fourier transform was first considered in [77]. According to their Theorem in Section 3A, a suitable kernel estimator possesses a root mean square error of $n^{-3/8}$ with respect to the $L_2$-norm for estimating a density whose Fourier transform decays exactly at the order 2. This rate is the usual rate $n^{-\alpha/(2\alpha+1)}$ of nonparametric estimation for smoothness $\alpha = 3/2$. This is understandable, as $|\tilde p(\lambda)| \lesssim 1/(1 + |\lambda|^2)$ implies that $\int (1 + |\lambda|^2)^\alpha |\tilde p(\lambda)|^2 \, d\lambda < \infty$ for every $\alpha < 3/2$, so that a density with Fourier transform decaying at square rate belongs to any Sobolev class of regularity $\alpha < 3/2$. Indeed, in [34] the rate $n^{-\alpha/(2\alpha+1)}$ is shown to be minimax for estimating a density in a Sobolev ball of functions on the line. In this chapter we show that the posterior distribution of Laplace mixtures $p_G$ contracts to $p_{G_0}$ at the rate $n^{-3/8}$ up to a logarithmic factor, relative to the $L_2$-norm and the Hellinger distance, and we also establish rates for other $L_q$-metrics. Thus the Dirichlet posterior (nearly) attains the minimax rate for estimating a density in a Sobolev ball of order $3/2$. It may be noted that the Laplace density itself is Hölder of exactly order 1, which implies that Laplace mixtures are Hölder smooth of at least the same order. This insight would suggest a rate of $n^{-1/3}$ (the usual nonparametric rate for $\alpha = 1$), which is slower than $n^{-3/8}$; hence this insight is misleading.

Besides recovery relative to the Wasserstein metric and the induced metrics on $p_G$, one might consider recovery relative to a metric on the distribution function of $G$. Frequentist recovery rates for this problem were obtained in [27] under some restrictions. There is no simple relation between these rates and rates for the other metrics. The same is true for the rates for deconvolution of densities, as in [27]. In fact, the Dirichlet prior and posterior considered here are well known to concentrate on discrete distributions, and hence are useless as priors for recovering a density of $G$.

Contraction rates for Dirichlet mixtures of the normal kernel were considered in [31, 33, 44, 71, 72]. The results in these papers are driven by the smoothness of the Gaussian kernel, whence the same approach fails for the Laplace kernel. Nevertheless we borrow the idea of approximating the true mixed density by a finite mixture, albeit that the approximation is constructed in a different manner. Because more support points than in the Gaussian case are needed to obtain a given quality of approximation, higher entropy and lower prior mass concentration result, leading to a slower rate of posterior contraction. To obtain the contraction rate for the Wasserstein metrics we further derive a relationship of these metrics with a power of the Hellinger distance, and next apply a variant of the contraction theorem in [30], whose proof is included in the appendix of the dissertation. Contraction rates of mixtures with priors other than the Dirichlet were considered in [71]. Recovery of the mixing distribution is a deconvolution problem and as such can be considered an inverse problem. A general approach to posterior contraction rates in inverse problems can be found in [41], and results specific to deconvolution can be found in [24]. These authors are interested in deconvolving a (smooth) mixing density rather than a mixing distribution, and hence their results are not directly comparable to the results in this dissertation.

The papers [28, 49] consider recovery of a mixing density relative to the $L_p$-norm in the frequentist setting. If the smoothness of the mixing density degenerates to $0$, then the minimax rate decreases to a constant and it is not possible to find a consistent estimator. In this chapter we show that in the same problem, but viewed as a deconvolution problem on distributions endowed with the weaker Wasserstein distance, we may obtain polynomial rates for the mixing distribution without any smoothness assumption on the distribution. In particular, for any mixing distribution it is possible to construct a consistent estimator.

The chapter is organized as follows. In the next section we give notation and preliminaries. We state in Section 1.3 the main results of the chapter, which are proved in the subsequent sections. In Section 1.4 we establish suitable finite approximations relative to the $L_q$- and Hellinger distances. The $L_q$-approximations also apply to kernels other than the Laplace kernel, and are in terms of the tail decay of the kernel's characteristic function. In Sections 1.5 and 1.6 we apply these approximations to obtain bounds on the entropy of the mixtures relative to the $L_q$, Hellinger and Wasserstein metrics, and a lower bound on the prior mass in a neighbourhood of the true density. Sections 1.7 and 1.8 contain the proofs of the main results.

1.2 notation and preliminaries

Throughout the chapter, integrals given without limits are integrals over the real line $\mathbb{R}$. The $L_q$-norm is denoted by

$$\|g\|_q = \Big( \int |g(x)|^q \, dx \Big)^{1/q},$$

with $\|\cdot\|_\infty$ being the uniform norm. The Hellinger distance on the space of densities is given by

$$h(f, g) = \Big( \int \big( f^{1/2}(x) - g^{1/2}(x) \big)^2 \, dx \Big)^{1/2}.$$

It is easy to see that $h^2(f, g) \le \|f - g\|_1 \le 2 h(f, g)$, for any two probability densities $f$ and $g$. Furthermore, if the densities $f$ and $g$ are uniformly bounded by a constant $M$, then $\|f - g\|_2 \le 2\sqrt{M}\, h(f, g)$.
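These inequalities can be checked numerically on a grid; the following sketch (an illustration, not part of the original text) does so for two arbitrarily chosen Laplace densities.

    import numpy as np

    x = np.linspace(-20.0, 20.0, 200001)
    dx = x[1] - x[0]
    f = 0.5 * np.exp(-np.abs(x))         # Laplace density
    g = 0.5 * np.exp(-np.abs(x - 1.0))   # shifted Laplace density

    h = np.sqrt(np.sum((np.sqrt(f) - np.sqrt(g)) ** 2) * dx)   # Hellinger distance
    l1 = np.sum(np.abs(f - g)) * dx                            # L1-distance
    l2 = np.sqrt(np.sum((f - g) ** 2) * dx)                    # L2-distance
    M = 0.5                                                    # uniform bound on f and g

    assert h ** 2 <= l1 <= 2 * h
    assert l2 <= 2 * np.sqrt(M) * h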

The Kullback-Leibler discrepancy and the corresponding variance are denoted by

$$K(p_0, p) = \int \log(p_0/p) \, dP_0, \qquad K_2(p_0, p) = \int \big( \log(p_0/p) \big)^2 \, dP_0,$$

with $P_0$ the measure corresponding to the density $p_0$.

We are primarily interested in the Laplace kernel, but a number of results are true for general kernels $f$. The Fourier transform of a function $f$ and the inverse Fourier transform of a function $\tilde f$ are given by

$$\tilde f(\lambda) = \int e^{i\lambda x} f(x) \, dx, \qquad f(x) = \frac{1}{2\pi} \int e^{-i\lambda x} \tilde f(\lambda) \, d\lambda.$$

For $\frac{1}{p} + \frac{1}{q} = 1$ and $1 \le p \le 2$, the Hausdorff-Young inequality gives

$$\|f\|_q \le (2\pi)^{-1/p} \|\tilde f\|_p.$$

The covering number $N(\varepsilon, \Theta, \rho)$ of a metric space $(\Theta, \rho)$ is the minimum number of $\varepsilon$-balls needed to cover the entire space $\Theta$.

Throughout the chapter, $\lesssim$ denotes inequality up to a constant multiple, where the constant is universal or fixed within the context. Furthermore, $a_n \asymp b_n$ means that for some positive constants $c$ and $C$,

$$c \le \liminf_{n\to\infty} a_n/b_n \le \limsup_{n\to\infty} a_n/b_n \le C.$$

We denote by $\mathcal{M}[-a, a]$ the set of all probability measures on a given interval $[-a, a]$.

1.3 main results

Write $\Pi_n(\cdot \mid X_1, \dots, X_n)$ for the posterior distribution of $G$ in the scheme (i)-(iv) introduced at the beginning of the chapter. We study this random distribution assuming that $X_1, \dots, X_n$ are an iid sample from the mixture density $p_{G_0} = f * G_0$, for a given probability distribution $G_0$. We assume that $G_0$ is supported in a compact interval $[-a, a]$, and that the base measure $\alpha$ of the Dirichlet prior in (i) is concentrated on this interval, with a Lebesgue density bounded away from $0$ and $\infty$.

Theorem 1.1. If $G_0$ is supported on $[-a, a]$, $f$ is the Laplace kernel, and $\alpha$ has support $[-a, a]$ with a Lebesgue density bounded away from $0$ and $\infty$, then for every $k \ge 1$ there exists a constant $M$ such that

$$\Pi\big( G : W_k(G, G_0) \ge M n^{-3/(8k+16)} (\log n)^{(k+7/8)/(k+2)} \mid X_1, \dots, X_n \big) \to 0, \tag{1.1}$$

in $P_{G_0}$-probability.

The rate for the Wasserstein metric $W_k$ given in the theorem deteriorates with increasing $k$, which is perhaps not unreasonable as the Wasserstein metrics increase with $k$. The fastest rate is obtained for $W_1$, at $n^{-1/8} (\log n)^{5/8}$.

Theorem 1.2. If $G_0$ is supported on $[-a, a]$, $f$ is the Laplace kernel, and $\alpha$ has support $[-a, a]$ with a Lebesgue density bounded away from $0$ and $\infty$, then there exists a constant $M$ such that

$$\Pi_n\big( G : h(p_G, p_{G_0}) \ge M (\log n/n)^{3/8} \mid X_1, \dots, X_n \big) \to 0, \tag{1.2}$$

in $P_{G_0}$-probability. Furthermore, for every $q \in [2, \infty)$ there exists $M_q$ such that

$$\Pi_n\big( G : \|p_G - p_{G_0}\|_q \ge M_q (\log n/n)^{(q+1)/(q(q+2))} \mid X_1, \dots, X_n \big) \to 0, \tag{1.3}$$

in $P_{G_0}$-probability.

The rate for the $L_q$-distance given in (1.3) deteriorates with increasing $q$. For $q = 2$ it equals the rate $(\log n/n)^{3/8}$ for the Hellinger distance.

In both theorems the mixing distributions are assumed to be supported on a fixed compact set. Without a restriction on the tails of the mixing distributions, no rate is possible. The assumption of compact support ensures that the rate is fully determined by the complexity of the mixtures, and not by their tail behaviour.

1.4 finite approximation

In this section we show that a general mixture $p_G$ can be approximated by a mixture with finitely many components, where the number of components depends on the accuracy of the approximation, the distance used, and the kernel $f$. We first consider approximations with respect to the $L_q$-norm, which apply to mixtures $p_G = f * G$ for a general kernel $f$, and next approximations with respect to the Hellinger distance for the case that $f$ is the Laplace kernel. The first result generalizes a result of [31] for normal mixtures. See also [71] for results on Dirichlet mixtures of exponential power densities.

The result splits into two cases, depending on the tail behaviour of the Fourier transform $\tilde f$ of $f$:

- ordinary smooth $f$: $\limsup_{|\lambda| \to \infty} |\tilde f(\lambda)|\, |\lambda|^\beta < \infty$, for some $\beta > 1/2$;

- supersmooth $f$: $\limsup_{|\lambda| \to \infty} |\tilde f(\lambda)|\, e^{|\lambda|^\beta} < \infty$, for some $\beta > 0$.

Lemma 1.3 (Approximation Lemma). Let $\varepsilon < 1$ be sufficiently small and fixed. For a probability measure $G$ on an interval $[-a, a]$ and $2 \le q \le \infty$, there exists a discrete measure $G'$ on $[-a, a]$ with at most $N$ support points in $[-a, a]$ such that

$$\|p_G - p_{G'}\|_q \lesssim \varepsilon,$$

where

(i) $N \lesssim \varepsilon^{-1/(\beta - p^{-1})}$ if $f$ is ordinary smooth of order $\beta$ with $\beta > p^{-1}$, for $p$ and $q$ conjugate ($p^{-1} + q^{-1} = 1$);

(ii) $N \lesssim (\log \varepsilon^{-1})^{\max(1, \beta^{-1})}$ if $f$ is supersmooth of order $\beta$.

Proof. The Fourier transform of $p_G$ is given by $\tilde f \tilde G$, where $\tilde G$ is the Fourier transform of $G$ defined by $\tilde G(\lambda) = \int e^{i\lambda z} \, dG(z)$. Determine $G'$ so that it possesses the same moments as $G$ up to order $k - 1$, i.e.

$$\int z^j \, d(G - G')(z) = 0, \qquad \forall\, 0 \le j \le k - 1.$$

By Lemma A.1 in [31], $G'$ can be chosen to have at most $k$ support points. Then, for $G$ and $G'$ supported on $[-a, a]$, we have

$$\big| \tilde G(\lambda) - \tilde G'(\lambda) \big| = \Big| \int \Big( e^{i\lambda z} - \sum_{j=0}^{k-1} \frac{(i\lambda z)^j}{j!} \Big) \, d(G - G')(z) \Big| \le \int \frac{|i\lambda z|^k}{k!} \, d(G + G')(z) \le \Big( \frac{ae|\lambda|}{k} \Big)^k.$$

The inequality comes from $\big| e^{iy} - \sum_{j=0}^{k-1} (iy)^j/j! \big| \le |y|^k/k! \le (e|y|)^k/k^k$, for every $y \in \mathbb{R}$.

Therefore, by the Hausdorff-Young inequality,

$$\|p_G - p_{G'}\|_q^p \le \frac{1}{2\pi} \int |\tilde f(\lambda)|^p\, |\tilde G(\lambda) - \tilde G'(\lambda)|^p \, d\lambda \lesssim \int_{|\lambda| > M} |\tilde f(\lambda)|^p \, d\lambda + \int_{|\lambda| \le M} \Big( \frac{ea|\lambda|}{k} \Big)^{pk} d\lambda.$$

We denote the first term in the preceding display by $I_1$ and the second term by $I_2$. It is easy to bound $I_2$:

$$I_2 \asymp \Big( \frac{ea}{k} \Big)^{kp} \frac{M^{kp+1}}{kp+1} \lesssim \Big( \frac{eaM}{k} \Big)^{kp+1} \frac{1}{p}.$$

For $I_1$ we separately consider the cases of ordinary smoothness and supersmoothness.

In the supersmooth case with parameter $\beta$, we note that the function $t^{\beta^{-1}-1}/e^{\delta t}$ is monotonically decreasing for $t \ge pM^\beta$ when $\delta \ge (\beta^{-1} - 1)/(pM^\beta)$. Thus, for large $M$,

$$I_1 \lesssim \int_{|\lambda| > M} e^{-p|\lambda|^\beta} \, d\lambda = \frac{2}{\beta p^{\beta^{-1}}} \int_{t > pM^\beta} e^{-t} t^{\beta^{-1}-1} \, dt \le \frac{2}{\beta p^{\beta^{-1}}} \frac{(pM^\beta)^{\beta^{-1}-1}}{e^{\delta p M^\beta}} \int_{t > pM^\beta} e^{-(1-\delta)t} \, dt = \frac{2}{1-\delta} \frac{1}{\beta p} e^{-pM^\beta} M^{1-\beta},$$

where the bound is sharper if $\delta$ is smaller. Choosing the minimal value of $\delta$, we obtain

$$I_1 \lesssim \frac{1}{1 - (\beta^{-1}-1)/(pM^\beta)} \frac{1}{\beta p} e^{-pM^\beta} M^{1-\beta} \lesssim M^{1-\beta} e^{-pM^\beta},$$

for $M$ sufficiently large. We next choose $M = 2(\log(1/\varepsilon))^{1/\beta}$ in order to ensure that $I_1 \le \varepsilon^p$. Then $I_2 \lesssim \varepsilon^p$ if $k \ge 2eaM$ and $2^{-kp} \le \varepsilon^p$. This is satisfied if $k = 2(\log \varepsilon^{-1})^{\max(\beta^{-1}, 1)}$.

In the ordinary smooth case with smoothness parameter $\beta$, we have the bound

$$I_1 \lesssim \int_{|\lambda| > M} |\lambda|^{-\beta p} \, d\lambda \lesssim \Big( \frac{1}{M} \Big)^{\beta p - 1}.$$

We choose $M = \varepsilon^{-1/(\beta - 1/p)}$ to render the right side equal to $\varepsilon^p$. Then $I_2 \lesssim \varepsilon^p$ if $k = 2\varepsilon^{-1/(\beta - 1/p)}$.
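The moment-matching device in the proof can be made concrete in a special case (an illustration only; Lemma A.1 of [31] covers general $G$): for $G$ the uniform distribution on $[-a, a]$, a $k$-point Gauss-Legendre rule yields a discrete $G'$ whose moments agree with those of $G$ up to order $2k - 1$.

    import numpy as np

    def moment_matching_uniform(a, k):
        # k atoms and weights matching the moments of Uniform[-a, a] up to order 2k-1
        nodes, weights = np.polynomial.legendre.leggauss(k)
        return a * nodes, weights / 2.0   # rescale atoms to [-a, a]; weights sum to 1

    a, k = 1.0, 5
    atoms, w = moment_matching_uniform(a, k)
    for j in range(2 * k):
        m_G = a**j / (j + 1) if j % 2 == 0 else 0.0   # j-th moment of Uniform[-a, a]
        m_Gp = np.sum(w * atoms**j)                   # j-th moment of G'
        assert abs(m_G - m_Gp) < 1e-12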

The number of support points in the preceding lemma is increasing in $q$ and decreasing in $\beta$. For approximation in the $L_2$-norm ($q = 2$), the number of support points is of order $\varepsilon^{-1/(\beta - 1/2)}$, which reduces to $\varepsilon^{-2/3}$ for the Laplace kernel (ordinary smooth with $\beta = 2$). An interpretation of the exponent $\beta - 1/2$ is the (almost) Sobolev smoothness of $p_G$, since, for $\alpha < \beta - 1/2$,

$$\int (1 + |\lambda|^2)^\alpha |\tilde p_G(\lambda)|^2 \, d\lambda \lesssim \int (1 + |\lambda|^2)^\alpha |\tilde f(\lambda)|^2 \, d\lambda < \infty.$$

We do not have a compelling intuition for this correspondence.

The Hellinger distance is more sensitive to areas where the densities are close to zero. As a consequence, the approach of the preceding lemma does not give sharp results. The following lemma does, but it is restricted to the Laplace kernel.

Lemma 1.4. For a probability measure $G$ supported on $[-a, a]$ there exists a discrete measure $G'$ with at most $N \asymp \varepsilon^{-2/3}$ support points such that, for $p_G = f * G$ and $f$ the Laplace density,

$$h(p_G, p_{G'}) \le \varepsilon.$$

Proof. Since $p_G(x) \ge f(|x| + a) = e^{-a} e^{-|x|}/2$ for every $x$ and every probability measure $G$ supported on $[-a, a]$, the Hellinger distance between Laplace mixtures satisfies

$$h^2(p_G, p_{G'}) \le \int \frac{(p_G - p_{G'})^2}{p_G + p_{G'}}(x) \, dx \le e^a \int \big( p_G(x) - p_{G'}(x) \big)^2 e^{|x|} \, dx.$$

If we write $q_G(x) = p_G(x) e^{|x|/2}$, and $\tilde q_G$ for the corresponding Fourier transform, then by Plancherel's theorem the integral on the right side equals

$$\frac{1}{2\pi} \int |\tilde q_G - \tilde q_{G'}|^2(\lambda) \, d\lambda.$$

By an explicit computation we obtain

$$\tilde q_G(\lambda) = \frac{1}{2} \int\!\!\int e^{i\lambda x} e^{-|x - z| + |x|/2} \, dx \, dG(z) = \frac{1}{2} \int r(\lambda, z) \, dG(z),$$

where $r(\lambda, z)$ is given by

$$r(\lambda, z) = \frac{e^{-z}}{i\lambda + 1/2} + e^{-z} \frac{e^{(i\lambda + 3/2)z} - 1}{i\lambda + 3/2} - \frac{e^{(i\lambda + 1/2)z}}{i\lambda - 1/2} = \frac{e^{-z}}{(i\lambda + 1/2)(i\lambda + 3/2)} - \frac{2 e^{i\lambda z} e^{z/2}}{(i\lambda + 3/2)(i\lambda - 1/2)}. \tag{1.4}$$

Now let $G'$ be a discrete measure on $[-a, a]$ such that

$$\int e^{-z} \, d(G' - G)(z) = 0, \qquad \int e^{z/2} z^j \, d(G' - G)(z) = 0, \quad \forall\, 0 \le j \le k - 1.$$

By Lemma A.1 in [31], $G'$ can be chosen to have at most $k + 1$ support points.

By the choice of $G'$, the first term of $r(\lambda, z)$ gives no contribution to the difference $\int r(\lambda, z) \, d(G' - G)(z)$. As the second term of $r(\lambda, z)$ is, for large $|\lambda|$, bounded in absolute value by a multiple of $|\lambda|^{-2}$, it follows that

$$I_2 := \int_{|\lambda| > M} \Big| \int r(\lambda, z) \, d(G' - G)(z) \Big|^2 \, d\lambda \lesssim \int_{\lambda > M} \lambda^{-4} \, d\lambda \asymp M^{-3}.$$

Again by the choice of $G'$, the integral $\int r(\lambda, z) \, d(G' - G)(z)$ remains the same if we replace $e^{i\lambda z}$ by $e^{i\lambda z} - \sum_{j=0}^{k} (i\lambda z)^j/j!$ in the second term of $r(\lambda, z)$. It follows that

$$I_1 := \int_{|\lambda| \le M} \Big| \int r(\lambda, z) \, d(G' - G)(z) \Big|^2 d\lambda \le \int_{|\lambda| \le M} \Big| \frac{2}{(i\lambda + 1/2)(i\lambda + 3/2)} \Big|^2 \Big| \int e^{z/2} \Big[ e^{i\lambda z} - \sum_{j=0}^{k} \frac{(i\lambda z)^j}{j!} \Big] d(G' - G)(z) \Big|^2 d\lambda \lesssim \int_0^M \frac{(a\lambda)^{2k}}{(k!)^2} \, d\lambda \lesssim \frac{(aeM)^{2k+1}}{k^{2k+1}}.$$

It follows, by an argument similar to that in the proof of Lemma 1.3, that we can reduce both $I_1$ and $I_2$ to $\varepsilon^2$ by choosing $M \asymp \varepsilon^{-2/3}$ and $k = 2aeM$.

1.5 entropy

We study the covering numbers of the class of mixtures $p_G = f * G$, where $G$ ranges over the collection $\mathcal{M}[-a, a]$ of all probability measures on $[-a, a]$. We present a bound for any $L_r$-norm and general kernels $f$, and a bound for the Hellinger distance that is specific to the Laplace kernel. Here $f^{(1)}$ denotes the first derivative of $f$.

Proposition 1.5. If both $\|f\|_r$ and $\|f^{(1)}\|_r$ are finite and $\tilde f$ has ordinary smoothness $\beta$, then, for $p_G = f * G$ and any $r \ge 2$,

$$\log N\big( \varepsilon, \{p_G : G \in \mathcal{M}[-a, a]\}, \|\cdot\|_r \big) \lesssim \varepsilon^{-1/(\beta - 1 + 1/r)} \log \varepsilon^{-1}. \tag{1.5}$$

Proof. Consider an $\varepsilon$-net of $\mathcal{G}_a = \{p_G : G \in \mathcal{M}[-a, a]\}$ given by the collection of all $p_G$ whose mixing measure $G \in \mathcal{M}[-a, a]$ is discrete with at most $N \le D \varepsilon^{-1/(\beta - 1 + 1/r)}$ support points, for a suitable constant $D$. In light of Lemma 1.3, the set of all such discrete mixtures forms an $\varepsilon$-net over the set of all mixtures $p_G$ as in the lemma. It suffices to construct an $\varepsilon$-net of the given cardinality over this set of discrete mixtures.

By Jensen's inequality and Fubini's theorem,

$$\|f(\cdot - \theta) - f\|_r = \Big( \int \Big| \theta \int_0^1 f^{(1)}(x - \theta s) \, ds \Big|^r dx \Big)^{1/r} \le \|f^{(1)}\|_r\, |\theta|.$$

Furthermore, for any probability vectors $p$ and $p'$ and locations $\theta_i$,

$$\Big\| \sum_{i=1}^{N} p_i f(\cdot - \theta_i) - \sum_{i=1}^{N} p_i' f(\cdot - \theta_i) \Big\|_r \le \sum_{i=1}^{N} |p_i - p_i'|\, \|f(\cdot - \theta_i)\|_r = \|f\|_r \|p - p'\|_1.$$

Combining these inequalities, we see that for two discrete probability measures $G = \sum_{i=1}^N p_i \delta_{\theta_i}$ and $G' = \sum_{i=1}^N p_i' \delta_{\theta_i'}$,

$$\|p_G - p_{G'}\|_r \le \|f^{(1)}\|_r \max_i |\theta_i - \theta_i'| + \|f\|_r \|p - p'\|_1. \tag{1.6}$$

Thus we can construct an $\varepsilon$-net over the discrete mixtures by relocating the support points $(\theta_i)_{i=1}^N$ to the nearest points $(\theta_i')_{i=1}^N$ in an $\varepsilon$-net on $[-a, a]$, and relocating the weights $p$ to the nearest point $p'$ in an $\varepsilon$-net for the $\ell_1$-norm over the $N$-dimensional $\ell_1$-unit simplex. This gives a set of at most

$$\Big( \frac{2a}{\varepsilon} \Big)^N \Big( \frac{5}{\varepsilon} \Big)^N \sim \Big( \frac{10a}{\varepsilon^2} \Big)^N$$

measures $p_{G'}$ (cf. Lemma A.4 of [33] for the entropy of the $\ell_1$-unit simplex). This gives the bound of the proposition.

Proposition 1.6. For $f$ the Laplace kernel and $p_G = f * G$,

$$\log N\big( \varepsilon, \{p_G : G \in \mathcal{M}[-a, a]\}, h \big) \lesssim \varepsilon^{-2/3} \log(\varepsilon^{-1}). \tag{1.7}$$

Proof. Since the function $\sqrt{f}$ is absolutely continuous with derivative $x \mapsto -2^{-3/2} e^{-|x|/2} \operatorname{sgn}(x)$, we have by Jensen's inequality and Fubini's theorem that

$$h^2\big( f, f(\cdot - \theta) \big) = \int \Big( \theta \int_0^1 -2^{-3/2} e^{-|x - \theta s|/2} \operatorname{sgn}(x - \theta s) \, ds \Big)^2 dx \le \theta^2 \int_0^1\!\! \int e^{-|x - \theta s|} \, dx \, ds = 2\theta^2.$$

It follows that $h(f, f(\cdot - \theta)) \lesssim |\theta|$.

By convexity of the map $(u, v) \mapsto (\sqrt{u} - \sqrt{v})^2$, we have

$$\Big| \sqrt{\textstyle\sum_i p_i f(\cdot - \theta_i)} - \sqrt{\textstyle\sum_i p_i f(\cdot - \theta_i')} \Big|^2 \le \sum_i p_i \Big[ \sqrt{f(\cdot - \theta_i)} - \sqrt{f(\cdot - \theta_i')} \Big]^2.$$

By integrating this inequality we see that the densities $p_G$ and $p_{G'}$ with mixing distributions $G = \sum_{i=1}^N p_i \delta_{\theta_i}$ and $G' = \sum_{i=1}^N p_i \delta_{\theta_i'}$ satisfy

$$h^2(p_G, p_{G'}) \lesssim \sum_i p_i |\theta_i - \theta_i'|^2 \le \|\theta - \theta'\|_\infty^2.$$

For distributions $G = \sum_{i=1}^N p_i \delta_{\theta_i}$ and $G' = \sum_{i=1}^N p_i' \delta_{\theta_i}$ with the same support points but different weights, we have

$$h^2(p_G, p_{G'}) \le \int \frac{\big( \sum_{i=1}^N (p_i - p_i') f(x - \theta_i) \big)^2}{\sum_{i=1}^N (p_i + p_i') f(x - \theta_i)} \, dx \le \int \Big( \sum_{i=1}^N |p_i - p_i'| \Big)^2 \frac{f^2(|x| - a)}{2 f(|x| + a)} \, dx \lesssim \|p - p'\|_1^2.$$

Therefore the bound follows by arguments similar to those in the proof of Proposition 1.5, where now we use Lemma 1.4 to determine suitable finite approximations.

The map $G \mapsto p_G = f * G$ is one-to-one as soon as the characteristic function of $f$ is never zero. Under this condition we can also view the Wasserstein distance on the mixing distribution as a distance on the mixtures. Obviously the covering numbers are then free of the kernel.

Proposition 1.7. For any $k \ge 1$, and any sufficiently small $\varepsilon > 0$,

$$\log N(\varepsilon, \mathcal{M}[-a, a], W_k) \lesssim \varepsilon^{-1} \log \varepsilon^{-1}. \tag{1.8}$$

The proposition is a consequence of Lemma 1.9 below, which applies to the set of all Borel probability measures on a general metric space $(\Theta, \rho)$ (cf. [61]).

Lemma 1.8. For any probability measure $G$ concentrated on countably many disjoint sets $\Theta_1, \Theta_2, \dots$ and probability measure $G'$ concentrated on disjoint sets $\Theta_1', \Theta_2', \dots$,

$$W_k(G, G') \le \sup_i \sup_{\theta_i \in \Theta_i,\, \theta_i' \in \Theta_i'} \rho(\theta_i, \theta_i') + \operatorname{diam}(\Theta) \Big( \sum_i |G(\Theta_i) - G'(\Theta_i')| \Big)^{1/k}.$$

In particular,

$$W_k\Big( \sum_i p_i \delta_{\theta_i}, \sum_i p_i' \delta_{\theta_i'} \Big) \le \max_i \rho(\theta_i, \theta_i') + \operatorname{diam}(\Theta)\, \|p - p'\|_1^{1/k}.$$

Proof. For $p_i = G(\Theta_i)$ and $p_i' = G'(\Theta_i')$, divide the interval $[0, \sum_i p_i \wedge p_i']$ into disjoint intervals $I_i$ of lengths $p_i \wedge p_i'$. We couple variables $\bar\theta$ and $\bar\theta'$ by an auxiliary uniform variable $U$. If $U \in I_i$, then generate $\bar\theta \sim G(\cdot \mid \Theta_i)$ and $\bar\theta' \sim G'(\cdot \mid \Theta_i')$. Divide the remaining interval $[\sum_i p_i \wedge p_i', 1]$ into intervals $J_i$ of lengths $p_i - p_i \wedge p_i'$ and, separately, into intervals $J_i'$ of lengths $p_i' - p_i \wedge p_i'$. If $U \in J_i$, then generate $\bar\theta \sim G(\cdot \mid \Theta_i)$, and if $U \in J_i'$, then generate $\bar\theta' \sim G'(\cdot \mid \Theta_i')$. Then $\bar\theta$ and $\bar\theta'$ have marginal distributions $G$ and $G'$, and

$$\mathbb{E}\, \rho^k(\bar\theta, \bar\theta') \le \mathbb{E}\big[ \rho^k(\bar\theta, \bar\theta')\, 1_{U \le \sum_i p_i \wedge p_i'} \big] + \operatorname{diam}(\Theta)^k\, \mathbb{P}\Big( U > \sum_i p_i \wedge p_i' \Big).$$

The first term is bounded by the $k$-th power of the first term in the bound of the lemma, while the probability in the second term is equal to $1 - \sum_i p_i \wedge p_i' = \sum_i |p_i - p_i'|/2$.

Lemma 1.9. For the set $\mathcal{M}(\Theta)$ of all Borel probability measures on a metric space $(\Theta, \rho)$, any $k \ge 1$, and $0 < \varepsilon < \min\{2/3, \operatorname{diam}(\Theta)\}$,

$$N(\varepsilon, \mathcal{M}(\Theta), W_k) \le \Big( \frac{4 \operatorname{diam}(\Theta)}{\varepsilon} \Big)^{k N(\varepsilon, \Theta, \rho)}.$$

Proof. For a minimal $\varepsilon$-net over $\Theta$ of $N = N(\varepsilon, \Theta, \rho)$ points, let $\Theta = \cup_i \Theta_i$ be the partition obtained by assigning each $\theta$ to a closest point of the net. For any $G$ let $G_\varepsilon = \sum_i G(\Theta_i) \delta_{\theta_i}$, for arbitrary but fixed $\theta_i \in \Theta_i$. Since $W_k(G, G_\varepsilon) \le \varepsilon$ by Lemma 1.8, $N(2\varepsilon, \mathcal{M}(\Theta), W_k) \le N(\varepsilon, \mathcal{M}_\varepsilon, W_k)$ holds for $\mathcal{M}_\varepsilon$ the set of all $G_\varepsilon$. We next form the measures $G_{\varepsilon, p} = \sum_i p_i \delta_{\theta_i}$ for $(p_1, \dots, p_N)$ ranging over an $(\varepsilon/\operatorname{diam}(\Theta))^k$-net for the $\ell_1$-distance over the $N$-dimensional unit simplex. By Lemma 1.8 every $G_\varepsilon$ is within $W_k$-distance $\varepsilon$ of some $G_{\varepsilon, p}$. Therefore the proof is completed, because $N(\varepsilon, \mathcal{M}_\varepsilon, W_k)$ is bounded from above by the number of points $p$, which is bounded by $(4 \operatorname{diam}(\Theta)/\varepsilon)^{kN}$ (cf. Lemma A.4 in [31]).

1.6 prior mass

The main result of this section is the following proposition, which gives a lower bound on the prior mass that the prior (i)-(iv) puts in a neighbourhood of a mixture $p_{G_0}$.

Proposition 1.10. If $\Pi$ is the Dirichlet process $\mathrm{DP}(\alpha)$ with base measure $\alpha$ that has a Lebesgue density bounded away from $0$ and $\infty$ on its support $[-a, a]$, and $f$ is the Laplace kernel, then for every sufficiently small $\varepsilon > 0$ and every probability measure $G_0$ on $[-a, a]$,

$$\log \Pi\big( G : K(p_{G_0}, p_G) \le \varepsilon^2,\ K_2(p_{G_0}, p_G) \le \varepsilon^2 \big) \gtrsim -\Big( \frac{1}{\varepsilon} \Big)^{2/3} \log\Big( \frac{1}{\varepsilon} \Big).$$

Proof. By Lemma 1.4 there exists a discrete measure $G_1$ with $N \lesssim \varepsilon^{-2/3}$ support points such that $h(p_{G_0}, p_{G_1}) \le \varepsilon$. We may assume that the support points of $G_1$ are at least $2\varepsilon^2$-separated. If not, we take a maximal $2\varepsilon^2$-separated set among the support points of $G_1$, and replace $G_1$ by the discrete measure obtained by relocating the masses of $G_1$ to the nearest points of this $2\varepsilon^2$-net; the Hellinger distance between the two corresponding mixtures is then $\lesssim \varepsilon^2$, as seen in the proof of Proposition 1.6.

Now by Lemmas 1.11 and 1.12, if $G_1 = \sum_{j=1}^N p_j \delta_{z_j}$, with the support points $z_j$ at least $2\varepsilon^2$-separated,

$$\begin{aligned} \{G : \max(K, K_2)(p_{G_0}, p_G) < d_1 \varepsilon^2\} &\supset \{G : h(p_{G_0}, p_G) \le 2\varepsilon\} \\ &\supset \{G : h(p_{G_1}, p_G) \le \varepsilon\} \\ &\supset \{G : \|p_G - p_{G_1}\|_1 \le d_2 \varepsilon^2\} \\ &\supset \Big\{ G : \sum_{j=1}^N \big| G[z_j - \varepsilon^2, z_j + \varepsilon^2] - p_j \big| \le \varepsilon^2 \Big\}. \end{aligned}$$

Since the base measure $\alpha$ has density bounded away from zero and infinity on $[-a, a]$ by assumption, Lemma A.2 of [31] gives

$$\log \Pi\Big( G : \sum_{j=1}^N \big| G[z_j - \varepsilon^2, z_j + \varepsilon^2] - p_j \big| \le \varepsilon^2 \Big) \gtrsim -N \log \varepsilon^{-1}.$$

The proposition follows upon combining the preceding displays.

Lemma 1.11. If $G' = \sum_{j=1}^N p_j \delta_{z_j}$ is a probability measure supported on points $z_1, \dots, z_N$ in $\mathbb{R}$ with $|z_j - z_k| > 2\varepsilon$ for $j \ne k$, then for any probability measure $G$ on $\mathbb{R}$ and any kernel $f$ with derivative $f^{(1)}$,

$$\|p_G - p_{G'}\|_1 \le 2 \|f^{(1)}\|_1\, \varepsilon + 2 \sum_{j=1}^N \big| G[z_j - \varepsilon, z_j + \varepsilon] - p_j \big|.$$

Lemma 1.12. If $G$ and $G'$ are probability measures on $[-a, a]$, and $f$ is the Laplace kernel, then

$$h^2(p_G, p_{G'}) \lesssim \|p_G - p_{G'}\|_2, \tag{1.9}$$

$$\max\big( K(p_G, p_{G'}), K_2(p_G, p_{G'}) \big) \lesssim h^2(p_G, p_{G'}). \tag{1.10}$$

Proofs. The first lemma is a generalization of Lemma 4 in [33] from normal to general kernels, and is proved in the same manner. We omit further details.

In view of the shape of the Laplace kernel, it is easy to see that, for $G$ compactly supported on $[-a, a]$,

$$f(|x| + a) \le p_G(x) \le f(|x| - a).$$

We bound the squared Hellinger distance as follows:

$$h^2(p_G, p_{G'}) \le \int \frac{(p_G - p_{G'})^2}{p_G + p_{G'}} \, dx \le \int_{|x| \le A} e^{A + a} (p_G - p_{G'})^2 \, dx + \int_{|x| > A} (p_G + p_{G'}) \, dx \lesssim e^a \|p_G - p_{G'}\|_2^2\, e^A + e^{-A}.$$

By the elementary inequality $t + u/t \ge 2\sqrt{u}$, for $u, t > 0$, we obtain (1.9) upon choosing $A = \min\big( a, \log \|p_G - p_{G'}\|_2^{-1} - a/2 \big)$.

For the proof of the second assertion we first note that, if both $G$ and $G'$ are compactly supported on $[-a, a]$,

$$\frac{p_G(x)}{p_{G'}(x)} \le \frac{f(|x| - a)}{f(|x| + a)} \le e^{2a}.$$

Therefore $\|p_G / p_{G'}\|_\infty \le e^{2a}$, and (1.10) follows by Lemma 8 in [33].

1.7 proof of theorem 1.1

The basic theorem of [30] gives a posterior contraction rate in terms of a metric on densities that is bounded above by the Hellinger distance. In the present situation, for the proofs of Theorems 1.1 and 1.2, we would like to apply this result to a power smaller than one of the Wasserstein metric and a power smaller than one of the $L_q$-distance, respectively, neither of which is a metric.

Consider a general "discrepancy measure" $d$, which is a map $d : \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ on the product of the set of densities on a given measurable space with itself, with the following properties, for some constant $C > 0$:

(a) $d(x, y) \ge 0$;

(b) $d(x, y) = 0$ if and only if $x = y$;

(c) $d(x, y) = d(y, x)$;

(d) $d(x, y) \le C (d(x, z) + d(y, z))$.

Thus $d$ is a metric except that the triangle inequality is replaced with a weaker condition that incorporates a constant $C$, possibly bigger than $1$. Call a set of the form $\{x : d(x, y) < c\}$ a $d$-ball, and define covering numbers $N(\varepsilon, \mathcal{P}, d)$ relative to $d$ as usual.

Let $\Pi_n(\cdot \mid X_1, \dots, X_n)$ be the posterior distribution of $p$ given an i.i.d. sample $X_1, \dots, X_n$ from a density $p$ that is equipped with a prior probability distribution $\Pi$.

Theorem 1.13. Suppose $d$ has the properties given above, satisfies $d(p_0, p) \le h(p_0, p)$ for every $p \in \mathcal{P}$, and the sets $\{p : d(p, p') < \delta\}$ are convex. Then $\Pi_n(d(p, p_0) > M \varepsilon_n \mid X_1, \dots, X_n) \to 0$ in $P_0^n$-probability, for any $\varepsilon_n$ such that $n \varepsilon_n^2 \to \infty$ and such that, for positive constants $c_1$, $c_2$ and sets $\mathcal{P}_n \subset \mathcal{P}$,

$$\log N(\varepsilon_n, \mathcal{P}_n, d) \le c_1 n \varepsilon_n^2, \tag{1.11}$$

$$\Pi_n\big( p : K(p_0, p) < \varepsilon_n^2,\ K_2(p_0, p) < \varepsilon_n^2 \big) \ge e^{-c_2 n \varepsilon_n^2}, \tag{1.12}$$

$$\Pi_n(\mathcal{P} - \mathcal{P}_n) \le e^{-(c_2 + 4) n \varepsilon_n^2}. \tag{1.13}$$

We defer the proof until Appendix A.

The proof of Theorem 1.1 is based on the following comparison between the Wasserstein and Hellinger metrics. The lemma improves and generalizes Theorem 2 in [61]. We choose the constant $C_k$ carefully, to make sure that the map $\varepsilon \mapsto \varepsilon [\log(C_k/\varepsilon)]^{k+1/2}$ is monotone on $(0, 2]$.

Lemma 1.14. For probability measures $G$ and $G'$ supported on $[-a, a]$, and $p_G = f * G$ for a probability density $f$ with $\inf_\lambda (1 + |\lambda|^\beta) |\tilde f(\lambda)| > 0$, and any $k \ge 1$,

$$W_k(G, G') \lesssim h(p_G, p_{G'})^{1/(k+\beta)} \Big( \log \frac{C_k}{h(p_G, p_{G'})} \Big)^{(k+1/2)/(k+\beta)}.$$

Proof. By Theorem 6.15 in [76] the Wasserstein distance $W_k(G, G')$ is bounded above by a multiple of the $k$-th root of $\int |x|^k \, d|G - G'|(x)$, where $|G - G'|$ is the total variation measure of the difference $G - G'$. We apply this to the convolutions of $G$ and $G'$ with the normal distribution $\Phi_\delta$ with mean $0$ and variance $\delta^2$, to find, for every $M > 0$,

$$\begin{aligned} W_k(G * \Phi_\delta, G' * \Phi_\delta)^k &\lesssim \int |x|^k \big| (G - G') * \phi_\delta(x) \big| \, dx \\ &\le \Big( \int_{-M}^M x^{2k} \, dx \int_{-M}^M \big| (G - G') * \phi_\delta(x) \big|^2 dx \Big)^{1/2} + e^{-M} \int_{|x| > M} |x|^k e^{|x|} \big| (G - G') * \phi_\delta(x) \big| \, dx \\ &\lesssim M^{k+1/2} \big\| (G - G') * \phi_\delta \big\|_2 + e^{-M} e^{2|a|}\, \mathbb{E} e^{2|\delta Z|}, \end{aligned}$$

where $Z$ is a standard normal variable. The number $K_\delta := e^{2|a|} \mathbb{E} e^{2|\delta Z|}$ is uniformly bounded if $\delta \le \delta_k$, for some fixed $\delta_k$. By Plancherel's theorem,

$$\begin{aligned} \big\| (G - G') * \phi_\delta \big\|_2^2 &= \int |\tilde G - \tilde G'|^2(\lambda)\, \tilde\phi_\delta^2(\lambda) \, d\lambda = \int \big| \tilde f (\tilde G - \tilde G') \big|^2(\lambda)\, \frac{\tilde\phi_\delta^2}{|\tilde f|^2}(\lambda) \, d\lambda \\ &\lesssim \|p_G - p_{G'}\|_2^2 \sup_\lambda \frac{\tilde\phi_\delta^2}{|\tilde f|^2}(\lambda) \lesssim h^2(p_G, p_{G'})\, \delta^{-2\beta}, \end{aligned}$$

where we have again applied Plancherel's theorem, used that the $L_2$-metric on uniformly bounded densities is bounded by the Hellinger distance, and used the assumption on the Fourier transform of $f$, which shows that $(\tilde\phi_\delta / |\tilde f|)(\lambda) \lesssim (1 + |\lambda|^\beta) e^{-\delta^2 \lambda^2/2} \lesssim \delta^{-\beta}$.

If $U \sim G$ is independent of $Z \sim N(0, 1)$, then $(U, U + \delta Z)$ gives a coupling of $G$ and $G * \Phi_\delta$. Therefore the definition of the Wasserstein metric gives that $W_k(G, G * \Phi_\delta)^k \le \mathbb{E} |\delta Z|^k \lesssim \delta^k$.

Combining the preceding inequalities with the triangle inequality, we see that, for $\delta \in (0, \delta_k]$ and any $M > 0$,

$$W_k(G, G')^k \lesssim M^{k+1/2}\, h(p_G, p_{G'})\, \delta^{-\beta} + e^{-M} + \delta^k.$$

The lemma follows by optimizing this over $M$ and $\delta$. Specifically, for $\varepsilon = h(p_G, p_{G'})$, the choices $M = k/(k+\beta) \log(C_k/\varepsilon)$ and $\delta = (M^{k+1/2} \varepsilon)^{1/(k+\beta)}$ are eligible for

$$\delta_k = \sup_{\varepsilon \in (0, 2]} \Big[ \frac{k}{k+\beta} \log \frac{C_k}{\varepsilon} \Big]^{(k+1/2)/(k+\beta)} \varepsilon^{1/(k+\beta)},$$

which is indeed a finite number. In fact the supremum is attained at $\varepsilon = 2$, by the assumption on $C_k$.


For the Laplace kernel $f$ we choose $\beta = 2$ in the preceding lemma, and then obtain that $d(p_G, p_{G'}) \le h(p_G, p_{G'})$ for the discrepancy $d = \gamma^{-1}(W_k)$, where $\gamma(\varepsilon) = D_k \varepsilon^{1/(k+\beta)} [\log(C_k/\varepsilon)]^{(k+1/2)/(k+\beta)}$ is a multiple of the (monotone) transformation on the right side of the preceding lemma. For small values of $W_k(G_1, G_2)$ we have

$$d(p_{G_1}, p_{G_2}) \asymp W_k^{k+2}(G_1, G_2) \Big( \log \frac{1}{W_k(G_1, G_2)} \Big)^{-k-1/2}. \tag{1.14}$$

As $k + 2 > 1$, the discrepancy $d$ may not satisfy the triangle inequality, but it does possess the properties (a)-(d) listed before Theorem 1.13. The balls of the discrepancy $d$ are convex, as the Wasserstein metrics are convex (see [76]).

It follows that Theorem 1.13 applies and yields a rate of posterior contraction relative to $d$, and hence relative to

$$W_k \sim d^{1/(k+2)} \big( \log(1/d) \big)^{(k+1/2)/(k+2)}.$$

We apply the theorem with $\mathcal{P} = \mathcal{P}_n$ equal to the set of mixtures $p_G = f * G$, as $G$ ranges over $\mathcal{M}[-a, a]$. Thus (1.13) is trivially satisfied.

For the entropy condition (1.11), by Proposition 1.7 we have

$$\log N(\varepsilon, \mathcal{P}_n, d) = \log N\Big( \varepsilon^{1/(k+2)} \Big( \log \frac{1}{\varepsilon} \Big)^{(k+1/2)/(k+2)}, \mathcal{M}[-a, a], W_k \Big) \lesssim \Big( \frac{1}{\varepsilon} \Big)^{1/(k+2)} \Big( \log \frac{1}{\varepsilon} \Big)^{1 + (k+1/2)/(k+2)}.$$

Thus (1.11) holds for the rate $\varepsilon_n \gtrsim n^{-\gamma}$, for every $\gamma < (k+2)/(2k+5)$.

In view of Proposition 1.10, the prior mass condition (1.12) is satisfied for the rate $\varepsilon_n \asymp (\log n/n)^{3/8}$.

Theorem 1.13 yields a rate of contraction relative to $d$ equal to the slower of the two rates, which is $(\log n/n)^{3/8}$. This translates into the rate for the Wasserstein distance given in Theorem 1.1.
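For completeness, the translation can be made explicit (a worked step that is implicit in the text): substituting the $d$-rate $\varepsilon_n = (\log n/n)^{3/8}$ into $W_k \sim d^{1/(k+2)} (\log(1/d))^{(k+1/2)/(k+2)}$ gives

$$W_k \lesssim \Big( \frac{\log n}{n} \Big)^{3/(8(k+2))} (\log n)^{(k+1/2)/(k+2)} \asymp n^{-3/(8k+16)} (\log n)^{(k+7/8)/(k+2)},$$

since $\frac{3}{8} \cdot \frac{1}{k+2} + \frac{k+1/2}{k+2} = \frac{k+7/8}{k+2}$; this is exactly the rate in (1.1).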

1.8 proof of theorem 1.2

We apply Theorem 1.13, with $\mathcal{P} = \mathcal{P}_n$ the set of all mixtures $p_G$ as $G$ ranges over $\mathcal{M}[-a, a]$. For $d = h$ the rate follows immediately by combining Propositions 1.6 and 1.10.

Since the densities $p_G$ are uniformly bounded by $1/2$, the $L_q$-distance $\|p_G - p_{G'}\|_q$ is bounded above by a multiple of $h(p_G, p_{G'})^{2/q}$. We can therefore apply Theorem 1.13 with the discrepancy $d(p, p') = \|p - p'\|_q^{q/2}$. In view of Proposition 1.5,

$$\log N(\varepsilon, \mathcal{P}_n, d) \lesssim \varepsilon^{-2/(q+1)} \log \varepsilon^{-1}.$$

Therefore setting $\varepsilon_n \asymp (\log n/n)^{(q+1)/(2q+4)}$ fulfills the entropy condition (1.11). By Proposition 1.10 the prior mass condition is satisfied for $\varepsilon_n \asymp (\log n/n)^{3/8}$. By Theorem 1.13 the rate of contraction relative to $d$ is the slower of these two rates, which is the first. The rate relative to the $L_q$-norm is the $(2/q)$-th power of this rate.
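As a check of the exponent in (1.3) (a worked step that is implicit in the text): raising the rate $(\log n/n)^{(q+1)/(2q+4)}$ to the power $2/q$ gives the exponent

$$\frac{2}{q} \cdot \frac{q+1}{2q+4} = \frac{q+1}{q(q+2)},$$

which equals $3/8$ at $q = 2$, in agreement with the Hellinger rate.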

1.9 normal mixtures

We reproduce the results on normal mixtures from [31], but in the $L_2$-norm. Note that the normal kernel is supersmooth with $\beta = 2$; by the Approximation Lemma 1.3, for any measure $G_1$ compactly supported on $[-a, a]$ we can always find a discrete measure $G_2$ with a number of support points of order $N \asymp \log \varepsilon^{-1}$ such that $\|p_{G_1} - p_{G_2}\|_2 \le \varepsilon$. By Lemma 4 in [33], we establish

$$h^2(p_{G_1}, p_{G_2}) \lesssim \|p_{G_1} - p_{G_2}\|_2.$$

Following the same procedure as before, with $G_0$ the true mixing measure, we obtain for the prior mass condition

$$\log \Pi\Big( G : \max\Big( P_{G_0} \log \frac{p_{G_0}}{p_G},\ P_{G_0}\Big( \log \frac{p_{G_0}}{p_G} \Big)^2 \Big) \le \varepsilon^2 \Big) \gtrsim -(\log \varepsilon^{-1})^2.$$

Thus we obtain $\varepsilon_n = \log n/\sqrt{n}$.
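To spell out the balancing step (implicit in the text): the prior mass condition (1.12) requires $n \varepsilon_n^2 \gtrsim (\log \varepsilon_n^{-1})^2$, and equating the two sides gives $\sqrt{n}\, \varepsilon_n \asymp \log \varepsilon_n^{-1} \asymp \log n$, i.e. $\varepsilon_n \asymp \log n/\sqrt{n}$.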

By the argument of Proposition 1.5, we have the following estimate for the entropy condition:

$$\log N(\varepsilon, \mathcal{G}_a, \|\cdot\|_2) \lesssim (\log \varepsilon^{-1})^2.$$

This coincides with the estimate for the prior mass condition, and thus we obtain the rate $\varepsilon_n = \log n/\sqrt{n}$ with respect to the $L_2$-norm. This is the same rate as obtained in [31], only now in the $L_2$-norm. However, we lose a $\sqrt{\log n}$-factor compared to [77], whose rate is $\sqrt{\log n/n}$.

