The handle http://hdl.handle.net/1887/49012 holds various files of this Leiden University dissertation.
Author: Gao, F.
Title: Bayes and networks
Issue Date: 2017-05-23
Part I
NONPARAMETRIC BAYESIAN DIRICHLET-LAPLACE DECONVOLUTION
1 POSTERIOR CONTRACTION RATES FOR DECONVOLUTION OF DIRICHLET-LAPLACE MIXTURES
1.1 introduction
Consider statistical inference using the following nonparametric hierarchical Bayesian model for observations 𝑋1, … , 𝑋𝑛:

(i) A probability distribution 𝐺 on ℝ is generated from the Dirichlet process prior DP(𝛼) with base measure 𝛼.

(ii) An iid sample 𝑍1, … , 𝑍𝑛 is generated from 𝐺.

(iii) An iid sample 𝑒1, … , 𝑒𝑛 is generated from a known density 𝑓, independent of the other samples.

(iv) The observations are 𝑋𝑖 = 𝑍𝑖 + 𝑒𝑖, for 𝑖 = 1, … , 𝑛.
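The sampling scheme (i)-(iv) is straightforward to simulate. The following minimal sketch (Python with NumPy; the truncation level and the concrete uniform base measure on [−𝑎, 𝑎] are illustrative choices, not part of the model itself) draws 𝐺 from a truncated stick-breaking representation of the Dirichlet process and then generates the observations:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dp(total_mass, base_sampler, trunc=500):
    """Approximate draw G ~ DP(alpha) via stick-breaking, truncated at
    `trunc` atoms; returns atom locations and normalized weights."""
    v = rng.beta(1.0, total_mass, size=trunc)
    stick = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    w = v * stick
    return base_sampler(trunc), w / w.sum()

n, a = 1000, 1.0
# (i) G ~ DP(alpha); base measure uniform on [-a, a] (a density bounded
#     away from 0 and infinity, as assumed later in Section 1.3)
atoms, weights = sample_dp(1.0, lambda m: rng.uniform(-a, a, m))
# (ii) Z_1, ..., Z_n iid from G
Z = rng.choice(atoms, size=n, p=weights)
# (iii) e_1, ..., e_n iid from the Laplace density f(x) = exp(-|x|)/2
e = rng.laplace(0.0, 1.0, size=n)
# (iv) observations X_i = Z_i + e_i
X = Z + e
```

Note that `numpy`'s Laplace sampler with scale 1 has exactly the density 𝑒^{−|𝑥|}/2 used throughout this chapter.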
In this setting, given 𝐺, the data 𝑋1, … , 𝑋𝑛 are a sample from the convolution

𝑝𝐺 = 𝑓 ∗ 𝐺

of the density 𝑓 and the measure 𝐺. The scheme defines a conditional distribution of 𝐺 given the data 𝑋1, … , 𝑋𝑛, the posterior distribution of 𝐺, and consequently also posterior distributions for quantities that derive from 𝐺, including the convolution density 𝑝𝐺. We are interested in whether this posterior distribution can recover a true mixing distribution 𝐺0 if the observations 𝑋1, … , 𝑋𝑛 are in reality a sample from the mixed distribution 𝑝𝐺0, for some given probability distribution 𝐺0.
The main contribution of this chapter is for the case that 𝑓 is the Laplace density 𝑓(𝑥) = 𝑒^{−|𝑥|}/2. For distributions on the full line, Laplace mixtures seem the second most popular class next to mixtures of the normal distribution, with applications in for instance speech recognition or astronomy ([42]) and clustering problems in genetics ([7]). For the present theoretical investigation the Laplace kernel is interesting as a test case of a non-supersmooth kernel.
We consider two notions of recovery. The first notion measures the distance between the posterior of 𝐺 and 𝐺0 through the Wasserstein metric

𝑊𝑘(𝐺, 𝐺′) = inf_{𝛾∈𝛤(𝐺,𝐺′)} (∫ |𝑥 − 𝑦|^𝑘 𝑑𝛾(𝑥, 𝑦))^{1/𝑘},

where 𝛤(𝐺, 𝐺′) is the collection of all couplings 𝛾 of 𝐺 and 𝐺′ into a bivariate measure with marginals 𝐺 and 𝐺′ (i.e. if (𝑥, 𝑦) ∼ 𝛾, then 𝑥 ∼ 𝐺 and 𝑦 ∼ 𝐺′), and 𝑘 ≥ 1. The Wasserstein metric is a classical metric on probability distributions, which is well suited for obtaining rates of estimation of measures. It is weaker than the total variation distance (which is more natural as a distance on densities), can be interpreted through transportation of measure (see [76]), and has also been used in applications such as comparing the color histograms of digital images. Recovery of the posterior distribution relative to the Wasserstein metric was considered by [61], within a general mixing framework. We refer to this paper for further motivation of the Wasserstein metric for mixtures, and to [76] for general background on the Wasserstein metric. In this chapter we improve the upper bound on posterior contraction rates given in [61], at least in the case of Laplace mixtures, obtaining a rate of nearly 𝑛^{−1/8} for 𝑊1 (and slower rates for 𝑘 > 1). Apparently the minimax rate of contraction for Laplace mixtures relative to the Wasserstein metric is currently unknown. Recent work on recovery of a mixing distribution by non-Bayesian methods is given in [80]. It is not clear from our result whether the upper bound 𝑛^{−1/8} is sharp.
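On the real line, 𝑊1 can be computed from the quantile functions of the two measures; SciPy's `scipy.stats.wasserstein_distance` implements exactly this. A small sketch (the two discrete measures below are arbitrary illustrations):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# two discrete mixing distributions on [-1, 1]
xs = np.array([-0.5, 0.0, 0.5]); ws = np.array([0.2, 0.5, 0.3])
ys = np.array([-0.4, 0.1, 0.6]); vs = np.array([0.3, 0.4, 0.3])

w1 = wasserstein_distance(xs, ys, ws, vs)

# W_1 is nonnegative, vanishes iff the measures coincide, and cannot
# exceed the diameter of the common support interval.
assert w1 >= 0.0
assert w1 <= 2.0
assert wasserstein_distance(xs, xs, ws, ws) < 1e-12
```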
The second notion of recovery measures the distance of the posterior of 𝐺 to 𝐺0 indirectly through the Hellinger or 𝐿𝑞-distances between the mixed densities 𝑝𝐺 and 𝑝𝐺0. This is equivalent to studying the estimation of the true density 𝑝𝐺0 of the observations through the density 𝑝𝐺 under the posterior distribution. As the Laplace kernel 𝑓 has Fourier transform

̃𝑓(𝜆) = 1/(1 + 𝜆²),

it follows that the mixed densities 𝑝𝐺 have Fourier transforms satisfying

| ̃𝑝𝐺(𝜆)| ≤ 1/(1 + 𝜆²).
Estimation of a density with a polynomially decaying Fourier transform was first considered in [77]. According to their Theorem in Section 3A, a suitable kernel estimator possesses a root mean square error of 𝑛^{−3/8} with respect to the 𝐿2-norm for estimating a density with Fourier transform that decays exactly at the order 2. This rate is the usual rate 𝑛^{−𝛼/(2𝛼+1)} of nonparametric estimation for smoothness 𝛼 = 3/2. This is understandable, as | ̃𝑝(𝜆)| ≲ 1/(1 + |𝜆|²) implies that ∫ (1 + |𝜆|²)^𝛼 | ̃𝑝(𝜆)|² 𝑑𝜆 < ∞, for every 𝛼 < 3/2, so that a density with Fourier transform decaying at square rate belongs to any Sobolev class of regularity 𝛼 < 3/2. Indeed, in [34] the rate 𝑛^{−𝛼/(2𝛼+1)} is shown to be minimax for estimating a density in a Sobolev ball of functions on the line. In this chapter we show that the posterior distribution of Laplace mixtures 𝑝𝐺 contracts to 𝑝𝐺0 at the rate 𝑛^{−3/8} up to a logarithmic factor, relative to the 𝐿2-norm and Hellinger distance, and we also establish rates for other 𝐿𝑞-metrics. Thus the Dirichlet posterior (nearly) attains the minimax rate for estimating a density in a Sobolev ball of order 3/2. It may be noted that the Laplace density itself is Hölder of exactly order 1, which implies that Laplace mixtures are Hölder smooth of at least the same order. This insight would suggest a rate 𝑛^{−1/3} (the usual nonparametric rate for 𝛼 = 1), which is slower than 𝑛^{−3/8}; hence this insight is misleading.
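The Sobolev claim above can be checked numerically: with the envelope | ̃𝑝(𝜆)| = (1 + 𝜆²)^{−1}, the integral ∫ (1 + |𝜆|²)^𝛼 | ̃𝑝(𝜆)|² 𝑑𝜆 stabilises as the integration range grows when 𝛼 < 3/2 and keeps growing when 𝛼 > 3/2. A rough sketch (grid sizes and cutoffs are arbitrary choices):

```python
import numpy as np

def sobolev_mass(alpha, cutoff, n=1_000_000):
    """Riemann-sum approximation of the integral of (1+lam^2)^alpha
    * |p_tilde(lam)|^2 over [-cutoff, cutoff], for the envelope
    |p_tilde(lam)| = 1/(1+lam^2)."""
    lam = np.linspace(0.0, cutoff, n)
    dlam = lam[1] - lam[0]
    return 2.0 * np.sum((1.0 + lam**2) ** (alpha - 2.0)) * dlam

# alpha = 1 < 3/2: the integral has essentially converged by cutoff 100
ratio_conv = sobolev_mass(1.0, 1e4) / sobolev_mass(1.0, 1e2)
# alpha = 1.6 > 3/2: the tail ~ |lam|^(2*alpha-4) is not integrable
ratio_div = sobolev_mass(1.6, 1e4) / sobolev_mass(1.6, 1e2)

assert ratio_conv < 1.01
assert ratio_div > 2.0
```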
Besides recovery relative to the Wasserstein metric and the induced metrics on 𝑝𝐺, one might consider recovery relative to a metric on the distribution function of 𝐺. Frequentist recovery rates for this problem were obtained in [27] under some restrictions. There is no simple relation between these rates and rates for the other metrics. The same is true for the rates for deconvolution of densities, as in [27]. In fact, the Dirichlet prior and posterior considered here are well known to concentrate on discrete distributions, and hence are useless as priors for recovering a density of 𝐺.
Contraction rates for Dirichlet mixtures of the normal kernel were considered in [31, 33, 44, 71, 72]. The results in these papers are driven by the smoothness of the Gaussian kernel, whence the same approach will fail for the Laplace kernel. Nevertheless we borrow the idea of approximating the true mixed density by a finite mixture, albeit that the approximation is constructed in a different manner. Because more support points than in the Gaussian case are needed to obtain a given quality of approximation, higher entropy and lower prior mass concentration result, leading to a slower rate of posterior contraction. To obtain the contraction rate for the Wasserstein metrics we further derive a relationship of these metrics with a power of the Hellinger distance, and next apply a variant of the contraction theorem in [30], whose proof is included in the appendix of the dissertation. Contraction rates of mixtures with priors other than the Dirichlet were considered in [71]. Recovery of the mixing distribution is a deconvolution problem and as such can be considered an inverse problem. A general approach to posterior contraction rates in inverse problems can be found in [41], and results specific to deconvolution can be found in [24]. These authors are interested in deconvolving a (smooth) mixing density rather than a mixing distribution, and hence their results are not directly comparable to the results in this dissertation.
The papers [28, 49] consider recovery of a mixing density relative to the 𝐿𝑝-norm in the frequentist setting. If the smoothness of the mixing density degenerates to 0, then the minimax rate decreases to a constant and it is not possible to find a consistent estimator. In this chapter we show that in the same problem, but viewed as a deconvolution problem on distributions endowed with the weaker Wasserstein distance, we may obtain polynomial rates for the mixing distribution without any smoothness assumption on the distribution. In particular, for any mixing distribution it is possible to construct a consistent estimator.
The chapter is organized as follows. In the next section we give notation and preliminaries. We state in Section 1.3 the main results of the chapter, which are proved in the subsequent sections. In Section 1.4 we establish suitable finite approximations relative to the 𝐿𝑞- and Hellinger distances. The 𝐿𝑞-approximations also apply to kernels other than the Laplace kernel, and are in terms of the tail decay of the kernel's characteristic function. In Sections 1.5 and 1.6 we apply these approximations to obtain bounds on the entropy of the mixtures relative to the 𝐿𝑞, Hellinger and Wasserstein metrics, and a lower bound on the prior mass in a neighbourhood of the true density. Sections 1.7 and 1.8 contain the proofs of the main results.
1.2 notation and preliminaries
Throughout the chapter integrals given without limits are considered to be integrals over the real line ℝ. The 𝐿𝑞-norm is denoted by

‖𝑔‖𝑞 = (∫ |𝑔(𝑥)|^𝑞 𝑑𝑥)^{1/𝑞},
with ‖⋅‖∞ being the uniform norm. The Hellinger distance on the space of densities is given by

ℎ(𝑓, 𝑔) = (∫ (𝑓^{1/2}(𝑥) − 𝑔^{1/2}(𝑥))² 𝑑𝑥)^{1/2}.

It is easy to see that ℎ²(𝑓, 𝑔) ≤ ‖𝑓 − 𝑔‖1 ≤ 2ℎ(𝑓, 𝑔), for any two probability densities 𝑓 and 𝑔. Furthermore, if the densities 𝑓 and 𝑔 are uniformly bounded by a constant 𝑀, then ‖𝑓 − 𝑔‖2 ≤ 2√𝑀 ℎ(𝑓, 𝑔).
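These inequalities are easy to verify numerically. A sketch for two (arbitrarily chosen) Laplace densities, using 𝑀 = 1/2 for the last bound:

```python
import numpy as np

x = np.linspace(-30.0, 30.0, 600_001)
dx = x[1] - x[0]
f = 0.5 * np.exp(-np.abs(x))         # Laplace density
g = 0.5 * np.exp(-np.abs(x - 1.0))   # a shifted copy

l1 = np.sum(np.abs(f - g)) * dx
l2 = np.sqrt(np.sum((f - g) ** 2) * dx)
h = np.sqrt(np.sum((np.sqrt(f) - np.sqrt(g)) ** 2) * dx)

assert h**2 <= l1 + 1e-8                     # h^2(f,g) <= ||f - g||_1
assert l1 <= 2.0 * h + 1e-8                  # ||f - g||_1 <= 2 h(f,g)
assert l2 <= 2.0 * np.sqrt(0.5) * h + 1e-8   # both densities bounded by M = 1/2
```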
The Kullback-Leibler discrepancy and corresponding variance are denoted by

𝐾(𝑝0, 𝑝) = ∫ log(𝑝0/𝑝) 𝑑𝑃0,  𝐾2(𝑝0, 𝑝) = ∫ (log(𝑝0/𝑝))² 𝑑𝑃0,
with 𝑃0 the measure corresponding to the density 𝑝0.
We are primarily interested in the Laplace kernel, but a number of results are true for general kernels 𝑓. The Fourier transform of a function 𝑓 and the inverse Fourier transform of a function ̃𝑓 are given by
̃𝑓(𝜆) = ∫ 𝑒^{𝚤𝜆𝑥} 𝑓(𝑥) 𝑑𝑥,  𝑓(𝑥) = (1/2𝜋) ∫ 𝑒^{−𝚤𝜆𝑥} ̃𝑓(𝜆) 𝑑𝜆.

For 1/𝑝 + 1/𝑞 = 1 and 1 ≤ 𝑝 ≤ 2, the Hausdorff-Young inequality gives

‖𝑓‖𝑞 ≤ (2𝜋)^{−1/𝑝} ‖ ̃𝑓‖𝑝.
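For the Laplace kernel 𝑓(𝑥) = 𝑒^{−|𝑥|}/2, with ̃𝑓(𝜆) = 1/(1 + 𝜆²), this inequality can be checked numerically; for 𝑝 = 𝑞 = 2 it holds with equality, by Plancherel's theorem. A sketch (truncation ranges and grids are arbitrary):

```python
import numpy as np

x = np.linspace(-40.0, 40.0, 400_001)
lam = np.linspace(-200.0, 200.0, 400_001)
dx, dlam = x[1] - x[0], lam[1] - lam[0]
f = 0.5 * np.exp(-np.abs(x))     # Laplace kernel
f_hat = 1.0 / (1.0 + lam**2)     # its Fourier transform

for p in (1.5, 2.0):
    q = p / (p - 1.0)            # conjugate exponent
    lhs = (np.sum(f**q) * dx) ** (1.0 / q)
    rhs = (2.0 * np.pi) ** (-1.0 / p) * (np.sum(f_hat**p) * dlam) ** (1.0 / p)
    # Hausdorff-Young: ||f||_q <= (2 pi)^(-1/p) ||f_hat||_p
    assert lhs <= rhs + 1e-6
```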
The covering number 𝑁(𝜀, 𝛩, 𝜌) of a metric space (𝛩, 𝜌) is the minimum number of 𝜀-balls needed to cover the entire space 𝛩.
Throughout the chapter ≲ denotes inequality up to a constant multiple, where the constant is universal or fixed within the context. Furthermore 𝑎𝑛 ≍ 𝑏𝑛 means that for some positive constants 𝑐 and 𝐶

𝑐 ≤ lim inf_{𝑛→∞} 𝑎𝑛/𝑏𝑛 ≤ lim sup_{𝑛→∞} 𝑎𝑛/𝑏𝑛 ≤ 𝐶.

We denote by ℳ[−𝑎, 𝑎] the set of all probability measures on a given interval [−𝑎, 𝑎].
1.3 main results
Write 𝛱𝑛(⋅ ∣ 𝑋1, … , 𝑋𝑛) for the posterior distribution of 𝐺 in the scheme (i)-(iv) introduced at the beginning of the chapter. We study this random distribution assuming that 𝑋1, … , 𝑋𝑛 are an iid sample from the mixture density 𝑝𝐺0 = 𝑓 ∗ 𝐺0, for a given probability distribution 𝐺0. We assume that 𝐺0 is supported in a compact interval [−𝑎, 𝑎], and that the base measure 𝛼 of the Dirichlet prior in (i) is concentrated on this interval with a Lebesgue density bounded away from 0 and ∞.
Theorem 1.1. If 𝐺0 is supported on [−𝑎, 𝑎], 𝑓 is the Laplace kernel, and 𝛼 has support [−𝑎, 𝑎] with Lebesgue density bounded away from 0 and ∞, then for every 𝑘 ≥ 1 there exists a constant 𝑀 such that

𝛱𝑛(𝐺 ∶ 𝑊𝑘(𝐺, 𝐺0) ≥ 𝑀 𝑛^{−3/(8𝑘+16)} (log 𝑛)^{(𝑘+7/8)/(𝑘+2)} ∣ 𝑋1, … , 𝑋𝑛) → 0, (1.1)

in 𝑃𝐺0-probability.
The rate for the Wasserstein metric 𝑊𝑘 given in the theorem deteriorates with increasing 𝑘, which is perhaps not unreasonable as the Wasserstein metrics increase with 𝑘. The fastest rate is obtained for 𝑊1, at 𝑛^{−1/8}(log 𝑛)^{5/8}.
Theorem 1.2. If 𝐺0 is supported on [−𝑎, 𝑎], 𝑓 is the Laplace kernel, and 𝛼 has support [−𝑎, 𝑎] with Lebesgue density bounded away from 0 and ∞, then there exists a constant 𝑀 such that

𝛱𝑛(𝐺 ∶ ℎ(𝑝𝐺, 𝑝𝐺0) ≥ 𝑀 (log 𝑛/𝑛)^{3/8} ∣ 𝑋1, … , 𝑋𝑛) → 0, (1.2)

in 𝑃𝐺0-probability. Furthermore, for every 𝑞 ∈ [2, ∞) there exists 𝑀𝑞 such that

𝛱𝑛(𝐺 ∶ ‖𝑝𝐺 − 𝑝𝐺0‖𝑞 ≥ 𝑀𝑞 (log 𝑛/𝑛)^{(𝑞+1)/(𝑞(𝑞+2))} ∣ 𝑋1, … , 𝑋𝑛) → 0, (1.3)

in 𝑃𝐺0-probability.
The rate for the 𝐿𝑞-distance given in (1.3) deteriorates with increasing 𝑞. For 𝑞 = 2 it is the same as the rate (log 𝑛/𝑛)^{3/8} for the Hellinger distance.
In both theorems the mixing distributions are assumed to be supported on a fixed compact set. Without a restriction on the tails of the mixing distributions, no rate is possible. The assumption of a compact support ensures that the rate is fully determined by the complexity of the mixtures, and not by their tail behaviour.
1.4 finite approximation
In this section we show that a general mixture 𝑝𝐺 can be approximated by a mixture with finitely many components, where the number of components depends on the accuracy of the approximation, the distance used, and the kernel 𝑓. We first consider approximations with respect to the 𝐿𝑞-norm, which apply to mixtures 𝑝𝐺 = 𝑓 ∗ 𝐺 for a general kernel 𝑓, and next approximations with respect to the Hellinger distance for the case that 𝑓 is the Laplace kernel. The first result generalizes the result of [31] for normal mixtures. Also see [71] for results on Dirichlet mixtures of exponential power densities.
The result splits in two cases, depending on the tail behaviour of the Fourier transform ̃𝑓 of 𝑓:

- ordinary smooth 𝑓: lim sup_{|𝜆|→∞} | ̃𝑓(𝜆)| |𝜆|^𝛽 < ∞, for some 𝛽 > 1/2.

- supersmooth 𝑓: lim sup_{|𝜆|→∞} | ̃𝑓(𝜆)| 𝑒^{|𝜆|^𝛽} < ∞, for some 𝛽 > 0.
Lemma 1.3 (Approximation Lemma). Let 𝜀 < 1 be sufficiently small and fixed. For a probability measure 𝐺 on an interval [−𝑎, 𝑎] and 2 ≤ 𝑞 ≤ ∞, there exists a discrete measure 𝐺′ on [−𝑎, 𝑎] with at most 𝑁 support points in [−𝑎, 𝑎] such that

‖𝑝𝐺 − 𝑝𝐺′‖𝑞 ≲ 𝜀,

where

(i) 𝑁 ≲ 𝜀^{−1/(𝛽−1/𝑝)} if 𝑓 is ordinary smooth of order 𝛽 with 𝛽 > 1/𝑝, for 𝑝 and 𝑞 conjugate (1/𝑝 + 1/𝑞 = 1);

(ii) 𝑁 ≲ (log 𝜀^{−1})^{max(1,1/𝛽)} if 𝑓 is supersmooth of order 𝛽.
Proof. The Fourier transform of 𝑝𝐺 is given by ̃𝑓 ̃𝐺, where ̃𝐺 is the Fourier transform of 𝐺 defined by ̃𝐺(𝜆) = ∫ 𝑒^{𝚤𝜆𝑧} 𝑑𝐺(𝑧). Determine 𝐺′ so that it possesses the same moments as 𝐺 up to order 𝑘 − 1, i.e.

∫ 𝑧^𝑗 𝑑(𝐺 − 𝐺′)(𝑧) = 0,  ∀ 0 ≤ 𝑗 ≤ 𝑘 − 1.

By Lemma A.1 in [31], 𝐺′ can be chosen to have at most 𝑘 support points.
Then for 𝐺 and 𝐺′ supported on [−𝑎, 𝑎], we have

| ̃𝐺(𝜆) − ̃𝐺′(𝜆)| = |∫ (𝑒^{𝚤𝜆𝑧} − ∑_{𝑗=0}^{𝑘−1} (𝚤𝜆𝑧)^𝑗/𝑗!) 𝑑(𝐺 − 𝐺′)(𝑧)| ≤ ∫ (|𝚤𝜆𝑧|^𝑘/𝑘!) 𝑑(𝐺 + 𝐺′)(𝑧) ≤ (𝑎𝑒|𝜆|/𝑘)^𝑘.

The inequality comes from |𝑒^{𝚤𝑦} − ∑_{𝑗=0}^{𝑘−1} (𝚤𝑦)^𝑗/𝑗!| ≤ |𝑦|^𝑘/𝑘! ≤ (𝑒|𝑦|)^𝑘/𝑘^𝑘, for every 𝑦 ∈ ℝ.
Therefore, by the Hausdorff-Young inequality,

‖𝑝𝐺 − 𝑝𝐺′‖𝑞^𝑝 ≤ (1/2𝜋) ∫ | ̃𝑓(𝜆)|^𝑝 | ̃𝐺(𝜆) − ̃𝐺′(𝜆)|^𝑝 𝑑𝜆 ≲ ∫_{|𝜆|>𝑀} | ̃𝑓(𝜆)|^𝑝 𝑑𝜆 + ∫_{|𝜆|≤𝑀} (𝑒𝑎|𝜆|/𝑘)^{𝑝𝑘} 𝑑𝜆.

We denote the first term in the preceding display by 𝐼1 and the second term by 𝐼2. It is easy to bound 𝐼2 as

𝐼2 ≍ (𝑒𝑎/𝑘)^{𝑘𝑝} 𝑀^{𝑘𝑝+1}/(𝑘𝑝 + 1) ≲ (𝑒𝑎𝑀/𝑘)^{𝑘𝑝+1} (1/𝑝).

For 𝐼1 we separately consider the cases of ordinary smoothness and supersmoothness.
In the supersmooth case with parameter 𝛽, we note that the function 𝑡^{1/𝛽−1}/𝑒^{𝛿𝑡} is monotonically decreasing for 𝑡 ≥ 𝑝𝑀^𝛽, when 𝛿 ≥ (1/𝛽 − 1)/(𝑝𝑀^𝛽). Thus, for large 𝑀,

𝐼1 ≲ ∫_{|𝜆|>𝑀} 𝑒^{−𝑝|𝜆|^𝛽} 𝑑𝜆 = (2/(𝛽𝑝^{1/𝛽})) ∫_{𝑡>𝑝𝑀^𝛽} 𝑒^{−𝑡} 𝑡^{1/𝛽−1} 𝑑𝑡 ≤ (2/(𝛽𝑝^{1/𝛽})) (𝑝𝑀^𝛽)^{1/𝛽−1} 𝑒^{−𝛿𝑝𝑀^𝛽} ∫_{𝑡>𝑝𝑀^𝛽} 𝑒^{−(1−𝛿)𝑡} 𝑑𝑡 = (2/(1 − 𝛿)) (1/(𝛽𝑝)) 𝑒^{−𝑝𝑀^𝛽} 𝑀^{1−𝛽},

where the bound is sharper if 𝛿 is smaller. Choosing the minimal value of 𝛿, we obtain

𝐼1 ≲ (1/(1 − (1/𝛽 − 1)/(𝑝𝑀^𝛽))) (1/(𝛽𝑝)) 𝑒^{−𝑝𝑀^𝛽} 𝑀^{1−𝛽} ≲ 𝑀^{1−𝛽} 𝑒^{−𝑝𝑀^𝛽},

for 𝑀 sufficiently large. We next choose 𝑀 = 2(log(1/𝜀))^{1/𝛽} in order to ensure that 𝐼1 ≤ 𝜀^𝑝. Then 𝐼2 ≲ 𝜀^𝑝 if 𝑘 ≥ 2𝑒𝑎𝑀 and 2^{−𝑘𝑝} ≤ 𝜀^𝑝. This is satisfied if 𝑘 = 2(log 𝜀^{−1})^{max(1/𝛽,1)}.
In the ordinary smooth case with smoothness parameter 𝛽, we have the bound

𝐼1 ≲ ∫_{|𝜆|>𝑀} |𝜆|^{−𝛽𝑝} 𝑑𝜆 ≲ (1/𝑀)^{𝛽𝑝−1}.

We choose 𝑀 = (1/𝜀)^{1/(𝛽−1/𝑝)} to render the right side equal to 𝜀^𝑝. Then 𝐼2 ≲ 𝜀^𝑝 if 𝑘 = 2𝜀^{−1/(𝛽−1/𝑝)}.
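A concrete way to realize the moment matching of the proof, for a specific 𝐺, is Gaussian quadrature: the 𝑁-point Gauss rule for 𝐺 matches all moments up to order 2𝑁 − 1. The sketch below (𝐺 uniform on [−1, 1] and the Laplace kernel; all concrete sizes are illustrative choices) shows the 𝐿2 error of the finitely supported approximation dropping as 𝑁 grows:

```python
import numpy as np

def laplace_mixture(x, atoms, weights):
    """p_G = f * G for the Laplace kernel f and a discrete mixing measure."""
    return np.sum(
        weights[:, None] * 0.5 * np.exp(-np.abs(x[None, :] - atoms[:, None])),
        axis=0,
    )

x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

# G = uniform on [-1, 1], represented by a fine grid of equally weighted atoms
fine = np.linspace(-1.0, 1.0, 1001)
p_G = laplace_mixture(x, fine, np.full(fine.size, 1.0 / fine.size))

errors = []
for N in (1, 2, 4):
    # Gauss-Legendre nodes/weights: N atoms matching moments up to order 2N-1
    nodes, w = np.polynomial.legendre.leggauss(N)
    p_approx = laplace_mixture(x, nodes, w / w.sum())
    errors.append(np.sqrt(np.sum((p_G - p_approx) ** 2) * dx))

assert errors[0] > errors[1] > errors[2]   # more support points, smaller L2 error
```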
The number of support points in the preceding lemma is increasing in 𝑞 and decreasing in 𝛽. For approximation in the 𝐿2-norm (𝑞 = 2), the number of support points is of order 𝜀^{−1/(𝛽−1/2)}, and this reduces to 𝜀^{−2/3} for the Laplace kernel (ordinary smooth with 𝛽 = 2). An interpretation of the exponent 𝛽 − 1/2 is the (almost) Sobolev smoothness of 𝑝𝐺, since, for 𝛼 < 𝛽 − 1/2,

∫ (1 + |𝜆|²)^𝛼 | ̃𝑝𝐺(𝜆)|² 𝑑𝜆 ≲ ∫ (1 + |𝜆|²)^𝛼 | ̃𝑓(𝜆)|² 𝑑𝜆 < ∞.

We do not have a compelling intuition for this correspondence.
The Hellinger distance is more sensitive to areas where the densities are close to zero. As a consequence, the approach in the preceding lemma does not give sharp results. The following lemma does, but is restricted to the Laplace kernel.
Lemma 1.4. For a probability measure 𝐺 supported on [−𝑎, 𝑎] there exists a discrete measure 𝐺′ with at most 𝑁 ≍ 𝜀^{−2/3} support points such that, for 𝑝𝐺 = 𝑓 ∗ 𝐺 and 𝑓 the Laplace density,

ℎ(𝑝𝐺, 𝑝𝐺′) ≤ 𝜀.
Proof. Since 𝑝𝐺(𝑥) ≥ 𝑓(|𝑥| + 𝑎) = 𝑒^{−𝑎}𝑒^{−|𝑥|}/2, for every 𝑥 and every probability measure 𝐺 supported on [−𝑎, 𝑎], the Hellinger distance between Laplace mixtures satisfies

ℎ²(𝑝𝐺, 𝑝𝐺′) ≤ ∫ ((𝑝𝐺 − 𝑝𝐺′)²/(𝑝𝐺 + 𝑝𝐺′))(𝑥) 𝑑𝑥 ≤ 𝑒^𝑎 ∫ (𝑝𝐺′(𝑥) − 𝑝𝐺(𝑥))² 𝑒^{|𝑥|} 𝑑𝑥.

If we write 𝑞𝐺(𝑥) = 𝑝𝐺(𝑥)𝑒^{|𝑥|/2}, and ̃𝑞𝐺 for the corresponding Fourier transform, then by Plancherel's theorem the integral on the right side is equal to

(1/2𝜋) ∫ | ̃𝑞𝐺′ − ̃𝑞𝐺|²(𝜆) 𝑑𝜆.
By an explicit computation we obtain

̃𝑞𝐺(𝜆) = (1/2) ∫∫ 𝑒^{𝚤𝜆𝑥} 𝑒^{−|𝑥−𝑧|+|𝑥|/2} 𝑑𝑥 𝑑𝐺(𝑧) = (1/2) ∫ 𝑟(𝜆, 𝑧) 𝑑𝐺(𝑧),

where 𝑟(𝜆, 𝑧) is given by

𝑟(𝜆, 𝑧) = 𝑒^{−𝑧}/(𝚤𝜆 + 1/2) + 𝑒^{−𝑧}(𝑒^{(𝚤𝜆+3/2)𝑧} − 1)/(𝚤𝜆 + 3/2) − 𝑒^{(𝚤𝜆+1/2)𝑧}/(𝚤𝜆 − 1/2)
= 𝑒^{−𝑧}/((𝚤𝜆 + 1/2)(𝚤𝜆 + 3/2)) − 2𝑒^{𝚤𝜆𝑧}𝑒^{𝑧/2}/((𝚤𝜆 + 3/2)(𝚤𝜆 − 1/2)). (1.4)

Now let 𝐺′ be a discrete measure on [−𝑎, 𝑎] such that
∫ 𝑒^{−𝑧} 𝑑(𝐺′ − 𝐺)(𝑧) = 0,
∫ 𝑒^{𝑧/2} 𝑧^𝑗 𝑑(𝐺′ − 𝐺)(𝑧) = 0,  ∀ 0 ≤ 𝑗 ≤ 𝑘 − 1.

By Lemma A.1 in [31], 𝐺′ can be chosen to have at most 𝑘 + 1 support points.
By the choice of 𝐺′ the first term of 𝑟(𝜆, 𝑧) gives no contribution to the difference ∫ 𝑟(𝜆, 𝑧) 𝑑(𝐺′ − 𝐺)(𝑧). As the second term of 𝑟(𝜆, 𝑧) is for large |𝜆| bounded in absolute value by a multiple of |𝜆|^{−2}, it follows that

𝐼2 ∶= ∫_{|𝜆|>𝑀} |∫ 𝑟(𝜆, 𝑧) 𝑑(𝐺′ − 𝐺)(𝑧)|² 𝑑𝜆 ≲ ∫_{𝜆>𝑀} 𝜆^{−4} 𝑑𝜆 ≍ 𝑀^{−3}.
Again by the choice of 𝐺′, the integral ∫ 𝑟(𝜆, 𝑧) 𝑑(𝐺′ − 𝐺)(𝑧) remains the same if we replace 𝑒^{𝚤𝜆𝑧} by 𝑒^{𝚤𝜆𝑧} − ∑_{𝑗=0}^{𝑘} (𝚤𝜆𝑧)^𝑗/𝑗! in the second term of 𝑟(𝜆, 𝑧). It follows that

𝐼1 ∶= ∫_{|𝜆|≤𝑀} |∫ 𝑟(𝜆, 𝑧) 𝑑(𝐺′ − 𝐺)(𝑧)|² 𝑑𝜆
≤ ∫_{|𝜆|≤𝑀} |2/((𝚤𝜆 + 3/2)(𝚤𝜆 − 1/2))|² |∫ 𝑒^{𝑧/2}[𝑒^{𝚤𝜆𝑧} − ∑_{𝑗=0}^{𝑘} (𝚤𝜆𝑧)^𝑗/𝑗!] 𝑑(𝐺′ − 𝐺)(𝑧)|² 𝑑𝜆
≲ ∫_0^𝑀 ((𝑎𝜆)^{2𝑘}/(𝑘!)²) 𝑑𝜆 ≲ (𝑎𝑒𝑀)^{2𝑘+1}/𝑘^{2𝑘+1},

since |𝑧| ≤ 𝑎. It follows, by a similar argument as in the proof of Lemma 1.3, that we can reduce both 𝐼1 and 𝐼2 to 𝜀² by choosing 𝑀 ≍ 𝜀^{−2/3} and 𝑘 = 2𝑎𝑒𝑀.
1.5 entropy
We study the covering numbers of the class of mixtures 𝑝𝐺 = 𝑓 ∗ 𝐺, where 𝐺 ranges over the collection ℳ[−𝑎, 𝑎] of all probability measures on [−𝑎, 𝑎]. We present a bound for any 𝐿𝑟-norm and general kernels 𝑓, and a bound for the Hellinger distance that is specific to the Laplace kernel. Note that 𝑓^{(1)} is the first derivative of 𝑓.
Proposition 1.5. If both ‖𝑓‖𝑟 and ‖𝑓^{(1)}‖𝑟 are finite and ̃𝑓 has ordinary smoothness 𝛽, then, for 𝑝𝐺 = 𝑓 ∗ 𝐺, and any 𝑟 ≥ 2,

log 𝑁(𝜀, {𝑝𝐺 ∶ 𝐺 ∈ ℳ[−𝑎, 𝑎]}, ‖⋅‖𝑟) ≲ 𝜀^{−1/(𝛽−1+1/𝑟)} log 𝜀^{−1}. (1.5)

Proof. Write 𝒢𝑎 = {𝑝𝐺 ∶ 𝐺 ∈ ℳ[−𝑎, 𝑎]}. In light of Lemma 1.3, the set of all mixtures 𝑝𝐺 with 𝐺 a discrete probability measure with 𝑁 ≲ 𝜀^{−1/(𝛽−1+1/𝑟)} support points forms an 𝜀-net over 𝒢𝑎. It therefore suffices to construct an 𝜀-net of the given cardinality over this set of discrete mixtures.
By Jensen's inequality and Fubini's theorem,

‖𝑓(⋅ − 𝜃) − 𝑓‖𝑟 = (∫ |𝜃 ∫_0^1 𝑓^{(1)}(𝑥 − 𝜃𝑠) 𝑑𝑠|^𝑟 𝑑𝑥)^{1/𝑟} ≤ ‖𝑓^{(1)}‖𝑟 𝜃.
Furthermore, for any probability vectors 𝑝 and 𝑝′ and locations 𝜃𝑖,

‖∑_{𝑖=1}^𝑁 𝑝𝑖 𝑓(⋅ − 𝜃𝑖) − ∑_{𝑖=1}^𝑁 𝑝′𝑖 𝑓(⋅ − 𝜃𝑖)‖𝑟 ≤ ∑_{𝑖=1}^𝑁 |𝑝𝑖 − 𝑝′𝑖| ‖𝑓(⋅ − 𝜃𝑖)‖𝑟 = ‖𝑓‖𝑟 ‖𝑝 − 𝑝′‖1.

Combining these inequalities, we see that for two discrete probability measures 𝐺 = ∑_{𝑖=1}^𝑁 𝑝𝑖𝛿𝜃𝑖 and 𝐺′ = ∑_{𝑖=1}^𝑁 𝑝′𝑖𝛿𝜃′𝑖,

‖𝑝𝐺 − 𝑝𝐺′‖𝑟 ≤ ‖𝑓^{(1)}‖𝑟 max𝑖 |𝜃𝑖 − 𝜃′𝑖| + ‖𝑓‖𝑟 ‖𝑝 − 𝑝′‖1. (1.6)

Thus we can construct an 𝜀-net over the discrete mixtures by relocating the support points (𝜃𝑖)_{𝑖=1}^𝑁 to the nearest points (𝜃′𝑖)_{𝑖=1}^𝑁 in an 𝜀-net on [−𝑎, 𝑎], and relocating the weights 𝑝 to the nearest point 𝑝′ in an 𝜀-net for the 𝑙1-norm over the 𝑁-dimensional 𝑙1-unit simplex. This gives a set of at most
(2𝑎/𝜀)^𝑁 (5/𝜀)^𝑁 ∼ (10𝑎/𝜀²)^𝑁

measures 𝑝𝐺 (cf. Lemma A.4 of [33] for the entropy of the 𝑙1-unit simplex). This gives the bound of the proposition.
Proposition 1.6. For 𝑓 the Laplace kernel and 𝑝𝐺 = 𝑓 ∗ 𝐺,

log 𝑁(𝜀, {𝑝𝐺 ∶ 𝐺 ∈ ℳ[−𝑎, 𝑎]}, ℎ) ≲ 𝜀^{−2/3} log(𝜀^{−1}). (1.7)

Proof. Since the function √𝑓 is absolutely continuous with derivative 𝑥 ↦ −2^{−3/2}𝑒^{−|𝑥|/2} sgn(𝑥), we have by Jensen's inequality and Fubini's theorem that

ℎ²(𝑓, 𝑓(⋅ − 𝜃)) = ∫ (𝜃 ∫_0^1 −2^{−3/2}𝑒^{−|𝑥−𝜃𝑠|/2} sgn(𝑥 − 𝜃𝑠) 𝑑𝑠)² 𝑑𝑥 ≤ 𝜃² ∫_0^1 ∫ 𝑒^{−|𝑥−𝜃𝑠|} 𝑑𝑥 𝑑𝑠 = 2𝜃².

It follows that ℎ(𝑓, 𝑓(⋅ − 𝜃)) ≲ 𝜃.
By convexity of the map (𝑢, 𝑣) ↦ (√𝑢 − √𝑣)², we have

|√(∑𝑖 𝑝𝑖 𝑓(⋅ − 𝜃𝑖)) − √(∑𝑖 𝑝𝑖 𝑓(⋅ − 𝜃′𝑖))|² ≤ ∑𝑖 𝑝𝑖 [√𝑓(⋅ − 𝜃𝑖) − √𝑓(⋅ − 𝜃′𝑖)]².

By integrating this inequality we see that the densities 𝑝𝐺 and 𝑝𝐺′ with mixing distributions 𝐺 = ∑_{𝑖=1}^𝑁 𝑝𝑖𝛿𝜃𝑖 and 𝐺′ = ∑_{𝑖=1}^𝑁 𝑝𝑖𝛿𝜃′𝑖 satisfy

ℎ²(𝑝𝐺, 𝑝𝐺′) ≲ ∑𝑖 𝑝𝑖 |𝜃𝑖 − 𝜃′𝑖|² ≤ ‖𝜃 − 𝜃′‖²_∞.
For distributions 𝐺 = ∑_{𝑖=1}^𝑁 𝑝𝑖𝛿𝜃𝑖 and 𝐺′ = ∑_{𝑖=1}^𝑁 𝑝′𝑖𝛿𝜃𝑖 with the same support points, but different weights, we have

ℎ²(𝑝𝐺, 𝑝𝐺′) ≤ ∫ (∑_{𝑖=1}^𝑁 (𝑝𝑖 − 𝑝′𝑖)𝑓(𝑥 − 𝜃𝑖))² / (∑_{𝑖=1}^𝑁 (𝑝𝑖 + 𝑝′𝑖)𝑓(𝑥 − 𝜃𝑖)) 𝑑𝑥 ≤ ∫ (∑_{𝑖=1}^𝑁 |𝑝𝑖 − 𝑝′𝑖|)² 𝑓²(|𝑥| − 𝑎)/(2𝑓(|𝑥| + 𝑎)) 𝑑𝑥 ≲ ‖𝑝 − 𝑝′‖²_1.

Therefore the bound follows by arguments similar to those in the proof of Proposition 1.5, where presently we use Lemma 1.4 to determine suitable finite approximations.
The map 𝐺 ↦ 𝑝𝐺 = 𝑓 ∗ 𝐺 is one-to-one as soon as the characteristic function of 𝑓 is never zero. Under this condition we can also view the Wasserstein distance on the mixing distribution as a distance on the mixtures. Obviously the covering numbers are then free of the kernel.
Proposition 1.7. For any 𝑘 ≥ 1, and any sufficiently small 𝜀 > 0,

log 𝑁(𝜀, ℳ[−𝑎, 𝑎], 𝑊𝑘) ≲ 𝜀^{−1} log 𝜀^{−1}. (1.8)

The proposition is a consequence of Lemma 1.9, below, which applies to the set of all Borel probability measures on a general metric space (𝛩, 𝜌) (cf. [61]).
Lemma 1.8. For any probability measure 𝐺 concentrated on countably many disjoint sets 𝛩1, 𝛩2, … and probability measure 𝐺′ concentrated on disjoint sets 𝛩′1, 𝛩′2, …,

𝑊𝑘(𝐺, 𝐺′) ≤ sup𝑖 sup_{𝜃𝑖∈𝛩𝑖, 𝜃′𝑖∈𝛩′𝑖} 𝜌(𝜃𝑖, 𝜃′𝑖) + diam(𝛩) (∑𝑖 |𝐺(𝛩𝑖) − 𝐺′(𝛩′𝑖)|)^{1/𝑘}.
In particular,

𝑊𝑘(∑𝑖 𝑝𝑖𝛿𝜃𝑖, ∑𝑖 𝑝′𝑖𝛿𝜃′𝑖) ≤ max𝑖 𝜌(𝜃𝑖, 𝜃′𝑖) + diam(𝛩) ‖𝑝 − 𝑝′‖_1^{1/𝑘}.

Proof. For 𝑝𝑖 = 𝐺(𝛩𝑖) and 𝑝′𝑖 = 𝐺′(𝛩′𝑖) divide the interval [0, ∑𝑖 𝑝𝑖 ∧ 𝑝′𝑖] into disjoint intervals 𝐼𝑖 of lengths 𝑝𝑖 ∧ 𝑝′𝑖. We couple variables ̄𝜃 and ̄𝜃′ by an auxiliary uniform variable 𝑈. If 𝑈 ∈ 𝐼𝑖, then generate ̄𝜃 ∼ 𝐺(⋅|𝛩𝑖) and ̄𝜃′ ∼ 𝐺′(⋅|𝛩′𝑖). Divide the remaining interval [∑𝑖 𝑝𝑖 ∧ 𝑝′𝑖, 1] into intervals 𝐽𝑖 of lengths 𝑝𝑖 − 𝑝𝑖 ∧ 𝑝′𝑖 and, separately, intervals 𝐽′𝑖 of lengths 𝑝′𝑖 − 𝑝𝑖 ∧ 𝑝′𝑖. If 𝑈 ∈ 𝐽𝑖, then generate ̄𝜃 ∼ 𝐺(⋅|𝛩𝑖), and if 𝑈 ∈ 𝐽′𝑖, then generate ̄𝜃′ ∼ 𝐺′(⋅|𝛩′𝑖). Then ̄𝜃 and ̄𝜃′ have marginal distributions 𝐺 and 𝐺′, and

𝔼𝜌^𝑘( ̄𝜃, ̄𝜃′) ≤ 𝔼[𝜌^𝑘( ̄𝜃, ̄𝜃′) 1_{𝑈≤∑𝑖 𝑝𝑖∧𝑝′𝑖}] + diam(𝛩)^𝑘 ℙ(𝑈 > ∑𝑖 𝑝𝑖 ∧ 𝑝′𝑖).

The first term is bounded by the 𝑘-th power of the first term of the lemma, while the probability in the second term is equal to 1 − ∑𝑖 𝑝𝑖 ∧ 𝑝′𝑖 = ∑𝑖 |𝑝𝑖 − 𝑝′𝑖|/2.
Lemma 1.9. For the set ℳ(𝛩) of all Borel probability measures on a metric space (𝛩, 𝜌), any 𝑘 ≥ 1, and 0 < 𝜀 < min{2/3, diam(𝛩)},

𝑁(𝜀, ℳ(𝛩), 𝑊𝑘) ≤ (4 diam(𝛩)/𝜀)^{𝑘𝑁(𝜀,𝛩,𝜌)}.

Proof. For a minimal 𝜀-net over 𝛩 of 𝑁 = 𝑁(𝜀, 𝛩, 𝜌) points, let 𝛩 = ∪𝑖𝛩𝑖 be the partition obtained by assigning each 𝜃 to a closest point of the net. For any 𝐺 let 𝐺𝜀 = ∑𝑖 𝐺(𝛩𝑖)𝛿𝜃𝑖, for arbitrary but fixed 𝜃𝑖 ∈ 𝛩𝑖. Since 𝑊𝑘(𝐺, 𝐺𝜀) ≤ 𝜀 by Lemma 1.8, 𝑁(2𝜀, ℳ(𝛩), 𝑊𝑘) ≤ 𝑁(𝜀, ℳ𝜀, 𝑊𝑘) holds for ℳ𝜀 the set of all 𝐺𝜀. We next form the measures 𝐺𝜀,𝑝 = ∑𝑖 𝑝𝑖𝛿𝜃𝑖 for (𝑝1, … , 𝑝𝑁) ranging over an (𝜀/diam(𝛩))^𝑘-net for the 𝑙1-distance over the 𝑁-dimensional unit simplex. By Lemma 1.8 every 𝐺𝜀 is within 𝑊𝑘-distance 𝜀 of some 𝐺𝜀,𝑝. Therefore the proof is completed, because 𝑁(𝜀, ℳ𝜀, 𝑊𝑘) is bounded from above by the number of points 𝑝, which is bounded by (4 diam(𝛩)/𝜀)^{𝑘𝑁} (cf. Lemma A.4 in [31]).
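The "in particular" bound of Lemma 1.8 (with 𝑘 = 1 and 𝛩 = [−𝑎, 𝑎], so diam(𝛩) = 2𝑎) can be checked against the exact 𝑊1 distance. A sketch with arbitrary random discrete measures:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
a = 1.0
theta = rng.uniform(-a, a, 5)                                  # support of G
theta_prime = np.clip(theta + rng.uniform(-0.05, 0.05, 5), -a, a)  # perturbed
p = rng.dirichlet(np.ones(5))                                  # weights of G
p_prime = rng.dirichlet(np.ones(5))                            # weights of G'

w1 = wasserstein_distance(theta, theta_prime, p, p_prime)
# Lemma 1.8: W_1 <= max_i |theta_i - theta'_i| + diam * ||p - p'||_1
bound = np.max(np.abs(theta - theta_prime)) + 2 * a * np.sum(np.abs(p - p_prime))
assert w1 <= bound + 1e-12
```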
1.6 prior mass
The main result of this section is the following proposition, which gives a lower bound on the prior mass of the prior (i)-(iv) in a neighbourhood of a mixture 𝑝𝐺0.

Proposition 1.10. If 𝛱 is the Dirichlet process DP(𝛼) with base measure 𝛼 that has a Lebesgue density bounded away from 0 and ∞ on its support [−𝑎, 𝑎], and 𝑓 is the Laplace kernel, then for every sufficiently small 𝜀 > 0 and every probability measure 𝐺0 on [−𝑎, 𝑎],

log 𝛱(𝐺 ∶ 𝐾(𝑝𝐺, 𝑝𝐺0) ≤ 𝜀², 𝐾2(𝑝𝐺, 𝑝𝐺0) ≤ 𝜀²) ≳ −(1/𝜀)^{2/3} log(1/𝜀).
Proof. By Lemma 1.4 there exists a discrete measure 𝐺1 with 𝑁 ≲ 𝜀^{−2/3} support points such that ℎ(𝑝𝐺0, 𝑝𝐺1) ≤ 𝜀. We may assume that the support points of 𝐺1 are at least 2𝜀²-separated. If not, we take a maximal 2𝜀²-separated set in the support points of 𝐺1, and replace 𝐺1 by the discrete measure obtained by relocating the masses of 𝐺1 to the nearest points in the 2𝜀²-net. Then ℎ(𝑝𝐺1, 𝑝𝐺′1) ≲ 𝜀², as seen in the proof of Proposition 1.6.
Now by Lemmas 1.11 and 1.12, if 𝐺1 = ∑_{𝑗=1}^𝑁 𝑝𝑗𝛿𝑧𝑗, with the support points 𝑧𝑗 at least 2𝜀²-separated,

{𝐺 ∶ max(𝐾, 𝐾2)(𝑝𝐺0, 𝑝𝐺) < 𝑑1𝜀²} ⊃ {𝐺 ∶ ℎ(𝑝𝐺0, 𝑝𝐺) ≤ 2𝜀}
⊃ {𝐺 ∶ ℎ(𝑝𝐺1, 𝑝𝐺) ≤ 𝜀}
⊃ {𝐺 ∶ ‖𝑝𝐺 − 𝑝𝐺1‖1 ≤ 𝑑2𝜀²}
⊃ {𝐺 ∶ ∑_{𝑗=1}^𝑁 |𝐺[𝑧𝑗 − 𝜀², 𝑧𝑗 + 𝜀²] − 𝑝𝑗| ≤ 𝜀²}.
Since the base measure 𝛼 has density bounded away from zero and infinity on [−𝑎, 𝑎] by assumption, by Lemma A.2 of [31] we have

log 𝛱(𝐺 ∶ ∑_{𝑗=1}^𝑁 |𝐺[𝑧𝑗 − 𝜀², 𝑧𝑗 + 𝜀²] − 𝑝𝑗| ≤ 𝜀²) ≳ −𝑁 log 𝜀^{−1}.

The proposition follows upon combining the preceding.
Lemma 1.11. If 𝐺′ = ∑_{𝑗=1}^𝑁 𝑝𝑗𝛿𝑧𝑗 is a probability measure supported on points 𝑧1, … , 𝑧𝑁 in ℝ with |𝑧𝑗 − 𝑧𝑘| > 2𝜀 for 𝑗 ≠ 𝑘, then for any probability measure 𝐺 on ℝ and kernel 𝑓 with derivative 𝑓^{(1)},

‖𝑝𝐺 − 𝑝𝐺′‖1 ≤ 2‖𝑓^{(1)}‖1 𝜀 + 2 ∑_{𝑗=1}^𝑁 |𝐺[𝑧𝑗 − 𝜀, 𝑧𝑗 + 𝜀] − 𝑝𝑗|.
Lemma 1.12. If 𝐺 and 𝐺′ are probability measures on [−𝑎, 𝑎], and 𝑓 is the Laplace kernel, then

ℎ²(𝑝𝐺, 𝑝𝐺′) ≲ ‖𝑝𝐺 − 𝑝𝐺′‖2, (1.9)
max(𝐾(𝑝𝐺, 𝑝𝐺′), 𝐾2(𝑝𝐺, 𝑝𝐺′)) ≲ ℎ²(𝑝𝐺, 𝑝𝐺′). (1.10)

Proofs. The first lemma is a generalization of Lemma 4 in [33] from normal to general kernels, and is proved in the same manner. We omit further details.
In view of the shape of the Laplace kernel, it is easy to see that for 𝐺 compactly supported on [−𝑎, 𝑎],

𝑓(|𝑥| + 𝑎) ≤ 𝑝𝐺(𝑥) ≤ 𝑓(|𝑥| − 𝑎).

We bound the squared Hellinger distance as follows:
ℎ²(𝑝𝐺, 𝑝𝐺′) ≤ ∫ (𝑝𝐺 − 𝑝𝐺′)²/(𝑝𝐺 + 𝑝𝐺′) 𝑑𝑥
≤ ∫_{|𝑥|≤𝐴} 𝑒^{𝐴+𝑎}(𝑝𝐺 − 𝑝𝐺′)² 𝑑𝑥 + ∫_{|𝑥|>𝐴} (𝑝𝐺 + 𝑝𝐺′) 𝑑𝑥
≲ 𝑒^𝑎 ‖𝑝𝐺 − 𝑝𝐺′‖²_2 𝑒^𝐴 + 𝑒^{−𝐴}.

By the elementary inequality 𝑡 + 𝑢/𝑡 ≥ 2√𝑢, for 𝑢, 𝑡 > 0, we obtain (1.9) upon choosing 𝐴 = min(𝑎, log ‖𝑝𝐺 − 𝑝𝐺′‖_2^{−1} − 𝑎/2).
For the proof of the second assertion we first note that, if both 𝐺 and 𝐺′ are compactly supported on [−𝑎, 𝑎],

𝑝𝐺(𝑥)/𝑝𝐺′(𝑥) ≤ 𝑓(|𝑥| − 𝑎)/𝑓(|𝑥| + 𝑎) ≤ 𝑒^{2𝑎}.

Therefore ‖𝑝𝐺/𝑝𝐺′‖∞ ≤ 𝑒^{2𝑎}, and (1.10) follows by Lemma 8 in [33].
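The sandwich bound 𝑓(|𝑥| + 𝑎) ≤ 𝑝𝐺(𝑥) ≤ 𝑓(|𝑥| − 𝑎), with 𝑓(𝑦) = 𝑒^{−𝑦}/2 read as a formula in 𝑦 (so the upper bound exceeds 1/2 for |𝑥| < 𝑎), underlies the proofs above and in Proposition 1.6. A quick numerical sanity check with an arbitrary random mixing measure:

```python
import numpy as np

rng = np.random.default_rng(2)
a = 1.0
z = rng.uniform(-a, a, 20)        # support of a random discrete G on [-a, a]
w = rng.dirichlet(np.ones(20))    # its weights

x = np.linspace(-10.0, 10.0, 2001)
# p_G = f * G for the Laplace kernel f(x) = exp(-|x|)/2
p_G = np.sum(w[:, None] * 0.5 * np.exp(-np.abs(x[None, :] - z[:, None])), axis=0)

lower = 0.5 * np.exp(-(np.abs(x) + a))
upper = 0.5 * np.exp(-(np.abs(x) - a))   # exceeds 1/2 for |x| < a
assert np.all(lower <= p_G + 1e-15)
assert np.all(p_G <= upper + 1e-15)
```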
1.7 proof of theorem 1.1
The basic theorem of [30] gives a posterior contraction rate in terms of a metric on densities that is bounded above by the Hellinger distance. In the present situation, for the proofs of Theorems 1.1 and 1.2, we would like to apply this result to a power smaller than one of the Wasserstein metric and a power smaller than one of the 𝐿𝑞-distance, respectively, both of which are not metrics.
Consider a general "discrepancy measure" 𝑑, which is a map 𝑑 ∶ 𝒫 × 𝒫 → ℝ on the product of the set of densities on a given measurable space and itself, which has the following properties, for some constant 𝐶 > 0:
(a) 𝑑(𝑥, 𝑦) ≥ 0;
(b) 𝑑(𝑥, 𝑦) = 0 if and only if 𝑥 = 𝑦;
(c) 𝑑(𝑥, 𝑦) = 𝑑(𝑦, 𝑥);
(d) 𝑑(𝑥, 𝑦) ≤ 𝐶(𝑑(𝑥, 𝑧) + 𝑑(𝑦, 𝑧)).
Thus 𝑑 is a metric except that the triangle inequality is replaced with a weaker condition that incorporates a constant 𝐶, possibly bigger than 1. Call a set of the form {𝑥 ∶ 𝑑(𝑥, 𝑦) < 𝑐} a 𝑑-ball, and define covering numbers 𝑁(𝜀, 𝒫, 𝑑) relative to 𝑑 as usual.

Let 𝛱𝑛(⋅ ∣ 𝑋1, … , 𝑋𝑛) be the posterior distribution of 𝑝 given an i.i.d. sample 𝑋1, … , 𝑋𝑛 from a density 𝑝 that is equipped with a prior probability distribution 𝛱.
Theorem 1.13. Suppose 𝑑 has the properties as given, satisfies 𝑑(𝑝0, 𝑝) ≤ ℎ(𝑝0, 𝑝) for every 𝑝 ∈ 𝒫, and the sets {𝑝 ∶ 𝑑(𝑝, 𝑝′) < 𝛿} are convex. Then 𝛱𝑛(𝑑(𝑝, 𝑝0) > 𝑀𝜀𝑛 ∣ 𝑋1, … , 𝑋𝑛) → 0 in 𝑃_0^𝑛-probability for any 𝜀𝑛 such that 𝑛𝜀²𝑛 → ∞ and such that, for positive constants 𝑐1, 𝑐2 and sets 𝒫𝑛 ⊂ 𝒫,

log 𝑁(𝜀𝑛, 𝒫𝑛, 𝑑) ≤ 𝑐1𝑛𝜀²𝑛, (1.11)
𝛱𝑛(𝑝 ∶ 𝐾(𝑝0, 𝑝) < 𝜀²𝑛, 𝐾2(𝑝0, 𝑝) < 𝜀²𝑛) ≥ 𝑒^{−𝑐2𝑛𝜀²𝑛}, (1.12)
𝛱𝑛(𝒫 − 𝒫𝑛) ≤ 𝑒^{−(𝑐2+4)𝑛𝜀²𝑛}. (1.13)

We defer the proof until Appendix A.
The proof of Theorem 1.1 is based on the following comparison between the Wasserstein and Hellinger metrics. The lemma improves and generalizes Theorem 2 in [61]. We choose the constant 𝐶𝑘 carefully, to make sure that the map 𝜀 ↦ 𝜀[log(𝐶𝑘/𝜀)]^{𝑘+1/2} is monotone on (0, 2].
Lemma 1.14. For probability measures 𝐺 and 𝐺′ supported on [−𝑎, 𝑎], and 𝑝𝐺 = 𝑓 ∗ 𝐺 for a probability density 𝑓 with inf_𝜆 (1 + |𝜆|^𝛽)| ̃𝑓(𝜆)| > 0, and any 𝑘 ≥ 1,

𝑊𝑘(𝐺, 𝐺′) ≲ ℎ(𝑝𝐺, 𝑝𝐺′)^{1/(𝑘+𝛽)} (log(𝐶𝑘/ℎ(𝑝𝐺, 𝑝𝐺′)))^{(𝑘+1/2)/(𝑘+𝛽)}.
Proof. By Theorem 6.15 in [76] the Wasserstein distance 𝑊𝑘(𝐺, 𝐺′) is bounded above by a multiple of the 𝑘th root of ∫ |𝑥|^𝑘 𝑑|𝐺 − 𝐺′|(𝑥), where |𝐺 − 𝐺′| is the total variation measure of the difference 𝐺 − 𝐺′. We apply this to the convolutions of 𝐺 and 𝐺′ with the normal distribution 𝛷𝛿 with mean 0 and variance 𝛿², to find, for every 𝑀 > 0,

𝑊𝑘(𝐺 ∗ 𝛷𝛿, 𝐺′ ∗ 𝛷𝛿)^𝑘 ≲ ∫ |𝑥|^𝑘 |(𝐺 − 𝐺′) ∗ 𝜙𝛿(𝑥)| 𝑑𝑥
≤ (∫_{−𝑀}^𝑀 𝑥^{2𝑘} 𝑑𝑥 ∫_{−𝑀}^𝑀 |(𝐺 − 𝐺′) ∗ 𝜙𝛿(𝑥)|² 𝑑𝑥)^{1/2} + 𝑒^{−𝑀} ∫_{|𝑥|>𝑀} |𝑥|^𝑘 𝑒^{|𝑥|} |(𝐺 − 𝐺′) ∗ 𝜙𝛿(𝑥)| 𝑑𝑥
≲ 𝑀^{𝑘+1/2} ‖(𝐺 − 𝐺′) ∗ 𝜙𝛿‖2 + 𝑒^{−𝑀} 𝑒^{2|𝑎|} 𝔼𝑒^{2|𝛿𝑍|},

where 𝑍 is a standard normal variable. The number 𝐾𝛿 ∶= 𝑒^{2|𝑎|} 𝔼𝑒^{2|𝛿𝑍|} is uniformly bounded if 𝛿 ≤ 𝛿𝑘, for some fixed 𝛿𝑘. By Plancherel's theorem,
‖(𝐺 − 𝐺′) ∗ 𝜙𝛿‖²_2 = ∫ | ̃𝐺 − ̃𝐺′|²(𝜆) ̃𝜙²𝛿(𝜆) 𝑑𝜆
= ∫ | ̃𝑓( ̃𝐺 − ̃𝐺′)|²(𝜆) ( ̃𝜙²𝛿/| ̃𝑓|²)(𝜆) 𝑑𝜆
≲ ‖𝑝𝐺 − 𝑝𝐺′‖²_2 sup_𝜆 ( ̃𝜙²𝛿/| ̃𝑓|²)(𝜆) ≲ ℎ²(𝑝𝐺, 𝑝𝐺′) 𝛿^{−2𝛽},

where we have again applied Plancherel's theorem, used that the 𝐿2-metric on uniformly bounded densities is bounded by the Hellinger distance, and used the assumption on the Fourier transform of 𝑓, which shows that ( ̃𝜙𝛿/| ̃𝑓|)(𝜆) ≲ (1 + |𝜆|^𝛽)𝑒^{−𝛿²𝜆²/2} ≲ 𝛿^{−𝛽}.
If 𝑈 ∼ 𝐺 is independent of 𝑍 ∼ 𝑁(0, 1), then (𝑈, 𝑈 + 𝛿𝑍) gives a coupling of 𝐺 and 𝐺 ∗ 𝛷𝛿. Therefore the definition of the Wasserstein metric gives that 𝑊𝑘(𝐺, 𝐺 ∗ 𝛷𝛿)^𝑘 ≤ 𝔼|𝛿𝑍|^𝑘 ≲ 𝛿^𝑘.
Combining the preceding inequalities with the triangle inequality we see that, for 𝛿 ∈ (0, 𝛿𝑘] and any 𝑀 > 0,

𝑊𝑘(𝐺, 𝐺′)^𝑘 ≲ 𝑀^{𝑘+1/2} ℎ(𝑝𝐺, 𝑝𝐺′) 𝛿^{−𝛽} + 𝑒^{−𝑀} + 𝛿^𝑘.

The lemma follows by optimizing this over 𝑀 and 𝛿. Specifically, for 𝜀 = ℎ(𝑝𝐺, 𝑝𝐺′), the choices 𝑀 = (𝑘/(𝑘 + 𝛽)) log(𝐶𝑘/𝜀) and 𝛿 = (𝑀^{𝑘+1/2}𝜀)^{1/(𝑘+𝛽)} are eligible for

𝛿𝑘 = sup_{𝜀∈(0,2]} [(𝑘/(𝑘 + 𝛽)) log(𝐶𝑘/𝜀)]^{(𝑘+1/2)/(𝑘+𝛽)} 𝜀^{1/(𝑘+𝛽)},

which is indeed a finite number. In fact the supremum is taken at 𝜀 = 2, by the assumption on 𝐶𝑘.
For the Laplace kernel 𝑓 we choose 𝛽 = 2 in the preceding lemma, and then obtain that 𝑑(𝑝𝐺, 𝑝𝐺′) ≤ ℎ(𝑝𝐺, 𝑝𝐺′), for the discrepancy 𝑑 = 𝛾^{−1}(𝑊𝑘), where 𝛾(𝜀) = 𝐷𝑘𝜀^{1/(𝑘+𝛽)}[log(𝐶𝑘/𝜀)]^{(𝑘+1/2)/(𝑘+𝛽)} is a multiple of the (monotone) transformation on the right side of the preceding lemma. For small values of 𝑊𝑘(𝐺1, 𝐺2) we have

𝑑(𝑝𝐺1, 𝑝𝐺2) ≍ 𝑊𝑘^{𝑘+2}(𝐺1, 𝐺2) (log(1/𝑊𝑘(𝐺1, 𝐺2)))^{−𝑘−1/2}. (1.14)

As 𝑘 + 2 > 1 the discrepancy 𝑑 may not satisfy the triangle inequality, but it does possess the properties (a)-(d) listed in the paragraphs before Theorem 1.13. The balls of the discrepancy 𝑑 are convex, as the Wasserstein metrics are convex (see [76]).
It follows that Theorem 1.13 applies to obtain a rate of posterior contraction relative to 𝑑, and hence relative to

𝑊𝑘 ∼ 𝑑^{1/(𝑘+2)} (log(1/𝑑))^{(𝑘+1/2)/(𝑘+2)}.
We apply the theorem with 𝒫 = 𝒫𝑛 equal to the set of mixtures 𝑝𝐺 = 𝑓 ∗ 𝐺, as 𝐺 ranges over ℳ[−𝑎, 𝑎]. Thus (1.13) is trivially satisfied.
For the entropy condition (1.11), by Proposition 1.7 we have

log 𝑁(𝜀, 𝒫𝑛, 𝑑) = log 𝑁(𝜀^{1/(𝑘+2)}(log(1/𝜀))^{(𝑘+1/2)/(𝑘+2)}, ℳ[−𝑎, 𝑎], 𝑊𝑘) ≲ (1/𝜀)^{1/(𝑘+2)} (log(1/𝜀))^{1+(𝑘+1/2)/(𝑘+2)}.

Thus (1.11) holds for the rate 𝜀𝑛 ≳ 𝑛^{−𝛾}, for every 𝛾 < (𝑘 + 2)/(2𝑘 + 5).
In view of Proposition 1.10, the prior mass condition (1.12) is satisfied with the rate 𝜀𝑛 ≍ (log 𝑛/𝑛)^{3/8}.

Theorem 1.13 yields a rate of contraction relative to 𝑑 equal to the slower of the two rates, which is (log 𝑛/𝑛)^{3/8}. This translates into the rate for the Wasserstein distance as given in Theorem 1.1.
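The translation from the 𝑑-rate (log 𝑛/𝑛)^{3/8} to the 𝑊𝑘-rate goes through the (1/(𝑘+2))-th power, giving the polynomial exponent 3/(8(𝑘+2)) = 3/(8𝑘+16) of Theorem 1.1. A trivial arithmetic check of this bookkeeping:

```python
from fractions import Fraction

def wasserstein_exponent(k):
    # (3/8) * 1/(k+2): the polynomial part of the W_k rate in Theorem 1.1
    return Fraction(3, 8) * Fraction(1, k + 2)

assert wasserstein_exponent(1) == Fraction(1, 8)   # W_1: nearly n^(-1/8)
assert wasserstein_exponent(2) == Fraction(3, 32)
# the rate deteriorates monotonically with k
assert all(wasserstein_exponent(k) > wasserstein_exponent(k + 1) for k in range(1, 10))
```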
1.8 proof of theorem 1.2
We apply Theorem 1.13, with 𝒫 = 𝒫𝑛the set of all mixtures𝑝𝐺as 𝐺 ranges over ℳ[−𝑎, 𝑎]. For 𝑑 = ℎ the rate follows immediately by combining Propositions 1.6 and 1.10.
Since the densities𝑝𝐺are uniformly bounded by1/2, the 𝐿𝑞dis- tance‖𝑝𝐺− 𝑝𝐺′‖𝑞is bounded above by a multiple ofℎ(𝑝𝐺, 𝑝𝐺′)2/𝑞. We can therefore apply Theorem 1.13 with the discrepancy𝑑(𝑝, 𝑝′) =
1.9 normal mixtures
‖𝑝 − 𝑝′‖𝑞/2𝑞 . In view of Proposition 1.5
log 𝑁(𝜀, 𝒫𝑛, 𝑑) ≲ 𝜀−2/(𝑞+1)log 𝜀−1.
Therefore setting𝜀𝑛 ≍ (log 𝑛/𝑛)(𝑞+1)/(2𝑞+4)fulfills the entropy condi- tion (1.11). By Proposition 1.10 the prior mass condition is satisfied for 𝜀𝑛≍ (log 𝑛/𝑛)3/8. By Theorem 1.13 the rate of contraction relative to 𝑑 is the slower of these two rates, which is the first. The rate relative to the𝐿𝑞-norm is the(2/𝑞)-th power of this rate.
1.9 normal mixtures
We reproduce the results on normal mixtures from [31], but in the 𝐿2-norm. The normal kernel is supersmooth with 𝛽 = 2, so by the Approximation Lemma 1.3, for any measure 𝐺1 compactly supported on [−𝑎, 𝑎] we can always find a discrete measure 𝐺2 with a number of support points of order 𝑁 ≍ log 𝜀^{−1} such that ‖𝑝𝐺1 − 𝑝𝐺2‖2 ≤ 𝜀. By Lemma 4 in [33], we establish

ℎ²(𝑝𝐺1, 𝑝𝐺2) ≲ ‖𝑝𝐺1 − 𝑝𝐺2‖2.
Following the same procedure as before, assuming 𝐺0 is the true measure, we obtain for the prior mass condition

log 𝛱(𝐺 ∶ max(𝑃𝐺0 log(𝑝𝐺0/𝑝𝐺), 𝑃𝐺0 (log(𝑝𝐺0/𝑝𝐺))²) ≤ 𝜀²) ≳ −(log 𝜀^{−1})².

Thus we obtain 𝜀𝑛 = log 𝑛/√𝑛.
By the proof of Proposition 1.5, we have the following estimate for the entropy condition:

log 𝑁(𝜀, 𝒢𝑎, ‖⋅‖2) ≲ (log 𝜀^{−1})².

This coincides with the estimate for the prior mass condition, and thus we obtain the rate 𝜀𝑛 = log 𝑛/√𝑛 with respect to the 𝐿2-norm. This is the same rate as obtained in [31], but now in the 𝐿2-norm. However, we lose a √(log 𝑛) factor compared to [77], whose rate is √(log 𝑛/𝑛).