The handle http://hdl.handle.net/1887/49012 holds various files of this Leiden University dissertation.
Author: Gao, F.
Title: Bayes and networks
Issue Date: 2017-05-23
Part I
NONPARAMETRIC BAYESIAN DIRICHLET-LAPLACE DECONVOLUTION
1 POSTERIOR CONTRACTION RATES FOR DECONVOLUTION OF DIRICHLET-LAPLACE MIXTURES
1.1 introduction
Consider statistical inference using the following nonparametric hierarchical Bayesian model for observations 𝑋1, … , 𝑋𝑛:

(i) A probability distribution 𝐺 on ℝ is generated from the Dirichlet process prior DP(𝛼) with base measure 𝛼.

(ii) An iid sample 𝑍1, … , 𝑍𝑛 is generated from 𝐺.

(iii) An iid sample 𝑒1, … , 𝑒𝑛 is generated from a known density 𝑓, independent of the other samples.

(iv) The observations are 𝑋𝑖 = 𝑍𝑖 + 𝑒𝑖, for 𝑖 = 1, … , 𝑛.
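The sampling scheme (i)-(iv) is straightforward to simulate. The following minimal sketch (Python with NumPy; the truncation level and the concrete uniform base measure on [−𝑎, 𝑎] are illustrative choices, not part of the model itself) draws 𝐺 from a truncated stick-breaking representation of the Dirichlet process and then generates the observations:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dp(total_mass, base_sampler, trunc=500):
    """Approximate draw G ~ DP(alpha) via stick-breaking, truncated at
    `trunc` atoms; returns atom locations and normalized weights."""
    v = rng.beta(1.0, total_mass, size=trunc)
    stick = np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    w = v * stick
    return base_sampler(trunc), w / w.sum()

n, a = 1000, 1.0
# (i) G ~ DP(alpha); base measure uniform on [-a, a] (a density bounded
#     away from 0 and infinity, as assumed later in Section 1.3)
atoms, weights = sample_dp(1.0, lambda m: rng.uniform(-a, a, m))
# (ii) Z_1, ..., Z_n iid from G
Z = rng.choice(atoms, size=n, p=weights)
# (iii) e_1, ..., e_n iid from the Laplace density f(x) = exp(-|x|)/2
e = rng.laplace(0.0, 1.0, size=n)
# (iv) observations X_i = Z_i + e_i
X = Z + e
```

Note that `numpy`'s Laplace sampler with scale 1 has exactly the density 𝑒^{−|𝑥|}/2 used throughout this chapter.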
In this setting, given 𝐺, the data 𝑋1, … , 𝑋𝑛 are a sample from the convolution

𝑝𝐺 = 𝑓 ∗ 𝐺

of the density 𝑓 and the measure 𝐺. The scheme defines a conditional distribution of 𝐺 given the data 𝑋1, … , 𝑋𝑛, the posterior distribution of 𝐺, and consequently also posterior distributions for quantities that derive from 𝐺, including the convolution density 𝑝𝐺. We are interested in whether this posterior distribution can recover a true mixing distribution 𝐺0 if the observations 𝑋1, … , 𝑋𝑛 are in reality a sample from the mixed distribution 𝑝𝐺0, for some given probability distribution 𝐺0.
The main contribution of this chapter is for the case that 𝑓 is the Laplace density 𝑓(𝑥) = 𝑒^{−|𝑥|}/2. For distributions on the full line, Laplace mixtures seem the second most popular class next to mixtures of the normal distribution, with applications in for instance speech recognition or astronomy ([42]) and clustering problems in genetics ([7]). For the present theoretical investigation the Laplace kernel is interesting as a test case of a non-supersmooth kernel.
We consider two notions of recovery. The first notion measures the distance between the posterior of 𝐺 and 𝐺0 through the Wasserstein metric

𝑊𝑘(𝐺, 𝐺′) = inf_{𝛾∈𝛤(𝐺,𝐺′)} (∫ |𝑥 − 𝑦|^𝑘 𝑑𝛾(𝑥, 𝑦))^{1/𝑘},

where 𝛤(𝐺, 𝐺′) is the collection of all couplings 𝛾 of 𝐺 and 𝐺′ into a bivariate measure with marginals 𝐺 and 𝐺′ (i.e. if (𝑥, 𝑦) ∼ 𝛾, then 𝑥 ∼ 𝐺 and 𝑦 ∼ 𝐺′), and 𝑘 ≥ 1. The Wasserstein metric is a classical metric on probability distributions, which is well suited for obtaining rates of estimation of measures. It is weaker than the total variation distance (which is more natural as a distance on densities), can be interpreted through transportation of measure (see [76]), and has also been used in applications such as comparing the color histograms of digital images. Recovery of the posterior distribution relative to the Wasserstein metric was considered by [61], within a general mixing framework. We refer to this paper for further motivation of the Wasserstein metric for mixtures, and to [76] for general background on the Wasserstein metric. In this chapter we improve the upper bound on posterior contraction rates given in [61], at least in the case of Laplace mixtures, obtaining a rate of nearly 𝑛^{−1/8} for 𝑊1 (and slower rates for 𝑘 > 1). Apparently the minimax rate of contraction for Laplace mixtures relative to the Wasserstein metric is currently unknown. Recent work on recovery of a mixing distribution by non-Bayesian methods is given in [80]. It is not clear from our result whether the upper bound 𝑛^{−1/8} is sharp.
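On the real line, 𝑊1 can be computed from the quantile functions of the two measures; SciPy's `scipy.stats.wasserstein_distance` implements exactly this. A small sketch (the two discrete measures below are arbitrary illustrations):

```python
import numpy as np
from scipy.stats import wasserstein_distance

# two discrete mixing distributions on [-1, 1]
xs = np.array([-0.5, 0.0, 0.5]); ws = np.array([0.2, 0.5, 0.3])
ys = np.array([-0.4, 0.1, 0.6]); vs = np.array([0.3, 0.4, 0.3])

w1 = wasserstein_distance(xs, ys, ws, vs)

# W_1 is nonnegative, vanishes iff the measures coincide, and cannot
# exceed the diameter of the common support interval.
assert w1 >= 0.0
assert w1 <= 2.0
assert wasserstein_distance(xs, xs, ws, ws) < 1e-12
```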
The second notion of recovery measures the distance of the posterior of 𝐺 to 𝐺0 indirectly through the Hellinger or 𝐿𝑞-distances between the mixed densities 𝑝𝐺 and 𝑝𝐺0. This is equivalent to studying the estimation of the true density 𝑝𝐺0 of the observations through the density 𝑝𝐺 under the posterior distribution. As the Laplace kernel 𝑓 has Fourier transform

̃𝑓(𝜆) = 1/(1 + 𝜆²),

it follows that the mixed densities 𝑝𝐺 have Fourier transforms satisfying

| ̃𝑝𝐺(𝜆)| ≤ 1/(1 + 𝜆²).
Estimation of a density with a polynomially decaying Fourier transform was first considered in [77]. According to their Theorem in Section 3A, a suitable kernel estimator possesses a root mean square error of 𝑛^{−3/8} with respect to the 𝐿2-norm for estimating a density with Fourier transform that decays exactly at the order 2. This rate is the usual rate 𝑛^{−𝛼/(2𝛼+1)} of nonparametric estimation for smoothness 𝛼 = 3/2. This is understandable, as | ̃𝑝(𝜆)| ≲ 1/(1 + |𝜆|²) implies that ∫ (1 + |𝜆|²)^𝛼 | ̃𝑝(𝜆)|² 𝑑𝜆 < ∞, for every 𝛼 < 3/2, so that a density with Fourier transform decaying at square rate belongs to any Sobolev class of regularity 𝛼 < 3/2. Indeed, in [34] the rate 𝑛^{−𝛼/(2𝛼+1)} is shown to be minimax for estimating a density in a Sobolev ball of functions on the line. In this chapter we show that the posterior distribution of Laplace mixtures 𝑝𝐺 contracts to 𝑝𝐺0 at the rate 𝑛^{−3/8} up to a logarithmic factor, relative to the 𝐿2-norm and Hellinger distance, and we also establish rates for other 𝐿𝑞-metrics. Thus the Dirichlet posterior (nearly) attains the minimax rate for estimating a density in a Sobolev ball of order 3/2. It may be noted that the Laplace density itself is Hölder of exactly order 1, which implies that Laplace mixtures are Hölder smooth of at least the same order. This insight would suggest a rate 𝑛^{−1/3} (the usual nonparametric rate for 𝛼 = 1), which is slower than 𝑛^{−3/8}; hence this insight is misleading.
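The Sobolev claim above can be checked numerically: with the envelope | ̃𝑝(𝜆)| = (1 + 𝜆²)^{−1}, the integral ∫ (1 + |𝜆|²)^𝛼 | ̃𝑝(𝜆)|² 𝑑𝜆 stabilises as the integration range grows when 𝛼 < 3/2 and keeps growing when 𝛼 > 3/2. A rough sketch (grid sizes and cutoffs are arbitrary choices):

```python
import numpy as np

def sobolev_mass(alpha, cutoff, n=1_000_000):
    """Riemann-sum approximation of the integral of (1+lam^2)^alpha
    * |p_tilde(lam)|^2 over [-cutoff, cutoff], for the envelope
    |p_tilde(lam)| = 1/(1+lam^2)."""
    lam = np.linspace(0.0, cutoff, n)
    dlam = lam[1] - lam[0]
    return 2.0 * np.sum((1.0 + lam**2) ** (alpha - 2.0)) * dlam

# alpha = 1 < 3/2: the integral has essentially converged by cutoff 100
ratio_conv = sobolev_mass(1.0, 1e4) / sobolev_mass(1.0, 1e2)
# alpha = 1.6 > 3/2: the tail ~ |lam|^(2*alpha-4) is not integrable
ratio_div = sobolev_mass(1.6, 1e4) / sobolev_mass(1.6, 1e2)

assert ratio_conv < 1.01
assert ratio_div > 2.0
```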
Besides recovery relative to the Wasserstein metric and the induced metrics on 𝑝𝐺, one might consider recovery relative to a metric on the distribution function of 𝐺. Frequentist recovery rates for this problem were obtained in [27] under some restrictions. There is no simple relation between these rates and rates for the other metrics. The same is true for the rates for deconvolution of densities, as in [27]. In fact, the Dirichlet prior and posterior considered here are well known to concentrate on discrete distributions, and hence are useless as priors for recovering a density of 𝐺.
Contraction rates for Dirichlet mixtures of the normal kernel were considered in [31, 33, 44, 71, 72]. The results in these papers are driven by the smoothness of the Gaussian kernel, whence the same approach will fail for the Laplace kernel. Nevertheless we borrow the idea of approximating the true mixed density by a finite mixture, albeit that the approximation is constructed in a different manner. Because more support points than in the Gaussian case are needed to obtain a given quality of approximation, higher entropy and lower prior mass concentration result, leading to a slower rate of posterior contraction. To obtain the contraction rate for the Wasserstein metrics we further derive a relationship of these metrics with a power of the Hellinger distance, and next apply a variant of the contraction theorem in [30], whose proof is included in the appendix of the dissertation. Contraction rates of mixtures with priors other than the Dirichlet were considered in [71]. Recovery of the mixing distribution is a deconvolution problem and as such can be considered an inverse problem. A general approach to posterior contraction rates in inverse problems can be found in [41], and results specific to deconvolution can be found in [24]. These authors are interested in deconvolving a (smooth) mixing density rather than a mixing distribution, and hence their results are not directly comparable to the results in this dissertation.
The papers [28, 49] consider recovery of a mixing density relative to the 𝐿𝑝-norm in the frequentist setting. If the smoothness of the mixing density degenerates to 0, then the minimax rate decreases to a constant and it is not possible to find a consistent estimator. In this chapter we show that in the same problem, but viewed as a deconvolution problem on distributions endowed with the weaker Wasserstein distance, we may obtain polynomial rates for the mixing distribution without any smoothness assumption on the distribution. In particular, for any mixing distribution it is possible to construct a consistent estimator.
The chapter is organized as follows. In the next section we give notation and preliminaries. We state in Section 1.3 the main results of the chapter, which are proved in the subsequent sections. In Section 1.4 we establish suitable finite approximations relative to the 𝐿𝑞- and Hellinger distances. The 𝐿𝑞-approximations also apply to kernels other than the Laplace kernel, and are in terms of the tail decay of the kernel's characteristic function. In Sections 1.5 and 1.6 we apply these approximations to obtain bounds on the entropy of the mixtures relative to the 𝐿𝑞, Hellinger and Wasserstein metrics, and a lower bound on the prior mass in a neighbourhood of the true density. Sections 1.7 and 1.8 contain the proofs of the main results.
1.2 notation and preliminaries
Throughout the chapter integrals given without limits are considered to be integrals over the real line ℝ. The 𝐿𝑞-norm is denoted by

‖𝑔‖𝑞 = (∫ |𝑔(𝑥)|^𝑞 𝑑𝑥)^{1/𝑞},
with ‖⋅‖∞ being the uniform norm. The Hellinger distance on the space of densities is given by

ℎ(𝑓, 𝑔) = (∫ (𝑓^{1/2}(𝑥) − 𝑔^{1/2}(𝑥))² 𝑑𝑥)^{1/2}.

It is easy to see that ℎ²(𝑓, 𝑔) ≤ ‖𝑓 − 𝑔‖1 ≤ 2ℎ(𝑓, 𝑔), for any two probability densities 𝑓 and 𝑔. Furthermore, if the densities 𝑓 and 𝑔 are uniformly bounded by a constant 𝑀, then ‖𝑓 − 𝑔‖2 ≤ 2√𝑀 ℎ(𝑓, 𝑔).
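These inequalities are easy to verify numerically. A sketch for two (arbitrarily chosen) Laplace densities, using 𝑀 = 1/2 for the last bound:

```python
import numpy as np

x = np.linspace(-30.0, 30.0, 600_001)
dx = x[1] - x[0]
f = 0.5 * np.exp(-np.abs(x))         # Laplace density
g = 0.5 * np.exp(-np.abs(x - 1.0))   # a shifted copy

l1 = np.sum(np.abs(f - g)) * dx
l2 = np.sqrt(np.sum((f - g) ** 2) * dx)
h = np.sqrt(np.sum((np.sqrt(f) - np.sqrt(g)) ** 2) * dx)

assert h**2 <= l1 + 1e-8                     # h^2(f,g) <= ||f - g||_1
assert l1 <= 2.0 * h + 1e-8                  # ||f - g||_1 <= 2 h(f,g)
assert l2 <= 2.0 * np.sqrt(0.5) * h + 1e-8   # both densities bounded by M = 1/2
```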
The Kullback-Leibler discrepancy and corresponding variance are denoted by

𝐾(𝑝0, 𝑝) = ∫ log(𝑝0/𝑝) 𝑑𝑃0,  𝐾2(𝑝0, 𝑝) = ∫ (log(𝑝0/𝑝))² 𝑑𝑃0,
with 𝑃0 the measure corresponding to the density 𝑝0.
We are primarily interested in the Laplace kernel, but a number of results are true for general kernels 𝑓. The Fourier transform of a function 𝑓 and the inverse Fourier transform of a function ̃𝑓 are given by
̃𝑓(𝜆) = ∫ 𝑒^{𝚤𝜆𝑥} 𝑓(𝑥) 𝑑𝑥,  𝑓(𝑥) = (1/2𝜋) ∫ 𝑒^{−𝚤𝜆𝑥} ̃𝑓(𝜆) 𝑑𝜆.

For 1/𝑝 + 1/𝑞 = 1 and 1 ≤ 𝑝 ≤ 2, the Hausdorff-Young inequality gives

‖𝑓‖𝑞 ≤ (2𝜋)^{−1/𝑝} ‖ ̃𝑓‖𝑝.
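For the Laplace kernel 𝑓(𝑥) = 𝑒^{−|𝑥|}/2, with ̃𝑓(𝜆) = 1/(1 + 𝜆²), this inequality can be checked numerically; for 𝑝 = 𝑞 = 2 it holds with equality, by Plancherel's theorem. A sketch (truncation ranges and grids are arbitrary):

```python
import numpy as np

x = np.linspace(-40.0, 40.0, 400_001)
lam = np.linspace(-200.0, 200.0, 400_001)
dx, dlam = x[1] - x[0], lam[1] - lam[0]
f = 0.5 * np.exp(-np.abs(x))     # Laplace kernel
f_hat = 1.0 / (1.0 + lam**2)     # its Fourier transform

for p in (1.5, 2.0):
    q = p / (p - 1.0)            # conjugate exponent
    lhs = (np.sum(f**q) * dx) ** (1.0 / q)
    rhs = (2.0 * np.pi) ** (-1.0 / p) * (np.sum(f_hat**p) * dlam) ** (1.0 / p)
    # Hausdorff-Young: ||f||_q <= (2 pi)^(-1/p) ||f_hat||_p
    assert lhs <= rhs + 1e-6
```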
The covering number 𝑁(𝜀, 𝛩, 𝜌) of a metric space (𝛩, 𝜌) is the minimum number of 𝜀-balls needed to cover the entire space 𝛩.
Throughout the chapter ≲ denotes inequality up to a constant multiple, where the constant is universal or fixed within the context. Furthermore 𝑎𝑛 ≍ 𝑏𝑛 means that for some positive constants 𝑐 and 𝐶

𝑐 ≤ lim inf_{𝑛→∞} 𝑎𝑛/𝑏𝑛 ≤ lim sup_{𝑛→∞} 𝑎𝑛/𝑏𝑛 ≤ 𝐶.

We denote by ℳ[−𝑎, 𝑎] the set of all probability measures on a given interval [−𝑎, 𝑎].
1.3 main results
Write 𝛱𝑛(⋅ ∣ 𝑋1, … , 𝑋𝑛) for the posterior distribution of 𝐺 in the scheme (i)-(iv) introduced at the beginning of the chapter. We study this random distribution assuming that 𝑋1, … , 𝑋𝑛 are an iid sample from the mixture density 𝑝𝐺0 = 𝑓 ∗ 𝐺0, for a given probability distribution 𝐺0. We assume that 𝐺0 is supported in a compact interval [−𝑎, 𝑎], and that the base measure 𝛼 of the Dirichlet prior in (i) is concentrated on this interval with a Lebesgue density bounded away from 0 and ∞.
Theorem 1.1. If 𝐺0 is supported on [−𝑎, 𝑎], 𝑓 is the Laplace kernel, and 𝛼 has support [−𝑎, 𝑎] with Lebesgue density bounded away from 0 and ∞, then for every 𝑘 ≥ 1 there exists a constant 𝑀 such that

𝛱𝑛(𝐺 ∶ 𝑊𝑘(𝐺, 𝐺0) ≥ 𝑀 𝑛^{−3/(8𝑘+16)} (log 𝑛)^{(𝑘+7/8)/(𝑘+2)} ∣ 𝑋1, … , 𝑋𝑛) → 0, (1.1)

in 𝑃𝐺0-probability.
The rate for the Wasserstein metric 𝑊𝑘 given in the theorem deteriorates with increasing 𝑘, which is perhaps not unreasonable as the Wasserstein metrics increase with 𝑘. The fastest rate is obtained for 𝑊1, at 𝑛^{−1/8}(log 𝑛)^{5/8}.
Theorem 1.2. If 𝐺0 is supported on [−𝑎, 𝑎], 𝑓 is the Laplace kernel, and 𝛼 has support [−𝑎, 𝑎] with Lebesgue density bounded away from 0 and ∞, then there exists a constant 𝑀 such that

𝛱𝑛(𝐺 ∶ ℎ(𝑝𝐺, 𝑝𝐺0) ≥ 𝑀 (log 𝑛/𝑛)^{3/8} ∣ 𝑋1, … , 𝑋𝑛) → 0, (1.2)

in 𝑃𝐺0-probability. Furthermore, for every 𝑞 ∈ [2, ∞) there exists 𝑀𝑞 such that

𝛱𝑛(𝐺 ∶ ‖𝑝𝐺 − 𝑝𝐺0‖𝑞 ≥ 𝑀𝑞 (log 𝑛/𝑛)^{(𝑞+1)/(𝑞(𝑞+2))} ∣ 𝑋1, … , 𝑋𝑛) → 0, (1.3)

in 𝑃𝐺0-probability.
The rate for the 𝐿𝑞-distance given in (1.3) deteriorates with increasing 𝑞. For 𝑞 = 2 it is the same as the rate (log 𝑛/𝑛)^{3/8} for the Hellinger distance.
In both theorems the mixing distributions are assumed to be supported on a fixed compact set. Without a restriction on the tails of the mixing distributions, no rate is possible. The assumption of a compact support ensures that the rate is fully determined by the complexity of the mixtures, and not by their tail behaviour.
1.4 finite approximation
In this section we show that a general mixture 𝑝𝐺 can be approximated by a mixture with finitely many components, where the number of components depends on the accuracy of the approximation, the distance used, and the kernel 𝑓. We first consider approximations with respect to the 𝐿𝑞-norm, which apply to mixtures 𝑝𝐺 = 𝑓 ∗ 𝐺 for a general kernel 𝑓, and next approximations with respect to the Hellinger distance for the case that 𝑓 is the Laplace kernel. The first result generalizes the result of [31] for normal mixtures. Also see [71] for results on Dirichlet mixtures of exponential power densities.
The result splits in two cases, depending on the tail behaviour of the Fourier transform ̃𝑓 of 𝑓:

- ordinary smooth 𝑓: lim sup_{|𝜆|→∞} | ̃𝑓(𝜆)| |𝜆|^𝛽 < ∞, for some 𝛽 > 1/2.

- supersmooth 𝑓: lim sup_{|𝜆|→∞} | ̃𝑓(𝜆)| 𝑒^{|𝜆|^𝛽} < ∞, for some 𝛽 > 0.
Lemma 1.3 (Approximation Lemma). Let 𝜀 < 1 be sufficiently small and fixed. For a probability measure 𝐺 on an interval [−𝑎, 𝑎] and 2 ≤ 𝑞 ≤ ∞, there exists a discrete measure 𝐺′ on [−𝑎, 𝑎] with at most 𝑁 support points in [−𝑎, 𝑎] such that

‖𝑝𝐺 − 𝑝𝐺′‖𝑞 ≲ 𝜀,

where

(i) 𝑁 ≲ 𝜀^{−1/(𝛽−1/𝑝)} if 𝑓 is ordinary smooth of order 𝛽 with 𝛽 > 1/𝑝, for 𝑝 and 𝑞 conjugate (1/𝑝 + 1/𝑞 = 1);

(ii) 𝑁 ≲ (log 𝜀^{−1})^{max(1,1/𝛽)} if 𝑓 is supersmooth of order 𝛽.
Proof. The Fourier transform of 𝑝𝐺 is given by ̃𝑓 ̃𝐺, where ̃𝐺 is the Fourier transform of 𝐺 defined by ̃𝐺(𝜆) = ∫ 𝑒^{𝚤𝜆𝑧} 𝑑𝐺(𝑧). Determine 𝐺′ so that it possesses the same moments as 𝐺 up to order 𝑘 − 1, i.e.

∫ 𝑧^𝑗 𝑑(𝐺 − 𝐺′)(𝑧) = 0,  ∀ 0 ≤ 𝑗 ≤ 𝑘 − 1.

By Lemma A.1 in [31], 𝐺′ can be chosen to have at most 𝑘 support points.
Then for 𝐺 and 𝐺′ supported on [−𝑎, 𝑎], we have

| ̃𝐺(𝜆) − ̃𝐺′(𝜆)| = |∫ (𝑒^{𝚤𝜆𝑧} − ∑_{𝑗=0}^{𝑘−1} (𝚤𝜆𝑧)^𝑗/𝑗!) 𝑑(𝐺 − 𝐺′)(𝑧)| ≤ ∫ (|𝚤𝜆𝑧|^𝑘/𝑘!) 𝑑(𝐺 + 𝐺′)(𝑧) ≤ (𝑎𝑒|𝜆|/𝑘)^𝑘.

The inequality comes from |𝑒^{𝚤𝑦} − ∑_{𝑗=0}^{𝑘−1} (𝚤𝑦)^𝑗/𝑗!| ≤ |𝑦|^𝑘/𝑘! ≤ (𝑒|𝑦|)^𝑘/𝑘^𝑘, for every 𝑦 ∈ ℝ.
Therefore, by the Hausdorff-Young inequality,

‖𝑝𝐺 − 𝑝𝐺′‖𝑞^𝑝 ≤ (1/2𝜋) ∫ | ̃𝑓(𝜆)|^𝑝 | ̃𝐺(𝜆) − ̃𝐺′(𝜆)|^𝑝 𝑑𝜆 ≲ ∫_{|𝜆|>𝑀} | ̃𝑓(𝜆)|^𝑝 𝑑𝜆 + ∫_{|𝜆|≤𝑀} (𝑒𝑎|𝜆|/𝑘)^{𝑝𝑘} 𝑑𝜆.

We denote the first term in the preceding display by 𝐼1 and the second term by 𝐼2. It is easy to bound 𝐼2 as

𝐼2 ≍ (𝑒𝑎/𝑘)^{𝑘𝑝} 𝑀^{𝑘𝑝+1}/(𝑘𝑝 + 1) ≲ (𝑒𝑎𝑀/𝑘)^{𝑘𝑝+1} (1/𝑝).

For 𝐼1 we separately consider the cases of ordinary smoothness and supersmoothness.
In the supersmooth case with parameter 𝛽, we note that the function 𝑡^{1/𝛽−1}/𝑒^{𝛿𝑡} is monotonically decreasing for 𝑡 ≥ 𝑝𝑀^𝛽, when 𝛿 ≥ (1/𝛽 − 1)/(𝑝𝑀^𝛽). Thus, for large 𝑀,

𝐼1 ≲ ∫_{|𝜆|>𝑀} 𝑒^{−𝑝|𝜆|^𝛽} 𝑑𝜆 = (2/(𝛽𝑝^{1/𝛽})) ∫_{𝑡>𝑝𝑀^𝛽} 𝑒^{−𝑡} 𝑡^{1/𝛽−1} 𝑑𝑡 ≤ (2/(𝛽𝑝^{1/𝛽})) (𝑝𝑀^𝛽)^{1/𝛽−1} 𝑒^{−𝛿𝑝𝑀^𝛽} ∫_{𝑡>𝑝𝑀^𝛽} 𝑒^{−(1−𝛿)𝑡} 𝑑𝑡 = (2/(1 − 𝛿)) (1/(𝛽𝑝)) 𝑒^{−𝑝𝑀^𝛽} 𝑀^{1−𝛽},

where the bound is sharper if 𝛿 is smaller. Choosing the minimal value of 𝛿, we obtain

𝐼1 ≲ (1/(1 − (1/𝛽 − 1)/(𝑝𝑀^𝛽))) (1/(𝛽𝑝)) 𝑒^{−𝑝𝑀^𝛽} 𝑀^{1−𝛽} ≲ 𝑀^{1−𝛽} 𝑒^{−𝑝𝑀^𝛽},

for 𝑀 sufficiently large. We next choose 𝑀 = 2(log(1/𝜀))^{1/𝛽} in order to ensure that 𝐼1 ≤ 𝜀^𝑝. Then 𝐼2 ≲ 𝜀^𝑝 if 𝑘 ≥ 2𝑒𝑎𝑀 and 2^{−𝑘𝑝} ≤ 𝜀^𝑝. This is satisfied if 𝑘 = 2(log 𝜀^{−1})^{max(1/𝛽,1)}.
In the ordinary smooth case with smoothness parameter 𝛽, we have the bound

𝐼1 ≲ ∫_{|𝜆|>𝑀} |𝜆|^{−𝛽𝑝} 𝑑𝜆 ≲ (1/𝑀)^{𝛽𝑝−1}.

We choose 𝑀 = (1/𝜀)^{1/(𝛽−1/𝑝)} to render the right side equal to 𝜀^𝑝. Then 𝐼2 ≲ 𝜀^𝑝 if 𝑘 = 2𝜀^{−1/(𝛽−1/𝑝)}.
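A concrete way to realize the moment matching of the proof, for a specific 𝐺, is Gaussian quadrature: the 𝑁-point Gauss rule for 𝐺 matches all moments up to order 2𝑁 − 1. The sketch below (𝐺 uniform on [−1, 1] and the Laplace kernel; all concrete sizes are illustrative choices) shows the 𝐿2 error of the finitely supported approximation dropping as 𝑁 grows:

```python
import numpy as np

def laplace_mixture(x, atoms, weights):
    """p_G = f * G for the Laplace kernel f and a discrete mixing measure."""
    return np.sum(
        weights[:, None] * 0.5 * np.exp(-np.abs(x[None, :] - atoms[:, None])),
        axis=0,
    )

x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

# G = uniform on [-1, 1], represented by a fine grid of equally weighted atoms
fine = np.linspace(-1.0, 1.0, 1001)
p_G = laplace_mixture(x, fine, np.full(fine.size, 1.0 / fine.size))

errors = []
for N in (1, 2, 4):
    # Gauss-Legendre nodes/weights: N atoms matching moments up to order 2N-1
    nodes, w = np.polynomial.legendre.leggauss(N)
    p_approx = laplace_mixture(x, nodes, w / w.sum())
    errors.append(np.sqrt(np.sum((p_G - p_approx) ** 2) * dx))

assert errors[0] > errors[1] > errors[2]   # more support points, smaller L2 error
```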
The number of support points in the preceding lemma is increasing in 𝑞 and decreasing in 𝛽. For approximation in the 𝐿2-norm (𝑞 = 2), the number of support points is of order 𝜀^{−1/(𝛽−1/2)}, and this reduces to 𝜀^{−2/3} for the Laplace kernel (ordinary smooth with 𝛽 = 2). An interpretation of the exponent 𝛽 − 1/2 is the (almost) Sobolev smoothness of 𝑝𝐺, since, for 𝛼 < 𝛽 − 1/2,

∫ (1 + |𝜆|²)^𝛼 | ̃𝑝𝐺(𝜆)|² 𝑑𝜆 ≲ ∫ (1 + |𝜆|²)^𝛼 | ̃𝑓(𝜆)|² 𝑑𝜆 < ∞.

We do not have a compelling intuition for this correspondence.
The Hellinger distance is more sensitive to areas where the densities are close to zero. As a consequence, the approach in the preceding lemma does not give sharp results. The following lemma does, but is restricted to the Laplace kernel.
Lemma 1.4. For a probability measure 𝐺 supported on [−𝑎, 𝑎] there exists a discrete measure 𝐺′ with at most 𝑁 ≍ 𝜀^{−2/3} support points such that, for 𝑝𝐺 = 𝑓 ∗ 𝐺 and 𝑓 the Laplace density,

ℎ(𝑝𝐺, 𝑝𝐺′) ≤ 𝜀.
Proof. Since 𝑝𝐺(𝑥) ≥ 𝑓(|𝑥| + 𝑎) = 𝑒^{−𝑎}𝑒^{−|𝑥|}/2, for every 𝑥 and every probability measure 𝐺 supported on [−𝑎, 𝑎], the Hellinger distance between Laplace mixtures satisfies

ℎ²(𝑝𝐺, 𝑝𝐺′) ≤ ∫ ((𝑝𝐺 − 𝑝𝐺′)²/(𝑝𝐺 + 𝑝𝐺′))(𝑥) 𝑑𝑥 ≤ 𝑒^𝑎 ∫ (𝑝𝐺′(𝑥) − 𝑝𝐺(𝑥))² 𝑒^{|𝑥|} 𝑑𝑥.

If we write 𝑞𝐺(𝑥) = 𝑝𝐺(𝑥)𝑒^{|𝑥|/2}, and ̃𝑞𝐺 for the corresponding Fourier transform, then by Plancherel's theorem the integral on the right side is equal to

(1/2𝜋) ∫ | ̃𝑞𝐺′ − ̃𝑞𝐺|²(𝜆) 𝑑𝜆.
By an explicit computation we obtain

̃𝑞𝐺(𝜆) = (1/2) ∫∫ 𝑒^{𝚤𝜆𝑥} 𝑒^{−|𝑥−𝑧|+|𝑥|/2} 𝑑𝑥 𝑑𝐺(𝑧) = (1/2) ∫ 𝑟(𝜆, 𝑧) 𝑑𝐺(𝑧),

where 𝑟(𝜆, 𝑧) is given by

𝑟(𝜆, 𝑧) = 𝑒^{−𝑧}/(𝚤𝜆 + 1/2) + 𝑒^{−𝑧}(𝑒^{(𝚤𝜆+3/2)𝑧} − 1)/(𝚤𝜆 + 3/2) − 𝑒^{(𝚤𝜆+1/2)𝑧}/(𝚤𝜆 − 1/2)
= 𝑒^{−𝑧}/((𝚤𝜆 + 1/2)(𝚤𝜆 + 3/2)) − 2𝑒^{𝚤𝜆𝑧}𝑒^{𝑧/2}/((𝚤𝜆 + 3/2)(𝚤𝜆 − 1/2)). (1.4)

Now let 𝐺′ be a discrete measure on [−𝑎, 𝑎] such that
∫ 𝑒^{−𝑧} 𝑑(𝐺′ − 𝐺)(𝑧) = 0,
∫ 𝑒^{𝑧/2} 𝑧^𝑗 𝑑(𝐺′ − 𝐺)(𝑧) = 0,  ∀ 0 ≤ 𝑗 ≤ 𝑘 − 1.

By Lemma A.1 in [31], 𝐺′ can be chosen to have at most 𝑘 + 1 support points.
By the choice of 𝐺′ the first term of 𝑟(𝜆, 𝑧) gives no contribution to the difference ∫ 𝑟(𝜆, 𝑧) 𝑑(𝐺′ − 𝐺)(𝑧). As the second term of 𝑟(𝜆, 𝑧) is for large |𝜆| bounded in absolute value by a multiple of |𝜆|^{−2}, it follows that

𝐼2 ∶= ∫_{|𝜆|>𝑀} |∫ 𝑟(𝜆, 𝑧) 𝑑(𝐺′ − 𝐺)(𝑧)|² 𝑑𝜆 ≲ ∫_{𝜆>𝑀} 𝜆^{−4} 𝑑𝜆 ≍ 𝑀^{−3}.
Again by the choice of 𝐺′, the integral ∫ 𝑟(𝜆, 𝑧) 𝑑(𝐺′ − 𝐺)(𝑧) remains the same if we replace 𝑒^{𝚤𝜆𝑧} by 𝑒^{𝚤𝜆𝑧} − ∑_{𝑗=0}^{𝑘} (𝚤𝜆𝑧)^𝑗/𝑗! in the second term of 𝑟(𝜆, 𝑧). It follows that

𝐼1 ∶= ∫_{|𝜆|≤𝑀} |∫ 𝑟(𝜆, 𝑧) 𝑑(𝐺′ − 𝐺)(𝑧)|² 𝑑𝜆
≤ ∫_{|𝜆|≤𝑀} |2/((𝚤𝜆 + 3/2)(𝚤𝜆 − 1/2))|² |∫ 𝑒^{𝑧/2}[𝑒^{𝚤𝜆𝑧} − ∑_{𝑗=0}^{𝑘} (𝚤𝜆𝑧)^𝑗/𝑗!] 𝑑(𝐺′ − 𝐺)(𝑧)|² 𝑑𝜆
≲ ∫_0^𝑀 ((𝑎𝜆)^{2𝑘}/(𝑘!)²) 𝑑𝜆 ≲ (𝑎𝑒𝑀)^{2𝑘+1}/𝑘^{2𝑘+1},

since |𝑧| ≤ 𝑎. It follows, by a similar argument as in the proof of Lemma 1.3, that we can reduce both 𝐼1 and 𝐼2 to 𝜀² by choosing 𝑀 ≍ 𝜀^{−2/3} and 𝑘 = 2𝑎𝑒𝑀.
1.5 entropy
We study the covering numbers of the class of mixtures 𝑝𝐺 = 𝑓 ∗ 𝐺, where 𝐺 ranges over the collection ℳ[−𝑎, 𝑎] of all probability measures on [−𝑎, 𝑎]. We present a bound for any 𝐿𝑟-norm and general kernels 𝑓, and a bound for the Hellinger distance that is specific to the Laplace kernel. Note that 𝑓^{(1)} is the first derivative of 𝑓.
Proposition 1.5. If both ‖𝑓‖𝑟 and ‖𝑓^{(1)}‖𝑟 are finite and ̃𝑓 has ordinary smoothness 𝛽, then, for 𝑝𝐺 = 𝑓 ∗ 𝐺, and any 𝑟 ≥ 2,

log 𝑁(𝜀, {𝑝𝐺 ∶ 𝐺 ∈ ℳ[−𝑎, 𝑎]}, ‖⋅‖𝑟) ≲ 𝜀^{−1/(𝛽−1+1/𝑟)} log 𝜀^{−1}. (1.5)

Proof. Write 𝒢𝑎 = {𝑝𝐺 ∶ 𝐺 ∈ ℳ[−𝑎, 𝑎]}. In light of Lemma 1.3, the set of all mixtures 𝑝𝐺 with 𝐺 a discrete probability measure with 𝑁 ≲ 𝜀^{−1/(𝛽−1+1/𝑟)} support points forms an 𝜀-net over 𝒢𝑎. It therefore suffices to construct an 𝜀-net of the given cardinality over this set of discrete mixtures.
By Jensen's inequality and Fubini's theorem,

‖𝑓(⋅ − 𝜃) − 𝑓‖𝑟 = (∫ |𝜃 ∫_0^1 𝑓^{(1)}(𝑥 − 𝜃𝑠) 𝑑𝑠|^𝑟 𝑑𝑥)^{1/𝑟} ≤ ‖𝑓^{(1)}‖𝑟 𝜃.
Furthermore, for any probability vectors 𝑝 and 𝑝′ and locations 𝜃𝑖,

‖∑_{𝑖=1}^𝑁 𝑝𝑖 𝑓(⋅ − 𝜃𝑖) − ∑_{𝑖=1}^𝑁 𝑝′𝑖 𝑓(⋅ − 𝜃𝑖)‖𝑟 ≤ ∑_{𝑖=1}^𝑁 |𝑝𝑖 − 𝑝′𝑖| ‖𝑓(⋅ − 𝜃𝑖)‖𝑟 = ‖𝑓‖𝑟 ‖𝑝 − 𝑝′‖1.

Combining these inequalities, we see that for two discrete probability measures 𝐺 = ∑_{𝑖=1}^𝑁 𝑝𝑖𝛿𝜃𝑖 and 𝐺′ = ∑_{𝑖=1}^𝑁 𝑝′𝑖𝛿𝜃′𝑖,

‖𝑝𝐺 − 𝑝𝐺′‖𝑟 ≤ ‖𝑓^{(1)}‖𝑟 max𝑖 |𝜃𝑖 − 𝜃′𝑖| + ‖𝑓‖𝑟 ‖𝑝 − 𝑝′‖1. (1.6)

Thus we can construct an 𝜀-net over the discrete mixtures by relocating the support points (𝜃𝑖)_{𝑖=1}^𝑁 to the nearest points (𝜃′𝑖)_{𝑖=1}^𝑁 in an 𝜀-net on [−𝑎, 𝑎], and relocating the weights 𝑝 to the nearest point 𝑝′ in an 𝜀-net for the 𝑙1-norm over the 𝑁-dimensional 𝑙1-unit simplex. This gives a set of at most
(2𝑎/𝜀)^𝑁 (5/𝜀)^𝑁 ∼ (10𝑎/𝜀²)^𝑁

measures 𝑝𝐺 (cf. Lemma A.4 of [33] for the entropy of the 𝑙1-unit simplex). This gives the bound of the proposition.
Proposition 1.6. For 𝑓 the Laplace kernel and 𝑝𝐺 = 𝑓 ∗ 𝐺,

log 𝑁(𝜀, {𝑝𝐺 ∶ 𝐺 ∈ ℳ[−𝑎, 𝑎]}, ℎ) ≲ 𝜀^{−2/3} log(𝜀^{−1}). (1.7)

Proof. Since the function √𝑓 is absolutely continuous with derivative 𝑥 ↦ −2^{−3/2}𝑒^{−|𝑥|/2} sgn(𝑥), we have by Jensen's inequality and Fubini's theorem that

ℎ²(𝑓, 𝑓(⋅ − 𝜃)) = ∫ (𝜃 ∫_0^1 −2^{−3/2}𝑒^{−|𝑥−𝜃𝑠|/2} sgn(𝑥 − 𝜃𝑠) 𝑑𝑠)² 𝑑𝑥 ≤ 𝜃² ∫_0^1 ∫ 𝑒^{−|𝑥−𝜃𝑠|} 𝑑𝑥 𝑑𝑠 = 2𝜃².

It follows that ℎ(𝑓, 𝑓(⋅ − 𝜃)) ≲ 𝜃.
By convexity of the map (𝑢, 𝑣) ↦ (√𝑢 − √𝑣)², we have

|√(∑𝑖 𝑝𝑖 𝑓(⋅ − 𝜃𝑖)) − √(∑𝑖 𝑝𝑖 𝑓(⋅ − 𝜃′𝑖))|² ≤ ∑𝑖 𝑝𝑖 [√𝑓(⋅ − 𝜃𝑖) − √𝑓(⋅ − 𝜃′𝑖)]².

By integrating this inequality we see that the densities 𝑝𝐺 and 𝑝𝐺′ with mixing distributions 𝐺 = ∑_{𝑖=1}^𝑁 𝑝𝑖𝛿𝜃𝑖 and 𝐺′ = ∑_{𝑖=1}^𝑁 𝑝𝑖𝛿𝜃′𝑖 satisfy

ℎ²(𝑝𝐺, 𝑝𝐺′) ≲ ∑𝑖 𝑝𝑖 |𝜃𝑖 − 𝜃′𝑖|² ≤ ‖𝜃 − 𝜃′‖²_∞.
For distributions 𝐺 = ∑_{𝑖=1}^𝑁 𝑝𝑖𝛿𝜃𝑖 and 𝐺′ = ∑_{𝑖=1}^𝑁 𝑝′𝑖𝛿𝜃𝑖 with the same support points, but different weights, we have

ℎ²(𝑝𝐺, 𝑝𝐺′) ≤ ∫ (∑_{𝑖=1}^𝑁 (𝑝𝑖 − 𝑝′𝑖)𝑓(𝑥 − 𝜃𝑖))² / (∑_{𝑖=1}^𝑁 (𝑝𝑖 + 𝑝′𝑖)𝑓(𝑥 − 𝜃𝑖)) 𝑑𝑥 ≤ ∫ (∑_{𝑖=1}^𝑁 |𝑝𝑖 − 𝑝′𝑖|)² 𝑓²(|𝑥| − 𝑎)/(2𝑓(|𝑥| + 𝑎)) 𝑑𝑥 ≲ ‖𝑝 − 𝑝′‖²_1.

Therefore the bound follows by arguments similar to those in the proof of Proposition 1.5, where presently we use Lemma 1.4 to determine suitable finite approximations.
The map 𝐺 ↦ 𝑝𝐺 = 𝑓 ∗ 𝐺 is one-to-one as soon as the characteristic function of 𝑓 is never zero. Under this condition we can also view the Wasserstein distance on the mixing distribution as a distance on the mixtures. Obviously the covering numbers are then free of the kernel.
Proposition 1.7. For any 𝑘 ≥ 1, and any sufficiently small 𝜀 > 0,

log 𝑁(𝜀, ℳ[−𝑎, 𝑎], 𝑊𝑘) ≲ 𝜀^{−1} log 𝜀^{−1}. (1.8)

The proposition is a consequence of Lemma 1.9, below, which applies to the set of all Borel probability measures on a general metric space (𝛩, 𝜌) (cf. [61]).
Lemma 1.8. For any probability measure 𝐺 concentrated on countably many disjoint sets 𝛩1, 𝛩2, … and probability measure 𝐺′ concentrated on disjoint sets 𝛩′1, 𝛩′2, …,

𝑊𝑘(𝐺, 𝐺′) ≤ sup𝑖 sup_{𝜃𝑖∈𝛩𝑖, 𝜃′𝑖∈𝛩′𝑖} 𝜌(𝜃𝑖, 𝜃′𝑖) + diam(𝛩) (∑𝑖 |𝐺(𝛩𝑖) − 𝐺′(𝛩′𝑖)|)^{1/𝑘}.
In particular,

𝑊𝑘(∑𝑖 𝑝𝑖𝛿𝜃𝑖, ∑𝑖 𝑝′𝑖𝛿𝜃′𝑖) ≤ max𝑖 𝜌(𝜃𝑖, 𝜃′𝑖) + diam(𝛩) ‖𝑝 − 𝑝′‖_1^{1/𝑘}.

Proof. For 𝑝𝑖 = 𝐺(𝛩𝑖) and 𝑝′𝑖 = 𝐺′(𝛩′𝑖) divide the interval [0, ∑𝑖 𝑝𝑖 ∧ 𝑝′𝑖] into disjoint intervals 𝐼𝑖 of lengths 𝑝𝑖 ∧ 𝑝′𝑖. We couple variables ̄𝜃 and ̄𝜃′ by an auxiliary uniform variable 𝑈. If 𝑈 ∈ 𝐼𝑖, then generate ̄𝜃 ∼ 𝐺(⋅|𝛩𝑖) and ̄𝜃′ ∼ 𝐺′(⋅|𝛩′𝑖). Divide the remaining interval [∑𝑖 𝑝𝑖 ∧ 𝑝′𝑖, 1] into intervals 𝐽𝑖 of lengths 𝑝𝑖 − 𝑝𝑖 ∧ 𝑝′𝑖 and, separately, intervals 𝐽′𝑖 of lengths 𝑝′𝑖 − 𝑝𝑖 ∧ 𝑝′𝑖. If 𝑈 ∈ 𝐽𝑖, then generate ̄𝜃 ∼ 𝐺(⋅|𝛩𝑖), and if 𝑈 ∈ 𝐽′𝑖, then generate ̄𝜃′ ∼ 𝐺′(⋅|𝛩′𝑖). Then ̄𝜃 and ̄𝜃′ have marginal distributions 𝐺 and 𝐺′, and

𝔼𝜌^𝑘( ̄𝜃, ̄𝜃′) ≤ 𝔼[𝜌^𝑘( ̄𝜃, ̄𝜃′) 1_{𝑈≤∑𝑖 𝑝𝑖∧𝑝′𝑖}] + diam(𝛩)^𝑘 ℙ(𝑈 > ∑𝑖 𝑝𝑖 ∧ 𝑝′𝑖).

The first term is bounded by the 𝑘-th power of the first term of the lemma, while the probability in the second term is equal to 1 − ∑𝑖 𝑝𝑖 ∧ 𝑝′𝑖 = ∑𝑖 |𝑝𝑖 − 𝑝′𝑖|/2.
Lemma 1.9. For the set ℳ(𝛩) of all Borel probability measures on a metric space (𝛩, 𝜌), any 𝑘 ≥ 1, and 0 < 𝜀 < min{2/3, diam(𝛩)},

𝑁(𝜀, ℳ(𝛩), 𝑊𝑘) ≤ (4 diam(𝛩)/𝜀)^{𝑘𝑁(𝜀,𝛩,𝜌)}.

Proof. For a minimal 𝜀-net over 𝛩 of 𝑁 = 𝑁(𝜀, 𝛩, 𝜌) points, let 𝛩 = ∪𝑖𝛩𝑖 be the partition obtained by assigning each 𝜃 to a closest point of the net. For any 𝐺 let 𝐺𝜀 = ∑𝑖 𝐺(𝛩𝑖)𝛿𝜃𝑖, for arbitrary but fixed 𝜃𝑖 ∈ 𝛩𝑖. Since 𝑊𝑘(𝐺, 𝐺𝜀) ≤ 𝜀 by Lemma 1.8, 𝑁(2𝜀, ℳ(𝛩), 𝑊𝑘) ≤ 𝑁(𝜀, ℳ𝜀, 𝑊𝑘) holds for ℳ𝜀 the set of all 𝐺𝜀. We next form the measures 𝐺𝜀,𝑝 = ∑𝑖 𝑝𝑖𝛿𝜃𝑖 for (𝑝1, … , 𝑝𝑁) ranging over an (𝜀/diam(𝛩))^𝑘-net for the 𝑙1-distance over the 𝑁-dimensional unit simplex. By Lemma 1.8 every 𝐺𝜀 is within 𝑊𝑘-distance 𝜀 of some 𝐺𝜀,𝑝. Therefore the proof is completed, because 𝑁(𝜀, ℳ𝜀, 𝑊𝑘) is bounded from above by the number of points 𝑝, which is bounded by (4 diam(𝛩)/𝜀)^{𝑘𝑁} (cf. Lemma A.4 in [31]).
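The "in particular" bound of Lemma 1.8 (with 𝑘 = 1 and 𝛩 = [−𝑎, 𝑎], so diam(𝛩) = 2𝑎) can be checked against the exact 𝑊1 distance. A sketch with arbitrary random discrete measures:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
a = 1.0
theta = rng.uniform(-a, a, 5)                                  # support of G
theta_prime = np.clip(theta + rng.uniform(-0.05, 0.05, 5), -a, a)  # perturbed
p = rng.dirichlet(np.ones(5))                                  # weights of G
p_prime = rng.dirichlet(np.ones(5))                            # weights of G'

w1 = wasserstein_distance(theta, theta_prime, p, p_prime)
# Lemma 1.8: W_1 <= max_i |theta_i - theta'_i| + diam * ||p - p'||_1
bound = np.max(np.abs(theta - theta_prime)) + 2 * a * np.sum(np.abs(p - p_prime))
assert w1 <= bound + 1e-12
```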
1.6 prior mass
The main result of this section is the following proposition, which gives a lower bound on the prior mass of the prior (i)-(iv) in a neighbourhood of a mixture 𝑝𝐺0.

Proposition 1.10. If 𝛱 is the Dirichlet process DP(𝛼) with base measure 𝛼 that has a Lebesgue density bounded away from 0 and ∞ on its support [−𝑎, 𝑎], and 𝑓 is the Laplace kernel, then for every sufficiently small 𝜀 > 0 and every probability measure 𝐺0 on [−𝑎, 𝑎],

log 𝛱(𝐺 ∶ 𝐾(𝑝𝐺, 𝑝𝐺0) ≤ 𝜀², 𝐾2(𝑝𝐺, 𝑝𝐺0) ≤ 𝜀²) ≳ −(1/𝜀)^{2/3} log(1/𝜀).
Proof. By Lemma 1.4 there exists a discrete measure 𝐺1 with 𝑁 ≲ 𝜀^{−2/3} support points such that ℎ(𝑝𝐺0, 𝑝𝐺1) ≤ 𝜀. We may assume that the support points of 𝐺1 are at least 2𝜀²-separated. If not, we take a maximal 2𝜀²-separated set in the support points of 𝐺1, and replace 𝐺1 by the discrete measure obtained by relocating the masses of 𝐺1 to the nearest points in the 2𝜀²-net. Then ℎ(𝑝𝐺1, 𝑝𝐺′1) ≲ 𝜀², as seen in the proof of Proposition 1.6.
Now by Lemmas 1.11 and 1.12, if 𝐺1 = ∑_{𝑗=1}^𝑁 𝑝𝑗𝛿𝑧𝑗, with the support points 𝑧𝑗 at least 2𝜀²-separated,

{𝐺 ∶ max(𝐾, 𝐾2)(𝑝𝐺0, 𝑝𝐺) < 𝑑1𝜀²} ⊃ {𝐺 ∶ ℎ(𝑝𝐺0, 𝑝𝐺) ≤ 2𝜀}
⊃ {𝐺 ∶ ℎ(𝑝𝐺1, 𝑝𝐺) ≤ 𝜀}
⊃ {𝐺 ∶ ‖𝑝𝐺 − 𝑝𝐺1‖1 ≤ 𝑑2𝜀²}
⊃ {𝐺 ∶ ∑_{𝑗=1}^𝑁 |𝐺[𝑧𝑗 − 𝜀², 𝑧𝑗 + 𝜀²] − 𝑝𝑗| ≤ 𝜀²}.
Since the base measure 𝛼 has density bounded away from zero and infinity on [−𝑎, 𝑎] by assumption, by Lemma A.2 of [31] we have

log 𝛱(𝐺 ∶ ∑_{𝑗=1}^𝑁 |𝐺[𝑧𝑗 − 𝜀², 𝑧𝑗 + 𝜀²] − 𝑝𝑗| ≤ 𝜀²) ≳ −𝑁 log 𝜀^{−1}.

The proposition follows upon combining the preceding.
Lemma 1.11. If 𝐺′ = ∑_{𝑗=1}^𝑁 𝑝𝑗𝛿𝑧𝑗 is a probability measure supported on points 𝑧1, … , 𝑧𝑁 in ℝ with |𝑧𝑗 − 𝑧𝑘| > 2𝜀 for 𝑗 ≠ 𝑘, then for any probability measure 𝐺 on ℝ and kernel 𝑓 with derivative 𝑓^{(1)},

‖𝑝𝐺 − 𝑝𝐺′‖1 ≤ 2‖𝑓^{(1)}‖1 𝜀 + 2 ∑_{𝑗=1}^𝑁 |𝐺[𝑧𝑗 − 𝜀, 𝑧𝑗 + 𝜀] − 𝑝𝑗|.
Lemma 1.12. If 𝐺 and 𝐺′ are probability measures on [−𝑎, 𝑎], and 𝑓 is the Laplace kernel, then

ℎ²(𝑝𝐺, 𝑝𝐺′) ≲ ‖𝑝𝐺 − 𝑝𝐺′‖2, (1.9)
max(𝐾(𝑝𝐺, 𝑝𝐺′), 𝐾2(𝑝𝐺, 𝑝𝐺′)) ≲ ℎ²(𝑝𝐺, 𝑝𝐺′). (1.10)

Proofs. The first lemma is a generalization of Lemma 4 in [33] from normal to general kernels, and is proved in the same manner. We omit further details.
In view of the shape of the Laplace kernel, it is easy to see that for 𝐺 compactly supported on [−𝑎, 𝑎],

𝑓(|𝑥| + 𝑎) ≤ 𝑝𝐺(𝑥) ≤ 𝑓(|𝑥| − 𝑎).

We bound the squared Hellinger distance as follows:
ℎ²(𝑝𝐺, 𝑝𝐺′) ≤ ∫ (𝑝𝐺 − 𝑝𝐺′)²/(𝑝𝐺 + 𝑝𝐺′) 𝑑𝑥
≤ ∫_{|𝑥|≤𝐴} 𝑒^{𝐴+𝑎}(𝑝𝐺 − 𝑝𝐺′)² 𝑑𝑥 + ∫_{|𝑥|>𝐴} (𝑝𝐺 + 𝑝𝐺′) 𝑑𝑥
≲ 𝑒^𝑎 ‖𝑝𝐺 − 𝑝𝐺′‖²_2 𝑒^𝐴 + 𝑒^{−𝐴}.

By the elementary inequality 𝑡 + 𝑢/𝑡 ≥ 2√𝑢, for 𝑢, 𝑡 > 0, we obtain (1.9) upon choosing 𝐴 = min(𝑎, log ‖𝑝𝐺 − 𝑝𝐺′‖_2^{−1} − 𝑎/2).
For the proof of the second assertion we first note that, if both 𝐺 and 𝐺′ are compactly supported on [−𝑎, 𝑎],

𝑝𝐺(𝑥)/𝑝𝐺′(𝑥) ≤ 𝑓(|𝑥| − 𝑎)/𝑓(|𝑥| + 𝑎) ≤ 𝑒^{2𝑎}.

Therefore ‖𝑝𝐺/𝑝𝐺′‖∞ ≤ 𝑒^{2𝑎}, and (1.10) follows by Lemma 8 in [33].
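The sandwich bound 𝑓(|𝑥| + 𝑎) ≤ 𝑝𝐺(𝑥) ≤ 𝑓(|𝑥| − 𝑎), with 𝑓(𝑦) = 𝑒^{−𝑦}/2 read as a formula in 𝑦 (so the upper bound exceeds 1/2 for |𝑥| < 𝑎), underlies the proofs above and in Proposition 1.6. A quick numerical sanity check with an arbitrary random mixing measure:

```python
import numpy as np

rng = np.random.default_rng(2)
a = 1.0
z = rng.uniform(-a, a, 20)        # support of a random discrete G on [-a, a]
w = rng.dirichlet(np.ones(20))    # its weights

x = np.linspace(-10.0, 10.0, 2001)
# p_G = f * G for the Laplace kernel f(x) = exp(-|x|)/2
p_G = np.sum(w[:, None] * 0.5 * np.exp(-np.abs(x[None, :] - z[:, None])), axis=0)

lower = 0.5 * np.exp(-(np.abs(x) + a))
upper = 0.5 * np.exp(-(np.abs(x) - a))   # exceeds 1/2 for |x| < a
assert np.all(lower <= p_G + 1e-15)
assert np.all(p_G <= upper + 1e-15)
```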
1.7 proof of theorem 1.1
The basic theorem of [30] gives a posterior contraction rate in terms of a metric on densities that is bounded above by the Hellinger distance. In the present situation, for the proofs of Theorems 1.1 and 1.2, we would like to apply this result to a power smaller than one of the Wasserstein metric and a power smaller than one of the 𝐿𝑞-distance, respectively, both of which are not metrics.
Consider a general "discrepancy measure" 𝑑, which is a map 𝑑 ∶ 𝒫 × 𝒫 → ℝ on the product of the set of densities on a given measurable space and itself, which has the following properties, for some constant 𝐶 > 0:
(a) 𝑑(𝑥, 𝑦) ≥ 0;
(b) 𝑑(𝑥, 𝑦) = 0 if and only if 𝑥 = 𝑦;
(c) 𝑑(𝑥, 𝑦) = 𝑑(𝑦, 𝑥);
(d) 𝑑(𝑥, 𝑦) ≤ 𝐶(𝑑(𝑥, 𝑧) + 𝑑(𝑦, 𝑧)).
Thus 𝑑 is a metric except that the triangle inequality is replaced with a weaker condition that incorporates a constant 𝐶, possibly bigger than 1. Call a set of the form {𝑥 ∶ 𝑑(𝑥, 𝑦) < 𝑐} a 𝑑-ball, and define covering numbers 𝑁(𝜀, 𝒫, 𝑑) relative to 𝑑 as usual.

Let 𝛱𝑛(⋅ ∣ 𝑋1, … , 𝑋𝑛) be the posterior distribution of 𝑝 given an i.i.d. sample 𝑋1, … , 𝑋𝑛 from a density 𝑝 that is equipped with a prior probability distribution 𝛱.
Theorem 1.13. Suppose 𝑑 has the properties as given, satisfies 𝑑(𝑝0, 𝑝) ≤ ℎ(𝑝0, 𝑝) for every 𝑝 ∈ 𝒫, and the sets {𝑝 ∶ 𝑑(𝑝, 𝑝′) < 𝛿} are convex. Then 𝛱𝑛(𝑑(𝑝, 𝑝0) > 𝑀𝜀𝑛 ∣ 𝑋1, … , 𝑋𝑛) → 0 in 𝑃_0^𝑛-probability for any 𝜀𝑛 such that 𝑛𝜀²𝑛 → ∞ and such that, for positive constants 𝑐1, 𝑐2 and sets 𝒫𝑛 ⊂ 𝒫,

log 𝑁(𝜀𝑛, 𝒫𝑛, 𝑑) ≤ 𝑐1𝑛𝜀²𝑛, (1.11)
𝛱𝑛(𝑝 ∶ 𝐾(𝑝0, 𝑝) < 𝜀²𝑛, 𝐾2(𝑝0, 𝑝) < 𝜀²𝑛) ≥ 𝑒^{−𝑐2𝑛𝜀²𝑛}, (1.12)
𝛱𝑛(𝒫 − 𝒫𝑛) ≤ 𝑒^{−(𝑐2+4)𝑛𝜀²𝑛}. (1.13)

We defer the proof until Appendix A.
The proof of Theorem 1.1 is based on the following comparison between the Wasserstein and Hellinger metrics. The lemma improves and generalizes Theorem 2 in [61]. We choose the constant 𝐶𝑘 carefully, to make sure that the map 𝜀 ↦ 𝜀[log(𝐶𝑘/𝜀)]^{𝑘+1/2} is monotone on (0, 2].
Lemma 1.14. For probability measures 𝐺 and 𝐺′ supported on [−𝑎, 𝑎], and 𝑝𝐺 = 𝑓 ∗ 𝐺 for a probability density 𝑓 with inf_𝜆 (1 + |𝜆|^𝛽)| ̃𝑓(𝜆)| > 0, and any 𝑘 ≥ 1,

𝑊𝑘(𝐺, 𝐺′) ≲ ℎ(𝑝𝐺, 𝑝𝐺′)^{1/(𝑘+𝛽)} (log(𝐶𝑘/ℎ(𝑝𝐺, 𝑝𝐺′)))^{(𝑘+1/2)/(𝑘+𝛽)}.
Proof. By Theorem 6.15 in [76] the Wasserstein distance 𝑊𝑘(𝐺, 𝐺′) is bounded above by a multiple of the 𝑘th root of ∫ |𝑥|^𝑘 𝑑|𝐺 − 𝐺′|(𝑥), where |𝐺 − 𝐺′| is the total variation measure of the difference 𝐺 − 𝐺′. We apply this to the convolutions of 𝐺 and 𝐺′ with the normal distribution 𝛷𝛿 with mean 0 and variance 𝛿², to find, for every 𝑀 > 0,

𝑊𝑘(𝐺 ∗ 𝛷𝛿, 𝐺′ ∗ 𝛷𝛿)^𝑘 ≲ ∫ |𝑥|^𝑘 |(𝐺 − 𝐺′) ∗ 𝜙𝛿(𝑥)| 𝑑𝑥
≤ (∫_{−𝑀}^𝑀 𝑥^{2𝑘} 𝑑𝑥 ∫_{−𝑀}^𝑀 |(𝐺 − 𝐺′) ∗ 𝜙𝛿(𝑥)|² 𝑑𝑥)^{1/2} + 𝑒^{−𝑀} ∫_{|𝑥|>𝑀} |𝑥|^𝑘 𝑒^{|𝑥|} |(𝐺 − 𝐺′) ∗ 𝜙𝛿(𝑥)| 𝑑𝑥
≲ 𝑀^{𝑘+1/2} ‖(𝐺 − 𝐺′) ∗ 𝜙𝛿‖2 + 𝑒^{−𝑀} 𝑒^{2|𝑎|} 𝔼𝑒^{2|𝛿𝑍|},

where 𝑍 is a standard normal variable. The number 𝐾𝛿 ∶= 𝑒^{2|𝑎|} 𝔼𝑒^{2|𝛿𝑍|} is uniformly bounded if 𝛿 ≤ 𝛿𝑘, for some fixed 𝛿𝑘. By Plancherel's theorem,
‖(𝐺 − 𝐺′) ∗ 𝜙𝛿‖²_2 = ∫ | ̃𝐺 − ̃𝐺′|²(𝜆) ̃𝜙²𝛿(𝜆) 𝑑𝜆
= ∫ | ̃𝑓( ̃𝐺 − ̃𝐺′)|²(𝜆) ( ̃𝜙²𝛿/| ̃𝑓|²)(𝜆) 𝑑𝜆
≲ ‖𝑝𝐺 − 𝑝𝐺′‖²_2 sup_𝜆 ( ̃𝜙²𝛿/| ̃𝑓|²)(𝜆) ≲ ℎ²(𝑝𝐺, 𝑝𝐺′) 𝛿^{−2𝛽},

where we have again applied Plancherel's theorem, used that the 𝐿2-metric on uniformly bounded densities is bounded by the Hellinger distance, and used the assumption on the Fourier transform of 𝑓, which shows that ( ̃𝜙𝛿/| ̃𝑓|)(𝜆) ≲ (1 + |𝜆|^𝛽)𝑒^{−𝛿²𝜆²/2} ≲ 𝛿^{−𝛽}.
If 𝑈 ∼ 𝐺 is independent of 𝑍 ∼ 𝑁(0, 1), then (𝑈, 𝑈 + 𝛿𝑍) gives a coupling of 𝐺 and 𝐺 ∗ 𝛷𝛿. Therefore the definition of the Wasserstein metric gives that 𝑊𝑘(𝐺, 𝐺 ∗ 𝛷𝛿)^𝑘 ≤ 𝔼|𝛿𝑍|^𝑘 ≲ 𝛿^𝑘.
Combining the preceding inequalities with the triangle inequality we see that, for 𝛿 ∈ (0, 𝛿𝑘] and any 𝑀 > 0,

𝑊𝑘(𝐺, 𝐺′)^𝑘 ≲ 𝑀^{𝑘+1/2} ℎ(𝑝𝐺, 𝑝𝐺′) 𝛿^{−𝛽} + 𝑒^{−𝑀} + 𝛿^𝑘.

The lemma follows by optimizing this over 𝑀 and 𝛿. Specifically, for 𝜀 = ℎ(𝑝𝐺, 𝑝𝐺′), the choices 𝑀 = (𝑘/(𝑘 + 𝛽)) log(𝐶𝑘/𝜀) and 𝛿 = (𝑀^{𝑘+1/2}𝜀)^{1/(𝑘+𝛽)} are eligible for

𝛿𝑘 = sup_{𝜀∈(0,2]} [(𝑘/(𝑘 + 𝛽)) log(𝐶𝑘/𝜀)]^{(𝑘+1/2)/(𝑘+𝛽)} 𝜀^{1/(𝑘+𝛽)},

which is indeed a finite number. In fact the supremum is taken at 𝜀 = 2, by the assumption on 𝐶𝑘.
For the Laplace kernel 𝑓 we choose 𝛽 = 2 in the preceding lemma, and then obtain that 𝑑(𝑝𝐺, 𝑝𝐺′) ≤ ℎ(𝑝𝐺, 𝑝𝐺′), for the discrepancy 𝑑 = 𝛾^{−1}(𝑊𝑘), where 𝛾(𝜀) = 𝐷𝑘𝜀^{1/(𝑘+𝛽)}[log(𝐶𝑘/𝜀)]^{(𝑘+1/2)/(𝑘+𝛽)} is a multiple of the (monotone) transformation on the right side of the preceding lemma. For small values of 𝑊𝑘(𝐺1, 𝐺2) we have

𝑑(𝑝𝐺1, 𝑝𝐺2) ≍ 𝑊𝑘^{𝑘+2}(𝐺1, 𝐺2) (log(1/𝑊𝑘(𝐺1, 𝐺2)))^{−𝑘−1/2}. (1.14)

As 𝑘 + 2 > 1 the discrepancy 𝑑 may not satisfy the triangle inequality, but it does possess the properties (a)-(d) listed in the paragraphs before Theorem 1.13. The balls of the discrepancy 𝑑 are convex, as the Wasserstein metrics are convex (see [76]).
It follows that Theorem 1.13 applies to obtain a rate of posterior contraction relative to 𝑑, and hence relative to

𝑊𝑘 ∼ 𝑑^{1/(𝑘+2)} (log(1/𝑑))^{(𝑘+1/2)/(𝑘+2)}.
We apply the theorem with 𝒫 = 𝒫𝑛 equal to the set of mixtures 𝑝𝐺 = 𝑓 ∗ 𝐺, as 𝐺 ranges over ℳ[−𝑎, 𝑎]. Thus (1.13) is trivially satisfied.
For the entropy condition (1.11), by Proposition 1.7 we have

log 𝑁(𝜀, 𝒫𝑛, 𝑑) = log 𝑁(𝜀^{1/(𝑘+2)}(log(1/𝜀))^{(𝑘+1/2)/(𝑘+2)}, ℳ[−𝑎, 𝑎], 𝑊𝑘) ≲ (1/𝜀)^{1/(𝑘+2)} (log(1/𝜀))^{1+(𝑘+1/2)/(𝑘+2)}.

Thus (1.11) holds for the rate 𝜀𝑛 ≳ 𝑛^{−𝛾}, for every 𝛾 < (𝑘 + 2)/(2𝑘 + 5).
In view of Proposition 1.10, the prior mass condition (1.12) is satisfied with the rate 𝜀𝑛 ≍ (log 𝑛/𝑛)^{3/8}.

Theorem 1.13 yields a rate of contraction relative to 𝑑 equal to the slower of the two rates, which is (log 𝑛/𝑛)^{3/8}. This translates into the rate for the Wasserstein distance as given in Theorem 1.1.
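The translation from the 𝑑-rate (log 𝑛/𝑛)^{3/8} to the 𝑊𝑘-rate goes through the (1/(𝑘+2))-th power, giving the polynomial exponent 3/(8(𝑘+2)) = 3/(8𝑘+16) of Theorem 1.1. A trivial arithmetic check of this bookkeeping:

```python
from fractions import Fraction

def wasserstein_exponent(k):
    # (3/8) * 1/(k+2): the polynomial part of the W_k rate in Theorem 1.1
    return Fraction(3, 8) * Fraction(1, k + 2)

assert wasserstein_exponent(1) == Fraction(1, 8)   # W_1: nearly n^(-1/8)
assert wasserstein_exponent(2) == Fraction(3, 32)
# the rate deteriorates monotonically with k
assert all(wasserstein_exponent(k) > wasserstein_exponent(k + 1) for k in range(1, 10))
```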
1.8 proof of theorem 1.2
We apply Theorem 1.13, with 𝒫 = 𝒫𝑛the set of all mixtures𝑝𝐺as 𝐺 ranges over ℳ[−𝑎, 𝑎]. For 𝑑 = ℎ the rate follows immediately by combining Propositions 1.6 and 1.10.
Since the densities𝑝𝐺are uniformly bounded by1/2, the 𝐿𝑞dis- tance‖𝑝𝐺− 𝑝𝐺′‖𝑞is bounded above by a multiple ofℎ(𝑝𝐺, 𝑝𝐺′)2/𝑞. We can therefore apply Theorem 1.13 with the discrepancy𝑑(𝑝, 𝑝′) =
1.9 normal mixtures
‖𝑝 − 𝑝′‖𝑞/2𝑞 . In view of Proposition 1.5
log 𝑁(𝜀, 𝒫𝑛, 𝑑) ≲ 𝜀−2/(𝑞+1)log 𝜀−1.
Therefore setting𝜀𝑛 ≍ (log 𝑛/𝑛)(𝑞+1)/(2𝑞+4)fulfills the entropy condi- tion (1.11). By Proposition 1.10 the prior mass condition is satisfied for 𝜀𝑛≍ (log 𝑛/𝑛)3/8. By Theorem 1.13 the rate of contraction relative to 𝑑 is the slower of these two rates, which is the first. The rate relative to the𝐿𝑞-norm is the(2/𝑞)-th power of this rate.
1.9 normal mixtures
We reproduce the results on normal mixtures from [31], but in the 𝐿2-norm. The normal kernel is supersmooth with 𝛽 = 2, so by the Approximation Lemma 1.3, for any measure 𝐺1 compactly supported on [−𝑎, 𝑎] we can always find a discrete measure 𝐺2 with a number of support points of order 𝑁 ≍ log 𝜀^{−1} such that ‖𝑝𝐺1 − 𝑝𝐺2‖2 ≤ 𝜀. By Lemma 4 in [33], we establish

ℎ²(𝑝𝐺1, 𝑝𝐺2) ≲ ‖𝑝𝐺1 − 𝑝𝐺2‖2.
Following the same procedure as before, assuming 𝐺0 is the true measure, we obtain for the prior mass condition

log 𝛱(𝐺 ∶ max(𝑃𝐺0 log(𝑝𝐺0/𝑝𝐺), 𝑃𝐺0 (log(𝑝𝐺0/𝑝𝐺))²) ≤ 𝜀²) ≳ −(log 𝜀^{−1})².

Thus we obtain 𝜀𝑛 = log 𝑛/√𝑛.
By the proof of Proposition 1.5, we have the following estimate for the entropy condition:

log 𝑁(𝜀, 𝒢𝑎, ‖⋅‖2) ≲ (log 𝜀^{−1})².

This coincides with the estimate for the prior mass condition, and thus we obtain the rate 𝜀𝑛 = log 𝑛/√𝑛 with respect to the 𝐿2-norm. This is the same rate as obtained in [31], but now in the 𝐿2-norm. However, we lose a √(log 𝑛) factor compared to [77], whose rate is √(log 𝑛/𝑛).