
Cover Page

The handle http://hdl.handle.net/1887/49012 holds various files of this Leiden University dissertation.

Author: Gao, F.

Title: Bayes and networks

Issue Date: 2017-05-23


Fengnan Gao

BAYES & NETWORKS

Shanghai, April 2017


Fengnan Gao: Bayes & Networks, Dirichlet-Laplace Deconvolution and Statistical Inference in Preferential Attachment Networks, © April 2017

The author designed the cover himself. In the bottom-right corner lies a phoenix, which the character Feng in the author's name denotes in Chinese.

Title Page: The decoration in the margin was modified from code published on TeX StackExchange by Gonzalo Medina. The network illustration was distributed by Till Tantau, the author of TikZ, under the GNU Free Documentation License.

All rights reserved. No part of this publication may be reproduced in any form or by any electronic or mechanical means, including information storage and retrieval systems, without the prior written permission of the author.

A catalogue record is available from the Leiden University Library.

The research in the dissertation was supported by the Netherlands Organization for Scientific Research (NWO).


Bayes & Networks

Dissertation

to obtain

the degree of Doctor at Leiden University, on the authority of the Rector Magnificus Prof. mr. C. J. J. M. Stolker,

by decision of the Doctorate Board, to be defended on Tuesday 23 May 2017

at 10:00

by

Fengnan Gao, born in Jiangsu, China,

in 1988


Composition of the doctoral committee:

Promotor:

Prof. dr. A. W. van der Vaart (Universiteit Leiden)

Other members:

Prof. dr. B. de Smit (Universiteit Leiden, chair)
Prof. dr. J. J. Meulman (Universiteit Leiden, secretary)
Prof. dr. R. W. van der Hofstad (TU Eindhoven)
Prof. dr. J. H. van Zanten (Universiteit van Amsterdam)
Dr. R. M. Castro (TU Eindhoven)
Dr. A. J. Schmidt-Hieber (Universiteit Leiden)


To my family


回首向来萧瑟处 归去

也无风雨也无晴

苏轼

宋神宗元丰五年

Looking back over the bleak passage survived, The return in time,

Shall not be affected by windswept rain or sunshine.

Su Shi (1082)


CONTENTS

I   Nonparametric Bayesian Dirichlet-Laplace Deconvolution

1   Posterior Contraction Rates for Deconvolution of Dirichlet-Laplace Mixtures
    1.1 Introduction
    1.2 Notation and Preliminaries
    1.3 Main Results
    1.4 Finite Approximation
    1.5 Entropy
    1.6 Prior Mass
    1.7 Proof of Theorem 1.1
    1.8 Proof of Theorem 1.2
    1.9 Normal Mixtures

II  Statistical Inference in Preferential Attachment Networks

2   Introduction to Networks
    2.1 Network Science
        2.1.1 The emergence of network science
        2.1.2 Fundamentals of graph theory
        2.1.3 Properties of typical networks
    2.2 Preferential Attachment Networks
        2.2.1 History and motivation of the PA networks
        2.2.2 A rather general PA model
        2.2.3 The linear PA models with random initial degrees
        2.2.4 The general sublinear PA models
        2.2.5 The general sublinear parametric PA models

3   Estimation of General PA Networks
    3.1 Introduction
    3.2 Empirical Estimator
    3.3 Branching Process
        3.3.1 Rooted ordered tree
        3.3.2 Branching process
        3.3.3 The continuous random tree model
    3.4 Consistency
    3.5 Simulation Studies
        3.5.1 Sample variance study
        3.5.2 Asymptotic normality?

4   Estimation of Affine PA Networks
    4.1 Introduction and Notation
    4.2 Construction of the MLE
    4.3 Consistency
    4.4 Asymptotic Normality
    4.5 Local Asymptotic Normality and Efficiency
    4.6 The Case of Fixed Initial Degree
    4.7 Quasi-Maximum-Likelihood Estimator
    4.8 Simulation Study
        4.8.1 On the shoulder of the giants
        4.8.2 The majority rules

5   Estimation of Parametric PA Networks
    5.1 Introduction and Notation
    5.2 Construction of the MLE
    5.3 Consistency
    5.4 Asymptotic Normality
    5.5 A Remedy to a Historical Problem

III Modeling the Dynamics of the Movie-Actor Network

6   Modeling the Dynamics of the Movie-Actor Network of the Internet Movie Database
    6.1 Introduction
    6.2 Conceptual Model Description
    6.3 Empirical Fitting to the IMDb Dataset
        6.3.1 Movie sizes
        6.3.2 Number of new actors
        6.3.3 PA function
        6.3.4 PA function on movie degrees
        6.3.5 Model fitting
    6.4 Simulations
    6.5 Theoretical Study
    6.6 Conclusion and Future Work

IV  Appendix

A   Dirichlet Processes and Contraction Rates Relative to Non-metrics
    A.1 Dirichlet Processes
    A.2 Contraction Rates Relative to Non-metrics

B   Convergence to a Power Law of the Movie Degrees in the PAM-IMDb Model
    B.1 Introduction and Heuristics
    B.2 Proof of Theorem B.1
        B.2.1 Concentration around the mean
        B.2.2 Identification of the mean sequence

Bibliography
Summary
Samenvatting
Acknowledgements
Curriculum Vitæ
Colophon


LIST OF FIGURES

Figure 2.1   Seven Bridges of Königsberg
Figure 3.1   Boxplots of EEs in different settings
Figure 3.2   Sample variance study of the EE
Figure 3.3   QQ-plots of empirical estimators
Figure 3.4   Histogram of the rescaled empirical estimator
Figure 3.5   Estimated density of $\sqrt{n}(\hat r_2(n) - r_2)$ with different network sizes $n$
Figure 4.1   Log-log plot of empirical degree distribution vs. degree in PA networks
Figure 4.2   Histogram of the MLE in affine PA networks
Figure 6.1   Histogram of movie sizes in 1947
Figure 6.2   Log-log histogram of movie sizes
Figure 6.3   Log-log histogram of all movie sizes until the end of 2007
Figure 6.4   Ratio of new actors of movies in 1971
Figure 6.6   Actor degree evolution
Figure 6.7   Fitting a straight line on the log-log histogram of actor degrees
Figure 6.8   $\log(1 - \hat F_N(k))$-vs.-$\log k$ plot of actor degrees
Figure 6.9   Movie degree evolution
Figure 6.10  Fitting a straight line on the log-log histogram starting from $k = 40$ on movie degrees
Figure 6.11  $\log(1 - \hat F_N(k))$-vs.-$\log k$ plot of movie degrees in year 1947
Figure 6.12  Histogram of movie degrees by 1950
Figure 6.13  Movie degree comparisons between simulation and real dataset
Figure 6.14  Actor degree comparison between simulation and real dataset

LIST OF TABLES

Table 2.1    Representative networks
Table 4.1    Summary of the performance of the MLE in affine PA networks


Part I

NONPARAMETRIC BAYESIAN DIRICHLET-LAPLACE DECONVOLUTION


1 POSTERIOR CONTRACTION RATES FOR DECONVOLUTION OF DIRICHLET-LAPLACE MIXTURES

1.1 introduction

Consider statistical inference using the following nonparametric hierarchical Bayesian model for observations $X_1, \dots, X_n$:

(i) A probability distribution $G$ on $\mathbb{R}$ is generated from the Dirichlet process prior $\mathrm{DP}(\alpha)$ with base measure $\alpha$.

(ii) An iid sample $Z_1, \dots, Z_n$ is generated from $G$.

(iii) An iid sample $e_1, \dots, e_n$ is generated from a known density $f$, independently of the other samples.

(iv) The observations are $X_i = Z_i + e_i$, for $i = 1, \dots, n$.

In this setting, given $G$, the data $X_1, \dots, X_n$ are a sample from the convolution
\[
p_G = f * G
\]
of the density $f$ and the measure $G$. The scheme defines a conditional distribution of $G$ given the data $X_1, \dots, X_n$, the posterior distribution of $G$, and consequently also posterior distributions for quantities derived from $G$, including the convolution density $p_G$. We are interested in whether this posterior distribution can recover a true mixing distribution $G_0$ when the observations $X_1, \dots, X_n$ are in reality a sample from the mixed distribution $p_{G_0}$, for some given probability distribution $G_0$.
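For concreteness, the generative scheme (i)-(iv) is easy to simulate. The sketch below is an illustration of ours, not part of the original text: it truncates the stick-breaking representation of the Dirichlet process at a fixed number of atoms, and the choices of base measure (uniform on $[-a, a]$, total mass 1) and truncation level are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dp_stick_breaking(total_mass, base_sampler, trunc=500):
    """One draw G ~ DP(alpha) via truncated stick-breaking:
    returns the atoms and weights of a discrete probability measure."""
    betas = rng.beta(1.0, total_mass, size=trunc)
    weights = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return base_sampler(trunc), weights / weights.sum()  # absorb truncation error

a, n = 2.0, 1000
atoms, weights = sample_dp_stick_breaking(
    total_mass=1.0, base_sampler=lambda m: rng.uniform(-a, a, size=m))

Z = rng.choice(atoms, size=n, p=weights)     # step (ii): Z_i iid from G
e = rng.laplace(loc=0.0, scale=1.0, size=n)  # step (iii): errors with f(x) = e^{-|x|}/2
X = Z + e                                    # step (iv): a sample from p_G = f * G
```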

The main contribution of this chapter concerns the case where $f$ is the Laplace density $f(x) = e^{-|x|}/2$. For distributions on the full line, Laplace mixtures appear to be the second most popular class after mixtures of the normal distribution, with applications in, for instance, speech recognition and astronomy ([42]) and clustering problems in genetics ([7]). For the present theoretical investigation the Laplace kernel is interesting as a test case of a non-supersmooth kernel.

We consider two notions of recovery. The first measures the distance between the posterior of $G$ and $G_0$ through the Wasserstein metric
\[
W_k(G, G') = \inf_{\gamma \in \Gamma(G, G')} \Big( \int |x - y|^k \, d\gamma(x, y) \Big)^{1/k},
\]


where $\Gamma(G, G')$ is the collection of all couplings $\gamma$ of $G$ and $G'$ into a bivariate measure with marginals $G$ and $G'$ (i.e. if $(x, y) \sim \gamma$, then $x \sim G$ and $y \sim G'$), and $k \ge 1$. The Wasserstein metric is a classical metric on probability distributions, well suited for obtaining rates of estimation of measures. It is weaker than the total variation distance (which is more natural as a distance on densities), can be interpreted through transportation of measure (see [76]), and has also been used in applications such as comparing the color histograms of digital images. Recovery of the posterior distribution relative to the Wasserstein metric was considered by [61], within a general mixing framework. We refer to that paper for further motivation of the Wasserstein metric for mixtures, and to [76] for general background on the Wasserstein metric. In this chapter we improve the upper bound on posterior contraction rates given in [61], at least in the case of Laplace mixtures, obtaining a rate of nearly $n^{-1/8}$ for $W_1$ (and slower rates for $k > 1$). The minimax rate of contraction for Laplace mixtures relative to the Wasserstein metric appears to be currently unknown; recent work on recovery of a mixing distribution by non-Bayesian methods is given in [80]. It is not clear from our result whether the upper bound $n^{-1/8}$ is sharp.
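On the real line the infimum over couplings is attained by the quantile coupling, which makes $W_k$ between discrete measures directly computable. The following sketch is our own illustration (the function name and example measures are ours), implementing $W_k^k = \int_0^1 |F^{-1}(u) - H^{-1}(u)|^k \, du$ for the distribution functions $F$ and $H$ of two discrete measures.

```python
import numpy as np

def wasserstein_k(atoms1, w1, atoms2, w2, k=1.0):
    """W_k between two discrete probability measures on the line,
    via the optimal quantile coupling."""
    i1, i2 = np.argsort(atoms1), np.argsort(atoms2)
    x1, c1 = np.asarray(atoms1, float)[i1], np.cumsum(np.asarray(w1, float)[i1])
    x2, c2 = np.asarray(atoms2, float)[i2], np.cumsum(np.asarray(w2, float)[i2])
    # Quantile levels where either quantile function can jump.
    levels = np.unique(np.concatenate([c1, c2]))
    du = np.diff(np.concatenate([[0.0], levels]))
    # On each interval between levels both quantile functions are constant.
    q1 = x1[np.searchsorted(c1, levels - 1e-12)]
    q2 = x2[np.searchsorted(c2, levels - 1e-12)]
    return np.sum(du * np.abs(q1 - q2) ** k) ** (1.0 / k)

# Two three-point mixing distributions on [-1, 1]:
print(wasserstein_k([-1.0, 0.0, 1.0], [0.3, 0.4, 0.3],
                    [-0.8, 0.1, 0.9], [0.3, 0.3, 0.4], k=1))  # 0.21
```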

The second notion of recovery measures the distance of the posterior of $G$ to $G_0$ indirectly, through the Hellinger or $L_q$-distances between the mixed densities $p_G$ and $p_{G_0}$. This is equivalent to studying the estimation of the true density $p_{G_0}$ of the observations through the density $p_G$ under the posterior distribution. As the Laplace kernel $f$ has Fourier transform
\[
\tilde f(\lambda) = \frac{1}{1 + \lambda^2},
\]
the mixed densities $p_G$ have Fourier transforms satisfying
\[
|\tilde p_G(\lambda)| \le \frac{1}{1 + \lambda^2}.
\]
Estimation of a density with a polynomially decaying Fourier transform was first considered in [77]. According to the theorem in their Section 3A, a suitable kernel estimator possesses a root mean square error of order $n^{-3/8}$ with respect to the $L_2$-norm for estimating a density whose Fourier transform decays exactly at the order 2. This is the usual rate $n^{-\alpha/(2\alpha+1)}$ of nonparametric estimation for smoothness $\alpha = 3/2$. This is understandable, as $|\tilde p(\lambda)| \lesssim 1/(1 + |\lambda|^2)$ implies that $\int (1 + |\lambda|^2)^\alpha |\tilde p(\lambda)|^2 \, d\lambda < \infty$ for every $\alpha < 3/2$, so that a density with Fourier transform decaying at the square rate belongs to every Sobolev class of regularity $\alpha < 3/2$. Indeed, in [34] the rate $n^{-\alpha/(2\alpha+1)}$ is shown to be minimax for estimating a density in a Sobolev ball of functions on the line. In this chapter we show that the posterior distribution of Laplace mixtures $p_G$ contracts to $p_{G_0}$ at the rate $n^{-3/8}$, up to a logarithmic factor, relative to the $L_2$-norm and the Hellinger distance, and we also establish rates for other $L_q$-metrics. Thus the Dirichlet posterior (nearly) attains the minimax rate for estimating a density in a Sobolev ball of order $3/2$. It may be noted that the Laplace density itself is Hölder of exactly order 1, which implies that Laplace mixtures are Hölder smooth of at least the same order. This would suggest a rate $n^{-1/3}$ (the usual nonparametric rate for $\alpha = 1$), which is slower than $n^{-3/8}$; hence the Hölder heuristic is misleading here.

Besides recovery relative to the Wasserstein metric and the induced metrics on $p_G$, one might consider recovery relative to a metric on the distribution function of $G$. Frequentist recovery rates for this problem were obtained in [27] under some restrictions. There is no simple relation between these rates and the rates for the other metrics. The same is true for the rates for deconvolution of densities, as in [27]. In fact, the Dirichlet prior and posterior considered here are well known to concentrate on discrete distributions, and hence are useless as priors for recovering a density of $G$.

Contraction rates for Dirichlet mixtures of the normal kernel were considered in [31, 33, 44, 71, 72]. The results in these papers are driven by the smoothness of the Gaussian kernel, whence the same approach fails for the Laplace kernel. Nevertheless we borrow the idea of approximating the true mixed density by a finite mixture, albeit that the approximation is constructed in a different manner. Because more support points than in the Gaussian case are needed to obtain a given quality of approximation, higher entropy and lower prior mass concentration result, leading to a slower rate of posterior contraction. To obtain the contraction rate for the Wasserstein metrics we further derive a relationship between these metrics and a power of the Hellinger distance, and then apply a variant of the contraction theorem in [30], whose proof is included in the appendix of the dissertation. Contraction rates of mixtures with priors other than the Dirichlet were considered in [71]. Recovery of the mixing distribution is a deconvolution problem and as such can be considered an inverse problem. A general approach to posterior contraction rates in inverse problems can be found in [41], and results specific to deconvolution can be found in [24]. These authors are interested in deconvolving a (smooth) mixing density rather than a mixing distribution, and hence their results are not directly comparable to the results in this dissertation.

The papers [28, 49] consider recovery of a mixing density relative to the $L_p$-norm in the frequentist setting. If the smoothness of the mixing density degenerates to 0, then the minimax rate decreases to a constant and it is not possible to find a consistent estimator. In this chapter we show that in the same problem, but viewed as a deconvolution problem on distributions endowed with the weaker Wasserstein distance, we may obtain polynomial rates for the mixing distribution without any smoothness assumption on the distribution. In particular, for any mixing distribution it is possible to construct a consistent estimator.

The chapter is organized as follows. In the next section we give notation and preliminaries. We state in Section 1.3 the main results of the chapter, which are proved in the subsequent sections. In Section 1.4 we establish suitable finite approximations relative to the $L_q$- and Hellinger distances. The $L_q$-approximations also apply to kernels other than the Laplace kernel, and are given in terms of the tail decay of the kernel's characteristic function. In Sections 1.5 and 1.6 we apply these approximations to obtain bounds on the entropy of the mixtures relative to the $L_q$, Hellinger and Wasserstein metrics, and a lower bound on the prior mass in a neighbourhood of the true density. Sections 1.7 and 1.8 contain the proofs of the main results.

1.2 notation and preliminaries

Throughout the chapter, integrals given without limits are integrals over the real line $\mathbb{R}$. The $L_q$-norm is denoted by
\[
\|g\|_q = \Big( \int |g(x)|^q \, dx \Big)^{1/q},
\]
with $\|\cdot\|_\infty$ the uniform norm. The Hellinger distance on the space of densities is given by
\[
h(f, g) = \Big( \int \big( f^{1/2}(x) - g^{1/2}(x) \big)^2 \, dx \Big)^{1/2}.
\]
It is easy to see that $h^2(f, g) \le \|f - g\|_1 \le 2 h(f, g)$, for any two probability densities $f$ and $g$. Furthermore, if the densities $f$ and $g$ are uniformly bounded by a constant $M$, then $\|f - g\|_2 \le 2\sqrt{M}\, h(f, g)$.
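These inequalities are easy to check numerically. The following sketch is our own illustration: it evaluates both sides on a grid for two Laplace densities at different locations.

```python
import numpy as np

x = np.linspace(-30, 30, 60001)   # grid wide enough for the tails
dx = x[1] - x[0]

def laplace(t, mu=0.0):
    return 0.5 * np.exp(-np.abs(t - mu))   # f(x) = e^{-|x - mu|}/2

f, g = laplace(x), laplace(x, mu=1.5)
h = np.sqrt(np.sum((np.sqrt(f) - np.sqrt(g)) ** 2) * dx)   # Hellinger distance
l1 = np.sum(np.abs(f - g)) * dx                            # L1-distance

print(h ** 2, l1, 2 * h)   # h^2 <= ||f - g||_1 <= 2h holds
```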

The Kullback-Leibler discrepancy and corresponding variance are denoted by
\[
K(p_0, p) = \int \log(p_0/p) \, dP_0, \qquad K_2(p_0, p) = \int \big( \log(p_0/p) \big)^2 \, dP_0,
\]
with $P_0$ the measure corresponding to the density $p_0$.

We are primarily interested in the Laplace kernel, but a number of results hold for general kernels $f$. The Fourier transform of a function $f$ and the inverse Fourier transform of a function $\tilde f$ are given by
\[
\tilde f(\lambda) = \int e^{\imath \lambda x} f(x) \, dx, \qquad f(x) = \frac{1}{2\pi} \int e^{-\imath \lambda x} \tilde f(\lambda) \, d\lambda.
\]
For $\frac1p + \frac1q = 1$ and $1 \le p \le 2$, the Hausdorff-Young inequality gives
\[
\|f\|_q \le (2\pi)^{-1/p} \|\tilde f\|_p.
\]

The covering number $N(\varepsilon, \Theta, \rho)$ of a metric space $(\Theta, \rho)$ is the minimum number of $\varepsilon$-balls needed to cover the entire space $\Theta$.

Throughout the chapter, $\lesssim$ denotes inequality up to a constant multiple, where the constant is universal or fixed within the context. Furthermore, $a_n \asymp b_n$ means that for some positive constants $c$ and $C$,
\[
c \le \liminf_{n \to \infty} a_n/b_n \le \limsup_{n \to \infty} a_n/b_n \le C.
\]

We denote by $\mathcal{M}[-a, a]$ the set of all probability measures on a given interval $[-a, a]$.

1.3 main results

Write $\Pi_n(\cdot \mid X_1, \dots, X_n)$ for the posterior distribution of $G$ in the scheme (i)-(iv) introduced at the beginning of the chapter. We study this random distribution assuming that $X_1, \dots, X_n$ are an iid sample from the mixture density $p_{G_0} = f * G_0$, for a given probability distribution $G_0$. We assume that $G_0$ is supported in a compact interval $[-a, a]$, and that the base measure $\alpha$ of the Dirichlet prior in (i) is concentrated on this interval with a Lebesgue density bounded away from 0 and $\infty$.

Theorem 1.1. If $G_0$ is supported on $[-a, a]$, $f$ is the Laplace kernel, and $\alpha$ has support $[-a, a]$ with a Lebesgue density bounded away from 0 and $\infty$, then for every $k \ge 1$ there exists a constant $M$ such that
\[
\Pi_n\big( G : W_k(G, G_0) \ge M n^{-3/(8k+16)} (\log n)^{(k+7/8)/(k+2)} \mid X_1, \dots, X_n \big) \to 0, \tag{1.1}
\]
in $P_{G_0}$-probability.

The rate for the Wasserstein metric $W_k$ given in the theorem deteriorates with increasing $k$, which is perhaps not unreasonable, as the Wasserstein metrics increase with $k$. The fastest rate is obtained for $W_1$, namely $n^{-1/8} (\log n)^{5/8}$.

Theorem 1.2. If $G_0$ is supported on $[-a, a]$, $f$ is the Laplace kernel, and $\alpha$ has support $[-a, a]$ with a Lebesgue density bounded away from 0 and $\infty$, then there exists a constant $M$ such that
\[
\Pi_n\big( G : h(p_G, p_{G_0}) \ge M (\log n / n)^{3/8} \mid X_1, \dots, X_n \big) \to 0, \tag{1.2}
\]
in $P_{G_0}$-probability. Furthermore, for every $q \in [2, \infty)$ there exists $M_q$ such that
\[
\Pi_n\big( G : \|p_G - p_{G_0}\|_q \ge M_q (\log n / n)^{(q+1)/(q(q+2))} \mid X_1, \dots, X_n \big) \to 0, \tag{1.3}
\]
in $P_{G_0}$-probability.

The rate for the $L_q$-distance given in (1.3) deteriorates with increasing $q$. For $q = 2$ it coincides with the rate $(\log n / n)^{3/8}$ for the Hellinger distance.
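As a quick check on the exponents (our verification, not part of the original text): for $k = 1$ in Theorem 1.1,
\[
\frac{3}{8k+16}\Big|_{k=1} = \frac{3}{24} = \frac18, \qquad \frac{k + 7/8}{k + 2}\Big|_{k=1} = \frac{15/8}{3} = \frac58,
\]
recovering the $W_1$-rate $n^{-1/8} (\log n)^{5/8}$, while for $q = 2$ in Theorem 1.2,
\[
\frac{q+1}{q(q+2)}\Big|_{q=2} = \frac{3}{8},
\]
matching the Hellinger rate $(\log n / n)^{3/8}$.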

In both theorems the mixing distributions are assumed to be supported on a fixed compact set. Without a restriction on the tails of the mixing distributions, no rate is possible. The assumption of compact support ensures that the rate is fully determined by the complexity of the mixtures, and not by their tail behaviour.

1.4 finite approximation

In this section we show that a general mixture $p_G$ can be approximated by a mixture with finitely many components, where the number of components depends on the accuracy of the approximation, the distance used, and the kernel $f$. We first consider approximations with respect to the $L_q$-norm, which apply to mixtures $p_G = f * G$ for a general kernel $f$, and next approximations with respect to the Hellinger distance for the case that $f$ is the Laplace kernel. The first result generalizes the result of [31] for normal mixtures. See also [71] for results on Dirichlet mixtures of exponential power densities.

The result splits into two cases, depending on the tail behaviour of the Fourier transform $\tilde f$ of $f$:

- ordinary smooth $f$: $\limsup_{|\lambda| \to \infty} |\tilde f(\lambda)| \, |\lambda|^\beta < \infty$, for some $\beta > 1/2$;

- supersmooth $f$: $\limsup_{|\lambda| \to \infty} |\tilde f(\lambda)| \, e^{|\lambda|^\beta} < \infty$, for some $\beta > 0$.

Lemma 1.3 (Approximation Lemma). Let $\varepsilon < 1$ be sufficiently small and fixed. For a probability measure $G$ on an interval $[-a, a]$ and $2 \le q \le \infty$, there exists a discrete measure $G'$ with at most $N$ support points in $[-a, a]$ such that
\[
\|p_G - p_{G'}\|_q \lesssim \varepsilon,
\]
where

(i) $N \lesssim \varepsilon^{-(\beta - p^{-1})^{-1}}$ if $f$ is ordinary smooth of order $\beta$ with $\beta > p^{-1}$, for $p$ and $q$ conjugate ($p^{-1} + q^{-1} = 1$);

(ii) $N \lesssim (\log \varepsilon^{-1})^{\max(1, \beta^{-1})}$ if $f$ is supersmooth of order $\beta$.

Proof. The Fourier transform of $p_G$ is given by $\tilde f \tilde G$, where $\tilde G$ is the Fourier transform of $G$, defined by $\tilde G(\lambda) = \int e^{\imath \lambda z} \, dG(z)$. Determine $G'$ so that it possesses the same moments as $G$ up to order $k - 1$, i.e.
\[
\int z^j \, d(G - G')(z) = 0, \qquad \forall\, 0 \le j \le k - 1.
\]
By Lemma A.1 in [31], $G'$ can be chosen to have at most $k$ support points.

Then, for $G$ and $G'$ supported on $[-a, a]$, we have
\[
\big| \tilde G(\lambda) - \tilde G'(\lambda) \big|
= \Big| \int \Big( e^{\imath \lambda z} - \sum_{j=0}^{k-1} \frac{(\imath \lambda z)^j}{j!} \Big) \, d(G - G')(z) \Big|
\le \int \frac{|\imath \lambda z|^k}{k!} \, d(G + G')(z)
\le \Big( \frac{a e |\lambda|}{k} \Big)^k.
\]
The inequality uses that $\big| e^{\imath y} - \sum_{j=0}^{k-1} (\imath y)^j / j! \big| \le |y|^k / k! \le (e|y|)^k / k^k$, for every $y \in \mathbb{R}$.

Therefore, by the Hausdorff-Young inequality,
\[
\|p_G - p_{G'}\|_q^p \le \frac{1}{2\pi} \int |\tilde f(\lambda)|^p \, \big| \tilde G(\lambda) - \tilde G'(\lambda) \big|^p \, d\lambda
\lesssim \int_{|\lambda| > M} |\tilde f(\lambda)|^p \, d\lambda + \int_{|\lambda| \le M} \Big( \frac{e a |\lambda|}{k} \Big)^{pk} d\lambda.
\]

We denote the first term in the preceding display by $I_1$ and the second term by $I_2$. It is easy to bound $I_2$ as
\[
I_2 \asymp \Big( \frac{e a}{k} \Big)^{kp} \frac{M^{kp+1}}{kp + 1} \lesssim \Big( \frac{e a M}{k} \Big)^{kp+1} \frac{1}{p}.
\]
For $I_1$ we consider separately the cases of ordinary smoothness and supersmoothness.


In the supersmooth case with parameter $\beta$, we note that the function $t^{\beta^{-1} - 1} / e^{\delta t}$ is monotonically decreasing for $t \ge p M^\beta$ when $\delta \ge (\beta^{-1} - 1)/(p M^\beta)$. Thus, for large $M$,
\[
I_1 \lesssim \int_{|\lambda| > M} e^{-p |\lambda|^\beta} \, d\lambda
= \frac{2}{\beta p^{\beta^{-1}}} \int_{t > p M^\beta} e^{-t} \, t^{\beta^{-1} - 1} \, dt
\le \frac{2}{\beta p^{\beta^{-1}}} \, \frac{(p M^\beta)^{\beta^{-1} - 1}}{e^{\delta p M^\beta}} \int_{t > p M^\beta} e^{-(1 - \delta) t} \, dt
= \frac{2}{1 - \delta} \, \frac{1}{\beta p} \, e^{-p M^\beta} M^{1 - \beta},
\]
where the bound is sharper if $\delta$ is smaller. Choosing the minimal value of $\delta$, we obtain
\[
I_1 \lesssim \frac{1}{1 - (\beta^{-1} - 1)/(p M^\beta)} \, \frac{1}{\beta p} \, e^{-p M^\beta} M^{1 - \beta} \lesssim M^{1 - \beta} e^{-p M^\beta},
\]
for $M$ sufficiently large. We next choose $M = 2 (\log(1/\varepsilon))^{1/\beta}$ in order to ensure that $I_1 \le \varepsilon^p$. Then $I_2 \lesssim \varepsilon^p$ if $k \ge 2 e a M$ and $2^{-kp} \le \varepsilon^p$. This is satisfied for $k = 2 (\log \varepsilon^{-1})^{\max(\beta^{-1}, 1)}$.

In the ordinary smooth case with smoothness parameter $\beta$, we have the bound
\[
I_1 \lesssim \int_{|\lambda| > M} |\lambda|^{-\beta p} \, d\lambda \lesssim \Big( \frac{1}{M} \Big)^{\beta p - 1}.
\]
We choose $M = (1/\varepsilon)^{(\beta - 1/p)^{-1}}$ to render the right side equal to $\varepsilon^p$. Then $I_2 \lesssim \varepsilon^p$ if $k = 2 \varepsilon^{-(\beta - 1/p)^{-1}}$.
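To see the lemma in action, take $G$ uniform on $[-a, a]$: the nodes and weights of an $N$-point Gauss-Legendre quadrature rule form a discrete $G'$ matching the moments of $G$ up to order $2N - 1$, exactly the kind of moment matching used in the proof. The sketch below is our own illustration (the grid-based $L_2$-norm and all numerical choices are ours).

```python
import numpy as np

a = 1.0
x = np.linspace(-25, 25, 20001)   # grid for the L2-norm
dx = x[1] - x[0]

def laplace(t):
    return 0.5 * np.exp(-np.abs(t))

def p_uniform(x, a):
    """Closed form of p_G = f * G for G = Uniform[-a, a] and Laplace f."""
    inside = (2.0 - 2.0 * np.exp(-a) * np.cosh(x)) / (4.0 * a)
    outside = np.exp(-np.abs(x)) * (np.exp(a) - np.exp(-a)) / (4.0 * a)
    return np.where(np.abs(x) <= a, inside, outside)

p_G = p_uniform(x, a)
for N in (2, 4, 8):
    nodes, wts = np.polynomial.legendre.leggauss(N)
    nodes, wts = a * nodes, wts / 2.0          # rescale to [-a, a]; weights sum to 1
    p_Gp = laplace(x[:, None] - nodes[None, :]) @ wts
    print(N, np.sqrt(np.sum((p_G - p_Gp) ** 2) * dx))   # L2-error shrinks with N
```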

The number of support points in the preceding lemma is increasing in $q$ and decreasing in $\beta$. For approximation in the $L_2$-norm ($q = 2$), the number of support points is of order $\varepsilon^{-1/(\beta - 1/2)}$, which reduces to $\varepsilon^{-2/3}$ for the Laplace kernel (ordinary smooth with $\beta = 2$). An interpretation of the exponent $\beta - 1/2$ is as the (almost) Sobolev smoothness of $p_G$, since, for $\alpha < \beta - 1/2$,
\[
\int (1 + |\lambda|^2)^\alpha |\tilde p_G(\lambda)|^2 \, d\lambda \lesssim \int (1 + |\lambda|^2)^\alpha |\tilde f(\lambda)|^2 \, d\lambda < \infty.
\]
We do not have a compelling intuition for this correspondence.

The Hellinger distance is more sensitive to areas where the densities are close to zero. As a consequence, the approach of the preceding lemma does not give sharp results for it. The following lemma does, but it is restricted to the Laplace kernel.

Lemma 1.4. For a probability measure $G$ supported on $[-a, a]$ there exists a discrete measure $G'$ with at most $N \asymp \varepsilon^{-2/3}$ support points such that, for $p_G = f * G$ and $f$ the Laplace density,
\[
h(p_G, p_{G'}) \le \varepsilon.
\]

Proof. Since $p_G(x) \ge f(|x| + a) = e^{-a} e^{-|x|}/2$, for every $x$ and every probability measure $G$ supported on $[-a, a]$, the Hellinger distance between Laplace mixtures satisfies
\[
h^2(p_G, p_{G'}) \le \int \frac{(p_G - p_{G'})^2}{p_G + p_{G'}}(x) \, dx \le e^a \int \big( p_G(x) - p_{G'}(x) \big)^2 e^{|x|} \, dx.
\]

If we write $q_G(x) = p_G(x) e^{|x|/2}$, and $\tilde q_G$ for the corresponding Fourier transform, then by Plancherel's theorem the integral on the right side is equal to
\[
\frac{1}{2\pi} \int |\tilde q_G - \tilde q_{G'}|^2(\lambda) \, d\lambda.
\]

By an explicit computation we obtain
\[
\tilde q_G(\lambda) = \frac12 \int\!\!\int e^{\imath \lambda x} e^{-|x - z| + |x|/2} \, dx \, dG(z) = \frac12 \int r(\lambda, z) \, dG(z),
\]
where $r(\lambda, z)$ is given by
\[
r(\lambda, z) = \frac{e^{-z}}{\imath \lambda + 1/2} + e^{-z} \, \frac{e^{(\imath \lambda + 3/2) z} - 1}{\imath \lambda + 3/2} - \frac{e^{(\imath \lambda + 1/2) z}}{\imath \lambda - 1/2}
= \frac{e^{-z}}{(\imath \lambda + 1/2)(\imath \lambda + 3/2)} - \frac{2 e^{\imath \lambda z} e^{z/2}}{(\imath \lambda + 3/2)(\imath \lambda - 1/2)}. \tag{1.4}
\]
Now let $G'$ be a discrete measure on $[-a, a]$ such that
\[
\int e^{-z} \, d(G' - G)(z) = 0, \qquad \int e^{z/2} z^j \, d(G' - G)(z) = 0, \quad \forall\, 0 \le j \le k - 1.
\]
By Lemma A.1 in [31], $G'$ can be chosen to have at most $k + 1$ support points.

By the choice of $G'$, the first term of $r(\lambda, z)$ gives no contribution to the difference $\int r(\lambda, z) \, d(G' - G)(z)$. As the second term of $r(\lambda, z)$ is, for large $|\lambda|$, bounded in absolute value by a multiple of $|\lambda|^{-2}$, it follows that
\[
I_2 := \int_{|\lambda| > M} \Big| \int r(\lambda, z) \, d(G' - G)(z) \Big|^2 \, d\lambda \lesssim \int_{\lambda > M} \lambda^{-4} \, d\lambda \asymp M^{-3}.
\]


Again by the choice of $G'$, the integral $\int r(\lambda, z) \, d(G' - G)(z)$ remains the same if we replace $e^{\imath \lambda z}$ by $e^{\imath \lambda z} - \sum_{j=0}^{k} (\imath \lambda z)^j / j!$ in the second term of $r(\lambda, z)$. It follows that
\[
I_1 := \int_{|\lambda| \le M} \Big| \int r(\lambda, z) \, d(G' - G)(z) \Big|^2 \, d\lambda
\le \int_{|\lambda| \le M} \Big| \frac{2}{(\imath \lambda + 1/2)(\imath \lambda + 3/2)} \Big|^2 \, \Big| \int e^{z/2} \Big[ e^{\imath \lambda z} - \sum_{j=0}^{k} \frac{(\imath \lambda z)^j}{j!} \Big] \, d(G' - G)(z) \Big|^2 \, d\lambda
\lesssim \int_0^M \frac{(a \lambda)^{2k}}{(k!)^2} \, d\lambda \lesssim \frac{(a e M)^{2k+1}}{k^{2k+1}}.
\]
It follows, by an argument similar to that in the proof of Lemma 1.3, that we can reduce both $I_1$ and $I_2$ to $\varepsilon^2$ by choosing $M \asymp \varepsilon^{-2/3}$ and $k = 2 a e M$.

1.5 entropy

We study the covering numbers of the class of mixtures $p_G = f * G$, where $G$ ranges over the collection $\mathcal{M}[-a, a]$ of all probability measures on $[-a, a]$. We present a bound for any $L_r$-norm and general kernels $f$, and a bound for the Hellinger distance that is specific to the Laplace kernel. Here $f^{(1)}$ denotes the first derivative of $f$.

Proposition 1.5. If both $\|f\|_r$ and $\|f^{(1)}\|_r$ are finite and $\tilde f$ has ordinary smoothness $\beta$, then, for $p_G = f * G$ and any $r \ge 2$,
\[
\log N\big( \varepsilon, \{p_G : G \in \mathcal{M}[-a, a]\}, \|\cdot\|_r \big) \lesssim \varepsilon^{-(\beta - 1 + 1/r)^{-1}} \log \varepsilon^{-1}. \tag{1.5}
\]

Proof. We construct an $\varepsilon$-net of $\mathcal{G}_a = \{p_G : G \in \mathcal{M}[-a, a]\}$ from the collection of all $p_G$ such that the mixing measure $G \in \mathcal{M}[-a, a]$ is discrete with at most $N \le D \varepsilon^{-(\beta - 1 + r^{-1})^{-1}}$ support points, for a suitable constant $D$.

In light of Lemma 1.3, the set of all mixtures $p_G$ with $G$ a discrete probability measure with $N \lesssim \varepsilon^{-(\beta - 1 + r^{-1})^{-1}}$ support points forms an $\varepsilon$-net over the set of all mixtures $p_G$ as in the lemma. It therefore suffices to construct an $\varepsilon$-net of the given cardinality over this set of discrete mixtures.

By Jensen's inequality and Fubini's theorem,
\[
\|f(\cdot - \theta) - f\|_r = \Big( \int \Big| \theta \int_0^1 f^{(1)}(x - \theta s) \, ds \Big|^r dx \Big)^{1/r} \le \|f^{(1)}\|_r \, \theta.
\]
Furthermore, for any probability vectors $p$ and $p'$ and locations $\theta_i$,
\[
\Big\| \sum_{i=1}^N p_i f(\cdot - \theta_i) - \sum_{i=1}^N p_i' f(\cdot - \theta_i) \Big\|_r \le \sum_{i=1}^N |p_i - p_i'| \, \|f(\cdot - \theta_i)\|_r = \|f\|_r \, \|p - p'\|_1.
\]
Combining these inequalities, we see that for two discrete probability measures $G = \sum_{i=1}^N p_i \delta_{\theta_i}$ and $G' = \sum_{i=1}^N p_i' \delta_{\theta_i'}$,
\[
\|p_G - p_{G'}\|_r \le \|f^{(1)}\|_r \max_i |\theta_i - \theta_i'| + \|f\|_r \, \|p - p'\|_1. \tag{1.6}
\]
Thus we can construct an $\varepsilon$-net over the discrete mixtures by relocating the support points $(\theta_i)_{i=1}^N$ to the nearest points $(\theta_i')_{i=1}^N$ in an $\varepsilon$-net on $[-a, a]$, and relocating the weight vector $p$ to the nearest point $p'$ in an $\varepsilon$-net for the $\ell_1$-norm over the $N$-dimensional $\ell_1$-unit simplex. This gives a set of at most
\[
\Big( \frac{2a}{\varepsilon} \Big)^N \Big( \frac{5}{\varepsilon} \Big)^N \sim \Big( \frac{10 a}{\varepsilon^2} \Big)^N
\]
measures $p_G$ (cf. Lemma A.4 of [33] for the entropy of the $\ell_1$-unit simplex). This gives the bound of the proposition.
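Taking logarithms makes the final counting step explicit (our elaboration): with $N \lesssim \varepsilon^{-(\beta - 1 + 1/r)^{-1}}$,
\[
\log N\big( \varepsilon, \mathcal{G}_a, \|\cdot\|_r \big) \lesssim N \log \frac{10 a}{\varepsilon^2} \lesssim \varepsilon^{-(\beta - 1 + 1/r)^{-1}} \log \varepsilon^{-1},
\]
which is exactly the bound (1.5).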

Proposition 1.6. For $f$ the Laplace kernel and $p_G = f * G$,
\[
\log N\big( \varepsilon, \{p_G : G \in \mathcal{M}[-a, a]\}, h \big) \lesssim \varepsilon^{-2/3} \log(\varepsilon^{-1}). \tag{1.7}
\]

Proof. Since the function $\sqrt f$ is absolutely continuous with derivative $x \mapsto -2^{-3/2} e^{-|x|/2} \operatorname{sgn}(x)$, we have by Jensen's inequality and Fubini's theorem that
\[
h^2\big( f, f(\cdot - \theta) \big) = \int \Big( \theta \int_0^1 -2^{-3/2} e^{-|x - \theta s|/2} \operatorname{sgn}(x - \theta s) \, ds \Big)^2 dx
\le \theta^2 \int_0^1 \int e^{-|x - \theta s|} \, dx \, ds = 2 \theta^2.
\]
It follows that $h(f, f(\cdot - \theta)) \lesssim \theta$.


By convexity of the map $(u, v) \mapsto (\sqrt u - \sqrt v)^2$, we have
\[
\Big( \sqrt{\sum_i p_i f(\cdot - \theta_i)} - \sqrt{\sum_i p_i f(\cdot - \theta_i')} \Big)^2 \le \sum_i p_i \Big[ \sqrt{f(\cdot - \theta_i)} - \sqrt{f(\cdot - \theta_i')} \Big]^2.
\]
By integrating this inequality we see that the densities $p_G$ and $p_{G'}$ with mixing distributions $G = \sum_{i=1}^N p_i \delta_{\theta_i}$ and $G' = \sum_{i=1}^N p_i \delta_{\theta_i'}$ satisfy
\[
h^2(p_G, p_{G'}) \lesssim \sum_i p_i |\theta_i - \theta_i'|^2 \le \|\theta - \theta'\|_\infty^2.
\]
For distributions $G = \sum_{i=1}^N p_i \delta_{\theta_i}$ and $G' = \sum_{i=1}^N p_i' \delta_{\theta_i}$ with the same support points but different weights, we have
\[
h^2(p_G, p_{G'}) \le \int \frac{\big( \sum_{i=1}^N (p_i - p_i') f(x - \theta_i) \big)^2}{\sum_{i=1}^N (p_i + p_i') f(x - \theta_i)} \, dx
\le \int \Big( \sum_{i=1}^N |p_i - p_i'| \Big)^2 \frac{f^2(|x| - a)}{2 f(|x| + a)} \, dx \lesssim \|p - p'\|_1^2.
\]
Therefore the bound follows by arguments similar to those in the proof of Proposition 1.5, where presently we use Lemma 1.4 to determine suitable finite approximations.

The map $G \mapsto p_G = f * G$ is one-to-one as soon as the characteristic function of $f$ is nowhere zero. Under this condition we can also view the Wasserstein distance on the mixing distributions as a distance on the mixtures. Obviously the covering numbers are then free of the kernel.

Proposition 1.7. For any $k \ge 1$ and any sufficiently small $\varepsilon > 0$,
\[
\log N\big( \varepsilon, \mathcal{M}[-a, a], W_k \big) \lesssim \varepsilon^{-1} \log \varepsilon^{-1}. \tag{1.8}
\]
The proposition is a consequence of Lemma 1.9 below, which applies to the set of all Borel probability measures on a general metric space $(\Theta, \rho)$ (cf. [61]).

Lemma 1.8. For any probability measure $G$ concentrated on countably many disjoint sets $\Theta_1, \Theta_2, \dots$ and probability measure $G'$ concentrated on disjoint sets $\Theta_1', \Theta_2', \dots$,
\[
W_k(G, G') \le \sup_i \sup_{\theta_i \in \Theta_i,\, \theta_i' \in \Theta_i'} \rho(\theta_i, \theta_i') + \operatorname{diam}(\Theta) \Big( \sum_i |G(\Theta_i) - G'(\Theta_i')| \Big)^{1/k}.
\]
In particular,
\[
W_k\Big( \sum_i p_i \delta_{\theta_i}, \sum_i p_i' \delta_{\theta_i'} \Big) \le \max_i \rho(\theta_i, \theta_i') + \operatorname{diam}(\Theta) \, \|p - p'\|_1^{1/k}.
\]

Proof. For $p_i = G(\Theta_i)$ and $p_i' = G'(\Theta_i')$, divide the interval $[0, \sum_i p_i \wedge p_i']$ into disjoint intervals $I_i$ of lengths $p_i \wedge p_i'$. We couple variables $\bar\theta$ and $\bar\theta'$ through an auxiliary uniform variable $U$. If $U \in I_i$, then generate $\bar\theta \sim G(\cdot \mid \Theta_i)$ and $\bar\theta' \sim G'(\cdot \mid \Theta_i')$. Divide the remaining interval $[\sum_i p_i \wedge p_i', 1]$ into intervals $J_i$ of lengths $p_i - p_i \wedge p_i'$ and, separately, into intervals $J_i'$ of lengths $p_i' - p_i \wedge p_i'$. If $U \in J_i$, then generate $\bar\theta \sim G(\cdot \mid \Theta_i)$, and if $U \in J_i'$, then generate $\bar\theta' \sim G'(\cdot \mid \Theta_i')$. Then $\bar\theta$ and $\bar\theta'$ have marginal distributions $G$ and $G'$, and
\[
\mathbb{E} \rho^k(\bar\theta, \bar\theta') \le \mathbb{E}\Big[ \rho^k(\bar\theta, \bar\theta') \, \mathbf{1}\Big\{ U \le \sum_i p_i \wedge p_i' \Big\} \Big] + \operatorname{diam}(\Theta)^k \, \mathbb{P}\Big( U > \sum_i p_i \wedge p_i' \Big).
\]
The first term is bounded by the $k$-th power of the first term of the lemma, while the probability in the second term equals $1 - \sum_i p_i \wedge p_i' = \sum_i |p_i - p_i'| / 2$.
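For $k = 1$ the bound of Lemma 1.8 is easy to compare with the exact Wasserstein distance. The sketch below is our own check (using SciPy's one-dimensional $W_1$ routine; the random example measures are ours).

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
N, diam = 5, 2.0                                  # Theta = [-1, 1], diam(Theta) = 2
theta = np.sort(rng.uniform(-1, 1, N))
theta_p = theta + rng.uniform(-0.05, 0.05, N)     # perturbed support points
p, p_p = rng.dirichlet(np.ones(N)), rng.dirichlet(np.ones(N))

exact = wasserstein_distance(theta, theta_p, p, p_p)
bound = np.max(np.abs(theta - theta_p)) + diam * np.sum(np.abs(p - p_p))
print(exact, bound)    # the exact W_1 never exceeds the bound of Lemma 1.8
```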

Lemma 1.9. For the set $\mathcal{M}(\Theta)$ of all Borel probability measures on a metric space $(\Theta, \rho)$, any $k \ge 1$, and $0 < \varepsilon < \min\{2/3, \operatorname{diam}(\Theta)\}$,
\[
N\big( \varepsilon, \mathcal{M}(\Theta), W_k \big) \le \Big( \frac{4 \operatorname{diam}(\Theta)}{\varepsilon} \Big)^{k N(\varepsilon, \Theta, \rho)}.
\]

Proof. For a minimal $\varepsilon$-net over $\Theta$ of $N = N(\varepsilon, \Theta, \rho)$ points, let $\Theta = \cup_i \Theta_i$ be the partition obtained by assigning each $\theta$ to a closest point of the net. For any $G$, let $G_\varepsilon = \sum_i G(\Theta_i) \delta_{\theta_i}$, for arbitrary but fixed $\theta_i \in \Theta_i$. Since $W_k(G, G_\varepsilon) \le \varepsilon$ by Lemma 1.8, $N(2\varepsilon, \mathcal{M}(\Theta), W_k) \le N(\varepsilon, \mathcal{M}_\varepsilon, W_k)$ holds for $\mathcal{M}_\varepsilon$ the set of all $G_\varepsilon$. We next form the measures $G_{\varepsilon, p} = \sum_i p_i \delta_{\theta_i}$ for $(p_1, \dots, p_N)$ ranging over an $(\varepsilon / \operatorname{diam}(\Theta))^k$-net for the $\ell_1$-distance over the $N$-dimensional unit simplex. By Lemma 1.8, every $G_\varepsilon$ is within $W_k$-distance $\varepsilon$ of some $G_{\varepsilon, p}$. Therefore the proof is complete, because $N(\varepsilon, \mathcal{M}_\varepsilon, W_k)$ is bounded from above by the number of points $p$, which is bounded by $(4 \operatorname{diam}(\Theta)/\varepsilon)^{kN}$ (cf. Lemma A.4 in [31]).

1.6 prior mass

The main result of this section is the following proposition, which gives a lower bound on the prior mass that the prior (i)-(iv) puts in a neighbourhood of a mixture $p_{G_0}$.

Proposition 1.10. If $\Pi$ is the Dirichlet process $\mathrm{DP}(\alpha)$ with a base measure $\alpha$ that has a Lebesgue density bounded away from 0 and $\infty$ on its support $[-a, a]$, and $f$ is the Laplace kernel, then for every sufficiently small $\varepsilon > 0$ and every probability measure $G_0$ on $[-a, a]$,
\[
\log \Pi\big( G : K(p_G, p_{G_0}) \le \varepsilon^2, \, K_2(p_G, p_{G_0}) \le \varepsilon^2 \big) \gtrsim - \Big( \frac{1}{\varepsilon} \Big)^{2/3} \log\Big( \frac{1}{\varepsilon} \Big).
\]

Proof. By Lemma 1.4 there exists a discrete measure $G_1$ with $N \lesssim \varepsilon^{-2/3}$ support points such that $h(p_{G_0}, p_{G_1}) \le \varepsilon$. We may assume that the support points of $G_1$ are at least $2\varepsilon^2$-separated. If not, we take a maximal $2\varepsilon^2$-separated set in the support points of $G_1$, and replace $G_1$ by the discrete measure $G_1'$ obtained by relocating the masses of $G_1$ to the nearest points in the $2\varepsilon^2$-net. Then $h(p_{G_1}, p_{G_1'}) \lesssim \varepsilon^2$, as seen in the proof of Proposition 1.6.

Now by Lemmas 1.11 and 1.12, if $G_1 = \sum_{j=1}^N p_j \delta_{z_j}$, with the support points $z_j$ at least $2\varepsilon^2$-separated,
\[
\big\{ G : \max(K, K_2)(p_{G_0}, p_G) < d_1 \varepsilon^2 \big\}
\supset \big\{ G : h(p_{G_0}, p_G) \le 2\varepsilon \big\}
\supset \big\{ G : h(p_{G_1}, p_G) \le \varepsilon \big\}
\supset \big\{ G : \|p_G - p_{G_1}\|_1 \le d_2 \varepsilon^2 \big\}
\supset \Big\{ G : \sum_{j=1}^N \big| G[z_j - \varepsilon^2, z_j + \varepsilon^2] - p_j \big| \le \varepsilon^2 \Big\}.
\]

Since the base measure $\alpha$ has, by assumption, a density bounded away from zero and infinity on $[-a, a]$, Lemma A.2 of [31] gives
\[
\log \Pi\Big( G : \sum_{j=1}^N \big| G[z_j - \varepsilon^2, z_j + \varepsilon^2] - p_j \big| \le \varepsilon^2 \Big) \gtrsim -N \log \varepsilon^{-1}.
\]
The proposition follows upon combining the preceding displays.

Lemma 1.11. If $G' = \sum_{j=1}^N p_j \delta_{z_j}$ is a probability measure supported on points $z_1, \dots, z_N$ in $\mathbb{R}$ with $|z_j - z_k| > 2\varepsilon$ for $j \ne k$, then for any probability measure $G$ on $\mathbb{R}$ and any kernel $f$ with derivative $f^{(1)}$,
\[
\|p_G - p_{G'}\|_1 \le 2 \|f^{(1)}\|_1 \, \varepsilon + 2 \sum_{j=1}^N \big| G[z_j - \varepsilon, z_j + \varepsilon] - p_j \big|.
\]


Lemma 1.12. If $G$ and $G'$ are probability measures on $[-a, a]$, and $f$ is the Laplace kernel, then
\[
h^2(p_G, p_{G'}) \lesssim \|p_G - p_{G'}\|_2, \tag{1.9}
\]
\[
\max\big( K(p_G, p_{G'}), K_2(p_G, p_{G'}) \big) \lesssim h^2(p_G, p_{G'}). \tag{1.10}
\]

Proofs. The first lemma is a generalization of Lemma 4 in [33] from the normal to general kernels, and is proved in the same manner. We omit further details.

In view of the shape of the Laplace kernel, it is easy to see that, for $G$ compactly supported on $[-a, a]$,
\[
f(|x| + a) \le p_G(x) \le f(|x| - a).
\]
We bound the squared Hellinger distance as follows:
\[
h^2(p_G, p_{G'}) \le \int \frac{(p_G - p_{G'})^2}{p_G + p_{G'}} \, dx
\le \int_{|x| \le A} e^{A + a} (p_G - p_{G'})^2 \, dx + \int_{|x| > A} (p_G + p_{G'}) \, dx
\lesssim e^a \|p_G - p_{G'}\|_2^2 \, e^A + e^{-A}.
\]
By the elementary inequality $t + u/t \ge 2\sqrt{u}$, for $u, t > 0$, we obtain (1.9) upon choosing $A = \min\big( a, \log \|p_G - p_{G'}\|_2^{-1} - a/2 \big)$.

For the proof of the second assertion we first note that, if both $G$ and $G'$ are compactly supported on $[-a, a]$,
\[
\frac{p_G(x)}{p_{G'}(x)} \le \frac{f(|x| - a)}{f(|x| + a)} \le e^{2a}.
\]
Therefore $\|p_G / p_{G'}\|_\infty \le e^{2a}$, and (1.10) follows by Lemma 8 in [33].
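The envelope bounds used above follow from $|x| - a \le |x - z| \le |x| + a$ for $z \in [-a, a]$, and are easy to verify numerically. The sketch below is our own check (all numerical choices ours; the upper bound is read as $e^{-(|x| - a)}/2$, which coincides with $f(|x| - a)$ for $|x| \ge a$).

```python
import numpy as np

rng = np.random.default_rng(2)
a = 1.0
x = np.linspace(-10, 10, 2001)

atoms = rng.uniform(-a, a, 20)             # random discrete G on [-a, a]
w = rng.dirichlet(np.ones(20))
p_G = (0.5 * np.exp(-np.abs(x[:, None] - atoms[None, :]))) @ w

lower = 0.5 * np.exp(-(np.abs(x) + a))     # f(|x| + a)
upper = 0.5 * np.exp(-(np.abs(x) - a))     # e^{-(|x| - a)}/2
assert np.all(lower <= p_G + 1e-12) and np.all(p_G <= upper + 1e-12)
```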

1.7 proof of theorem 1.1

The basic theorem of [30] gives a posterior contraction rate in terms of a metric on densities that is bounded above by the Hellinger distance. In the present situation, for the proofs of Theorems 1.1 and 1.2, we would like to apply this result to a power smaller than one of the Wasserstein metric and a power smaller than one of the $L_q$-distance, respectively, neither of which is a metric.

Consider a general "discrepancy measure" $d$, i.e. a map $d : \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ on the product of the set of densities on a given measurable space with itself, possessing the following properties, for some constant $C > 0$:

(a) $d(x, y) \ge 0$;

(b) $d(x, y) = 0$ if and only if $x = y$;

(c) $d(x, y) = d(y, x)$;

(d) $d(x, y) \le C \big( d(x, z) + d(y, z) \big)$.

Thus $d$ is a metric except that the triangle inequality is replaced by a weaker condition involving a constant $C$, possibly bigger than 1. Call a set of the form $\{x : d(x, y) < c\}$ a $d$-ball, and define covering numbers $N(\varepsilon, \mathcal{P}, d)$ relative to $d$ as usual.

Let $\Pi_n(\cdot \mid X_1, \dots, X_n)$ be the posterior distribution of $p$ given an i.i.d. sample $X_1, \dots, X_n$ from a density $p$ that is equipped with a prior probability distribution $\Pi$.

Theorem 1.13. Suppose $d$ has the properties given above, satisfies $d(p_0, p) \le h(p_0, p)$ for every $p \in \mathcal{P}$, and is such that the $d$-balls $\{p : d(p, p') < \delta\}$ are convex. Then $\Pi_n(d(p, p_0) > M \varepsilon_n \mid X_1, \dots, X_n) \to 0$ in $P_0^n$-probability, for any $\varepsilon_n$ such that $n \varepsilon_n^2 \to \infty$ and such that, for positive constants $c_1$, $c_2$ and sets $\mathcal{P}_n \subset \mathcal{P}$,
\[
\log N(\varepsilon_n, \mathcal{P}_n, d) \le c_1 n \varepsilon_n^2, \tag{1.11}
\]
\[
\Pi_n\big( p : K(p_0, p) < \varepsilon_n^2, \, K_2(p_0, p) < \varepsilon_n^2 \big) \ge e^{-c_2 n \varepsilon_n^2}, \tag{1.12}
\]
\[
\Pi_n(\mathcal{P} - \mathcal{P}_n) \le e^{-(c_2 + 4) n \varepsilon_n^2}. \tag{1.13}
\]
We defer the proof to Appendix A.
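To connect the theorem with the earlier bounds (our elaboration, not in the original text): Propositions 1.6 and 1.10 give entropy and prior-mass exponents of the same order, so conditions (1.11) and (1.12) balance when
\[
n \varepsilon_n^2 \asymp \varepsilon_n^{-2/3} \log \varepsilon_n^{-1} \quad \Longrightarrow \quad \varepsilon_n \asymp \Big( \frac{\log n}{n} \Big)^{3/8},
\]
which is the Hellinger rate of Theorem 1.2.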

The proof of Theorem 1.1 is based on the following comparison between the Wasserstein and Hellinger metrics. The lemma improves and generalizes Theorem 2 in [61]. We choose the constant $C_k$ carefully, to make sure that the map $\varepsilon \mapsto \varepsilon [\log(C_k/\varepsilon)]^{k+1/2}$ is monotone on $(0, 2]$.

Lemma 1.14. For probability measures $G$ and $G'$ supported on $[-a, a]$, and $p_G = f * G$ for a probability density $f$ with $\inf_\lambda (1 + |\lambda|^\beta) |\tilde f(\lambda)| > 0$, and any $k \ge 1$,
\[
W_k(G, G') \lesssim h(p_G, p_{G'})^{1/(k+\beta)} \Big( \log \frac{C_k}{h(p_G, p_{G'})} \Big)^{(k+1/2)/(k+\beta)}.
\]
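For the Laplace kernel ($\beta = 2$) and $k = 1$, the lemma specializes (our instantiation) to
\[
W_1(G, G') \lesssim h(p_G, p_{G'})^{1/3} \Big( \log \frac{C_1}{h(p_G, p_{G'})} \Big)^{1/2};
\]
combined with the Hellinger rate $(\log n / n)^{3/8}$ of Theorem 1.2, this yields the $W_1$-rate $n^{-1/8} (\log n)^{5/8}$ of Theorem 1.1.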

Proof. By Theorem 6.15 in [76] the Wasserstein distance $W_k(G, G')$ is bounded above by a multiple of the $k$-th root of $\int |x|^k \, d|G - G'|(x)$, where $|G - G'|$ is the total variation measure of the difference $G - G'$.
