
Cover Page

The handle http://hdl.handle.net/1887/49012 holds various files of this Leiden University dissertation.

Author: Gao, F.

Title: Bayes and networks

Issue Date: 2017-05-23


Fengnan Gao

BAYES & NETWORKS

Shanghai, April 2017


Fengnan Gao: Bayes & Networks, Dirichlet-Laplace Deconvolution and Statistical Inference in Preferential Attachment Networks, © April 2017

The author designed the cover himself. In the bottom-right corner lies a phoenix, which the character Feng in the author's name denotes in Chinese.

Title Page: The decoration in the margin was modified from code published on TeX StackExchange by Gonzalo Medina. The network illustration was distributed by Till Tantau, the author of TikZ, under the GNU Free Documentation License.

All rights reserved. No part of this publication may be reproduced in any form or by any electronic or mechanical means, including information storage and retrieval systems, without the prior written permission of the author.

A catalogue record is available from the Leiden University Library.

The research in the dissertation was supported by the Netherlands Organization for Scientific Research (NWO).


Bayes & Networks

Dissertation

to obtain

the degree of Doctor at Leiden University, on the authority of the Rector Magnificus Prof. mr. C. J. J. M. Stolker,

by decision of the Doctorate Board, to be defended on Tuesday 23 May 2017

at 10:00

by

Fengnan Gao, born in Jiangsu, China,

in 1988


Composition of the doctoral committee:

Promotor:

Prof. dr. A. W. van der Vaart (Universiteit Leiden)

Other members:

Prof. dr. B. de Smit (Universiteit Leiden, chair)
Prof. dr. J. J. Meulman (Universiteit Leiden, secretary)
Prof. dr. R. W. van der Hofstad (TU Eindhoven)
Prof. dr. J. H. van Zanten (Universiteit van Amsterdam)
Dr. R. M. Castro (TU Eindhoven)
Dr. A. J. Schmidt-Hieber (Universiteit Leiden)


To my family


回首向来萧瑟处 归去

也无风雨也无晴

苏轼

宋神宗元丰五年

Looking back over the bleak passage survived, The return in time,

Shall not be affected by windswept rain or sunshine.

Su Shi (1082)


CONTENTS

I   Nonparametric Bayesian Dirichlet-Laplace Deconvolution

1   Posterior Contraction Rates for Deconvolution of Dirichlet-Laplace Mixtures
    1.1 Introduction
    1.2 Notation and Preliminaries
    1.3 Main Results
    1.4 Finite Approximation
    1.5 Entropy
    1.6 Prior Mass
    1.7 Proof of Theorem 1.1
    1.8 Proof of Theorem 1.2
    1.9 Normal Mixtures

II  Statistical Inference in Preferential Attachment Networks

2   Introduction to Networks
    2.1 Network Science
        2.1.1 The emergence of network science
        2.1.2 Fundamentals of graph theory
        2.1.3 Properties of typical networks
    2.2 Preferential Attachment Networks
        2.2.1 History and motivation of the PA networks
        2.2.2 A rather general PA model
        2.2.3 The linear PA models with random initial degrees
        2.2.4 The general sublinear PA models
        2.2.5 The general sublinear parametric PA models

3   Estimation of General PA Networks
    3.1 Introduction
    3.2 Empirical Estimator
    3.3 Branching Process
        3.3.1 Rooted ordered tree
        3.3.2 Branching process
        3.3.3 The continuous random tree model
    3.4 Consistency
    3.5 Simulation Studies
        3.5.1 Sample variance study
        3.5.2 Asymptotic normality?

4   Estimation of Affine PA Networks
    4.1 Introduction and Notation
    4.2 Construction of the MLE
    4.3 Consistency
    4.4 Asymptotic Normality
    4.5 Local Asymptotic Normality and Efficiency
    4.6 The Case of Fixed Initial Degree
    4.7 Quasi-Maximum-Likelihood Estimator
    4.8 Simulation Study
        4.8.1 On the shoulder of the giants
        4.8.2 The majority rules

5   Estimation of Parametric PA Networks
    5.1 Introduction and Notation
    5.2 Construction of the MLE
    5.3 Consistency
    5.4 Asymptotic Normality
    5.5 A Remedy to a Historical Problem

III Modeling the Dynamics of the Movie-Actor Network

6   Modeling the Dynamics of the Movie-Actor Network of the Internet Movie Database
    6.1 Introduction
    6.2 Conceptual Model Description
    6.3 Empirical Fitting to the IMDb Dataset
        6.3.1 Movie sizes
        6.3.2 Number of new actors
        6.3.3 PA function
        6.3.4 PA function on movie degrees
        6.3.5 Model fitting
    6.4 Simulations
    6.5 Theoretical Study
    6.6 Conclusion and Future Work

IV  Appendix

A   Dirichlet Processes and Contraction Rates Relative to Non-metrics
    A.1 Dirichlet Processes
    A.2 Contraction Rates Relative to Non-metrics

B   Convergence to a Power Law of the Movie Degrees in the PAM-IMDb Model
    B.1 Introduction and Heuristics
    B.2 Proof of Theorem B.1
        B.2.1 Concentration around the mean
        B.2.2 Identification of the mean sequence

Bibliography
Summary
Samenvatting
Acknowledgements
Curriculum Vitæ
Colophon


LIST OF FIGURES

Figure 2.1   Seven Bridges of Königsberg
Figure 3.1   Boxplots of EEs in different settings
Figure 3.2   Sample variance study of the EE
Figure 3.3   QQ-plots of empirical estimators
Figure 3.4   Histogram of the rescaled empirical estimator
Figure 3.5   Estimated density of $\sqrt{n}(\hat r_2(n) - r_2)$ with different network sizes $n$
Figure 4.1   Log-log plot of empirical degree distribution vs. degree in PA networks
Figure 4.2   Histogram of the MLE in affine PA networks
Figure 6.1   Histogram of movie sizes in 1947
Figure 6.2   Log-log histogram of movie sizes
Figure 6.3   Log-log histogram of all movie sizes until the end of 2007
Figure 6.4   Ratio of new actors of movies in 1971
Figure 6.6   Actor degree evolution
Figure 6.7   Fitting a straight line on the log-log histogram of actor degrees
Figure 6.8   $\log(1 - \hat F_N(k))$-vs.-$\log k$ plot of actor degrees
Figure 6.9   Movie degree evolution
Figure 6.10  Fitting a straight line on the log-log histogram starting from $k = 40$ on movie degrees
Figure 6.11  $\log(1 - \hat F_N(k))$-vs.-$\log k$ plot of movie degrees in year 1947
Figure 6.12  Histogram of movie degrees by 1950
Figure 6.13  Movie degree comparisons between simulation and real dataset
Figure 6.14  Actor degree comparison between simulation and real dataset

LIST OF TABLES

Table 2.1    Representative networks
Table 4.1    Summary of the performance of the MLE in affine PA networks


Part I

NONPARAMETRIC BAYESIAN DIRICHLET-LAPLACE DECONVOLUTION


1 POSTERIOR CONTRACTION RATES FOR DECONVOLUTION OF DIRICHLET-LAPLACE MIXTURES

1.1 introduction

Consider statistical inference using the following nonparametric hierarchical Bayesian model for observations $X_1, \dots, X_n$:

(i) A probability distribution $G$ on $\mathbb{R}$ is generated from the Dirichlet process prior $\mathrm{DP}(\alpha)$ with base measure $\alpha$.

(ii) An iid sample $Z_1, \dots, Z_n$ is generated from $G$.

(iii) An iid sample $e_1, \dots, e_n$ is generated from a known density $f$, independently of the other samples.

(iv) The observations are $X_i = Z_i + e_i$, for $i = 1, \dots, n$.

In this setting, given $G$, the data $X_1, \dots, X_n$ are a sample from the convolution
\[
p_G = f * G
\]
of the density $f$ and the measure $G$. The scheme defines a conditional distribution of $G$ given the data $X_1, \dots, X_n$, the posterior distribution of $G$, and consequently also posterior distributions for quantities derived from $G$, including the convolution density $p_G$. We are interested in whether this posterior distribution can recover a true mixing distribution $G_0$ when the observations $X_1, \dots, X_n$ are in reality a sample from the mixed distribution $p_{G_0}$, for some given probability distribution $G_0$.
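For concreteness, the generative scheme (i)-(iv) is easy to simulate. The sketch below is an illustration of ours, not part of the original text: it truncates the stick-breaking representation of the Dirichlet process at a fixed number of atoms, and the choices of base measure (uniform on $[-a, a]$, total mass 1) and truncation level are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dp_stick_breaking(total_mass, base_sampler, trunc=500):
    """One draw G ~ DP(alpha) via truncated stick-breaking:
    returns the atoms and weights of a discrete probability measure."""
    betas = rng.beta(1.0, total_mass, size=trunc)
    weights = betas * np.concatenate(([1.0], np.cumprod(1.0 - betas)[:-1]))
    return base_sampler(trunc), weights / weights.sum()  # absorb truncation error

a, n = 2.0, 1000
atoms, weights = sample_dp_stick_breaking(
    total_mass=1.0, base_sampler=lambda m: rng.uniform(-a, a, size=m))

Z = rng.choice(atoms, size=n, p=weights)     # step (ii): Z_i iid from G
e = rng.laplace(loc=0.0, scale=1.0, size=n)  # step (iii): errors with f(x) = e^{-|x|}/2
X = Z + e                                    # step (iv): a sample from p_G = f * G
```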

The main contribution of this chapter concerns the case where $f$ is the Laplace density $f(x) = e^{-|x|}/2$. For distributions on the full line, Laplace mixtures appear to be the second most popular class after mixtures of the normal distribution, with applications in, for instance, speech recognition and astronomy ([42]) and clustering problems in genetics ([7]). For the present theoretical investigation the Laplace kernel is interesting as a test case of a non-supersmooth kernel.

We consider two notions of recovery. The first measures the distance between the posterior of $G$ and $G_0$ through the Wasserstein metric
\[
W_k(G, G') = \inf_{\gamma \in \Gamma(G, G')} \Big( \int |x - y|^k \, d\gamma(x, y) \Big)^{1/k},
\]


where $\Gamma(G, G')$ is the collection of all couplings $\gamma$ of $G$ and $G'$ into a bivariate measure with marginals $G$ and $G'$ (i.e. if $(x, y) \sim \gamma$, then $x \sim G$ and $y \sim G'$), and $k \ge 1$. The Wasserstein metric is a classical metric on probability distributions, well suited for obtaining rates of estimation of measures. It is weaker than the total variation distance (which is more natural as a distance on densities), can be interpreted through transportation of measure (see [76]), and has also been used in applications such as comparing the color histograms of digital images. Recovery of the posterior distribution relative to the Wasserstein metric was considered by [61], within a general mixing framework. We refer to that paper for further motivation of the Wasserstein metric for mixtures, and to [76] for general background on the Wasserstein metric. In this chapter we improve the upper bound on posterior contraction rates given in [61], at least in the case of Laplace mixtures, obtaining a rate of nearly $n^{-1/8}$ for $W_1$ (and slower rates for $k > 1$). The minimax rate of contraction for Laplace mixtures relative to the Wasserstein metric appears to be currently unknown; recent work on recovery of a mixing distribution by non-Bayesian methods is given in [80]. It is not clear from our result whether the upper bound $n^{-1/8}$ is sharp.
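On the real line the infimum over couplings is attained by the quantile coupling, which makes $W_k$ between discrete measures directly computable. The following sketch is our own illustration (the function name and example measures are ours), implementing $W_k^k = \int_0^1 |F^{-1}(u) - H^{-1}(u)|^k \, du$ for the distribution functions $F$ and $H$ of two discrete measures.

```python
import numpy as np

def wasserstein_k(atoms1, w1, atoms2, w2, k=1.0):
    """W_k between two discrete probability measures on the line,
    via the optimal quantile coupling."""
    i1, i2 = np.argsort(atoms1), np.argsort(atoms2)
    x1, c1 = np.asarray(atoms1, float)[i1], np.cumsum(np.asarray(w1, float)[i1])
    x2, c2 = np.asarray(atoms2, float)[i2], np.cumsum(np.asarray(w2, float)[i2])
    # Quantile levels where either quantile function can jump.
    levels = np.unique(np.concatenate([c1, c2]))
    du = np.diff(np.concatenate([[0.0], levels]))
    # On each interval between levels both quantile functions are constant.
    q1 = x1[np.searchsorted(c1, levels - 1e-12)]
    q2 = x2[np.searchsorted(c2, levels - 1e-12)]
    return np.sum(du * np.abs(q1 - q2) ** k) ** (1.0 / k)

# Two three-point mixing distributions on [-1, 1]:
print(wasserstein_k([-1.0, 0.0, 1.0], [0.3, 0.4, 0.3],
                    [-0.8, 0.1, 0.9], [0.3, 0.3, 0.4], k=1))  # 0.21
```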

The second notion of recovery measures the distance of the posterior of $G$ to $G_0$ indirectly, through the Hellinger or $L_q$-distances between the mixed densities $p_G$ and $p_{G_0}$. This is equivalent to studying the estimation of the true density $p_{G_0}$ of the observations through the density $p_G$ under the posterior distribution. As the Laplace kernel $f$ has Fourier transform
\[
\tilde f(\lambda) = \frac{1}{1 + \lambda^2},
\]
the mixed densities $p_G$ have Fourier transforms satisfying
\[
|\tilde p_G(\lambda)| \le \frac{1}{1 + \lambda^2}.
\]
Estimation of a density with a polynomially decaying Fourier transform was first considered in [77]. According to the theorem in their Section 3A, a suitable kernel estimator possesses a root mean square error of order $n^{-3/8}$ with respect to the $L_2$-norm for estimating a density whose Fourier transform decays exactly at the order 2. This is the usual rate $n^{-\alpha/(2\alpha+1)}$ of nonparametric estimation for smoothness $\alpha = 3/2$. This is understandable, as $|\tilde p(\lambda)| \lesssim 1/(1 + |\lambda|^2)$ implies that $\int (1 + |\lambda|^2)^\alpha |\tilde p(\lambda)|^2 \, d\lambda < \infty$ for every $\alpha < 3/2$, so that a density with Fourier transform decaying at the square rate belongs to every Sobolev class of regularity $\alpha < 3/2$. Indeed, in [34] the rate $n^{-\alpha/(2\alpha+1)}$ is shown to be minimax for estimating a density in a Sobolev ball of functions on the line. In this chapter we show that the posterior distribution of Laplace mixtures $p_G$ contracts to $p_{G_0}$ at the rate $n^{-3/8}$, up to a logarithmic factor, relative to the $L_2$-norm and the Hellinger distance, and we also establish rates for other $L_q$-metrics. Thus the Dirichlet posterior (nearly) attains the minimax rate for estimating a density in a Sobolev ball of order $3/2$. It may be noted that the Laplace density itself is Hölder of exactly order 1, which implies that Laplace mixtures are Hölder smooth of at least the same order. This would suggest a rate $n^{-1/3}$ (the usual nonparametric rate for $\alpha = 1$), which is slower than $n^{-3/8}$; hence the Hölder heuristic is misleading here.

Besides recovery relative to the Wasserstein metric and the induced metrics on $p_G$, one might consider recovery relative to a metric on the distribution function of $G$. Frequentist recovery rates for this problem were obtained in [27] under some restrictions. There is no simple relation between these rates and the rates for the other metrics. The same is true for the rates for deconvolution of densities, as in [27]. In fact, the Dirichlet prior and posterior considered here are well known to concentrate on discrete distributions, and hence are useless as priors for recovering a density of $G$.

Contraction rates for Dirichlet mixtures of the normal kernel were considered in [31, 33, 44, 71, 72]. The results in these papers are driven by the smoothness of the Gaussian kernel, whence the same approach fails for the Laplace kernel. Nevertheless we borrow the idea of approximating the true mixed density by a finite mixture, albeit that the approximation is constructed in a different manner. Because more support points than in the Gaussian case are needed to obtain a given quality of approximation, higher entropy and lower prior mass concentration result, leading to a slower rate of posterior contraction. To obtain the contraction rate for the Wasserstein metrics we further derive a relationship between these metrics and a power of the Hellinger distance, and then apply a variant of the contraction theorem in [30], whose proof is included in the appendix of the dissertation. Contraction rates of mixtures with priors other than the Dirichlet were considered in [71]. Recovery of the mixing distribution is a deconvolution problem and as such can be considered an inverse problem. A general approach to posterior contraction rates in inverse problems can be found in [41], and results specific to deconvolution can be found in [24]. These authors are interested in deconvolving a (smooth) mixing density rather than a mixing distribution, and hence their results are not directly comparable to the results in this dissertation.

The papers [28, 49] consider recovery of a mixing density relative to the $L_p$-norm in the frequentist setting. If the smoothness of the mixing density degenerates to 0, then the minimax rate decreases to a constant and it is not possible to find a consistent estimator. In this chapter we show that in the same problem, but viewed as a deconvolution problem on distributions endowed with the weaker Wasserstein distance, we may obtain polynomial rates for the mixing distribution without any smoothness assumption on the distribution. In particular, for any mixing distribution it is possible to construct a consistent estimator.

The chapter is organized as follows. In the next section we give notation and preliminaries. We state in Section 1.3 the main results of the chapter, which are proved in the subsequent sections. In Section 1.4 we establish suitable finite approximations relative to the $L_q$- and Hellinger distances. The $L_q$-approximations also apply to kernels other than the Laplace kernel, and are given in terms of the tail decay of the kernel's characteristic function. In Sections 1.5 and 1.6 we apply these approximations to obtain bounds on the entropy of the mixtures relative to the $L_q$, Hellinger and Wasserstein metrics, and a lower bound on the prior mass in a neighbourhood of the true density. Sections 1.7 and 1.8 contain the proofs of the main results.

1.2 notation and preliminaries

Throughout the chapter, integrals given without limits are integrals over the real line $\mathbb{R}$. The $L_q$-norm is denoted by
\[
\|g\|_q = \Big( \int |g(x)|^q \, dx \Big)^{1/q},
\]
with $\|\cdot\|_\infty$ the uniform norm. The Hellinger distance on the space of densities is given by
\[
h(f, g) = \Big( \int \big( f^{1/2}(x) - g^{1/2}(x) \big)^2 \, dx \Big)^{1/2}.
\]
It is easy to see that $h^2(f, g) \le \|f - g\|_1 \le 2 h(f, g)$, for any two probability densities $f$ and $g$. Furthermore, if the densities $f$ and $g$ are uniformly bounded by a constant $M$, then $\|f - g\|_2 \le 2\sqrt{M}\, h(f, g)$.
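These inequalities are easy to check numerically. The following sketch is our own illustration: it evaluates both sides on a grid for two Laplace densities at different locations.

```python
import numpy as np

x = np.linspace(-30, 30, 60001)   # grid wide enough for the tails
dx = x[1] - x[0]

def laplace(t, mu=0.0):
    return 0.5 * np.exp(-np.abs(t - mu))   # f(x) = e^{-|x - mu|}/2

f, g = laplace(x), laplace(x, mu=1.5)
h = np.sqrt(np.sum((np.sqrt(f) - np.sqrt(g)) ** 2) * dx)   # Hellinger distance
l1 = np.sum(np.abs(f - g)) * dx                            # L1-distance

print(h ** 2, l1, 2 * h)   # h^2 <= ||f - g||_1 <= 2h holds
```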

The Kullback-Leibler discrepancy and corresponding variance are denoted by
\[
K(p_0, p) = \int \log(p_0/p) \, dP_0, \qquad K_2(p_0, p) = \int \big( \log(p_0/p) \big)^2 \, dP_0,
\]
with $P_0$ the measure corresponding to the density $p_0$.

We are primarily interested in the Laplace kernel, but a number of results hold for general kernels $f$. The Fourier transform of a function $f$ and the inverse Fourier transform of a function $\tilde f$ are given by
\[
\tilde f(\lambda) = \int e^{\imath \lambda x} f(x) \, dx, \qquad f(x) = \frac{1}{2\pi} \int e^{-\imath \lambda x} \tilde f(\lambda) \, d\lambda.
\]
For $\frac1p + \frac1q = 1$ and $1 \le p \le 2$, the Hausdorff-Young inequality gives
\[
\|f\|_q \le (2\pi)^{-1/p} \|\tilde f\|_p.
\]

The covering number $N(\varepsilon, \Theta, \rho)$ of a metric space $(\Theta, \rho)$ is the minimum number of $\varepsilon$-balls needed to cover the entire space $\Theta$.

Throughout the chapter, $\lesssim$ denotes inequality up to a constant multiple, where the constant is universal or fixed within the context. Furthermore, $a_n \asymp b_n$ means that for some positive constants $c$ and $C$,
\[
c \le \liminf_{n \to \infty} a_n/b_n \le \limsup_{n \to \infty} a_n/b_n \le C.
\]

We denote by $\mathcal{M}[-a, a]$ the set of all probability measures on a given interval $[-a, a]$.

1.3 main results

Write $\Pi_n(\cdot \mid X_1, \dots, X_n)$ for the posterior distribution of $G$ in the scheme (i)-(iv) introduced at the beginning of the chapter. We study this random distribution assuming that $X_1, \dots, X_n$ are an iid sample from the mixture density $p_{G_0} = f * G_0$, for a given probability distribution $G_0$. We assume that $G_0$ is supported in a compact interval $[-a, a]$, and that the base measure $\alpha$ of the Dirichlet prior in (i) is concentrated on this interval with a Lebesgue density bounded away from 0 and $\infty$.

Theorem 1.1. If $G_0$ is supported on $[-a, a]$, $f$ is the Laplace kernel, and $\alpha$ has support $[-a, a]$ with a Lebesgue density bounded away from 0 and $\infty$, then for every $k \ge 1$ there exists a constant $M$ such that
\[
\Pi_n\big( G : W_k(G, G_0) \ge M n^{-3/(8k+16)} (\log n)^{(k+7/8)/(k+2)} \mid X_1, \dots, X_n \big) \to 0, \tag{1.1}
\]
in $P_{G_0}$-probability.

The rate for the Wasserstein metric $W_k$ given in the theorem deteriorates with increasing $k$, which is perhaps not unreasonable, as the Wasserstein metrics increase with $k$. The fastest rate is obtained for $W_1$, namely $n^{-1/8} (\log n)^{5/8}$.

Theorem 1.2. If $G_0$ is supported on $[-a, a]$, $f$ is the Laplace kernel, and $\alpha$ has support $[-a, a]$ with a Lebesgue density bounded away from 0 and $\infty$, then there exists a constant $M$ such that
\[
\Pi_n\big( G : h(p_G, p_{G_0}) \ge M (\log n / n)^{3/8} \mid X_1, \dots, X_n \big) \to 0, \tag{1.2}
\]
in $P_{G_0}$-probability. Furthermore, for every $q \in [2, \infty)$ there exists $M_q$ such that
\[
\Pi_n\big( G : \|p_G - p_{G_0}\|_q \ge M_q (\log n / n)^{(q+1)/(q(q+2))} \mid X_1, \dots, X_n \big) \to 0, \tag{1.3}
\]
in $P_{G_0}$-probability.

The rate for the $L_q$-distance given in (1.3) deteriorates with increasing $q$. For $q = 2$ it coincides with the rate $(\log n / n)^{3/8}$ for the Hellinger distance.
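As a quick check on the exponents (our verification, not part of the original text): for $k = 1$ in Theorem 1.1,
\[
\frac{3}{8k+16}\Big|_{k=1} = \frac{3}{24} = \frac18, \qquad \frac{k + 7/8}{k + 2}\Big|_{k=1} = \frac{15/8}{3} = \frac58,
\]
recovering the $W_1$-rate $n^{-1/8} (\log n)^{5/8}$, while for $q = 2$ in Theorem 1.2,
\[
\frac{q+1}{q(q+2)}\Big|_{q=2} = \frac{3}{8},
\]
matching the Hellinger rate $(\log n / n)^{3/8}$.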

In both theorems the mixing distributions are assumed to be supported on a fixed compact set. Without a restriction on the tails of the mixing distributions, no rate is possible. The assumption of compact support ensures that the rate is fully determined by the complexity of the mixtures, and not by their tail behaviour.

1.4 finite approximation

In this section we show that a general mixture $p_G$ can be approximated by a mixture with finitely many components, where the number of components depends on the accuracy of the approximation, the distance used, and the kernel $f$. We first consider approximations with respect to the $L_q$-norm, which apply to mixtures $p_G = f * G$ for a general kernel $f$, and next approximations with respect to the Hellinger distance for the case that $f$ is the Laplace kernel. The first result generalizes the result of [31] for normal mixtures. See also [71] for results on Dirichlet mixtures of exponential power densities.

The result splits into two cases, depending on the tail behaviour of the Fourier transform $\tilde f$ of $f$:

- ordinary smooth $f$: $\limsup_{|\lambda| \to \infty} |\tilde f(\lambda)| \, |\lambda|^\beta < \infty$, for some $\beta > 1/2$;

- supersmooth $f$: $\limsup_{|\lambda| \to \infty} |\tilde f(\lambda)| \, e^{|\lambda|^\beta} < \infty$, for some $\beta > 0$.

Lemma 1.3 (Approximation Lemma). Let $\varepsilon < 1$ be sufficiently small and fixed. For a probability measure $G$ on an interval $[-a, a]$ and $2 \le q \le \infty$, there exists a discrete measure $G'$ with at most $N$ support points in $[-a, a]$ such that
\[
\|p_G - p_{G'}\|_q \lesssim \varepsilon,
\]
where

(i) $N \lesssim \varepsilon^{-(\beta - p^{-1})^{-1}}$ if $f$ is ordinary smooth of order $\beta$ with $\beta > p^{-1}$, for $p$ and $q$ conjugate ($p^{-1} + q^{-1} = 1$);

(ii) $N \lesssim (\log \varepsilon^{-1})^{\max(1, \beta^{-1})}$ if $f$ is supersmooth of order $\beta$.

Proof. The Fourier transform of $p_G$ is given by $\tilde f \tilde G$, where $\tilde G$ is the Fourier transform of $G$, defined by $\tilde G(\lambda) = \int e^{\imath \lambda z} \, dG(z)$. Determine $G'$ so that it possesses the same moments as $G$ up to order $k - 1$, i.e.
\[
\int z^j \, d(G - G')(z) = 0, \qquad \forall\, 0 \le j \le k - 1.
\]
By Lemma A.1 in [31], $G'$ can be chosen to have at most $k$ support points.

Then, for $G$ and $G'$ supported on $[-a, a]$, we have
\[
\big| \tilde G(\lambda) - \tilde G'(\lambda) \big|
= \Big| \int \Big( e^{\imath \lambda z} - \sum_{j=0}^{k-1} \frac{(\imath \lambda z)^j}{j!} \Big) \, d(G - G')(z) \Big|
\le \int \frac{|\imath \lambda z|^k}{k!} \, d(G + G')(z)
\le \Big( \frac{a e |\lambda|}{k} \Big)^k.
\]
The inequality uses that $\big| e^{\imath y} - \sum_{j=0}^{k-1} (\imath y)^j / j! \big| \le |y|^k / k! \le (e|y|)^k / k^k$, for every $y \in \mathbb{R}$.

Therefore, by the Hausdorff-Young inequality,
\[
\|p_G - p_{G'}\|_q^p \le \frac{1}{2\pi} \int |\tilde f(\lambda)|^p \, \big| \tilde G(\lambda) - \tilde G'(\lambda) \big|^p \, d\lambda
\lesssim \int_{|\lambda| > M} |\tilde f(\lambda)|^p \, d\lambda + \int_{|\lambda| \le M} \Big( \frac{e a |\lambda|}{k} \Big)^{pk} d\lambda.
\]

We denote the first term in the preceding display by $I_1$ and the second term by $I_2$. It is easy to bound $I_2$ as
\[
I_2 \asymp \Big( \frac{e a}{k} \Big)^{kp} \frac{M^{kp+1}}{kp + 1} \lesssim \Big( \frac{e a M}{k} \Big)^{kp+1} \frac{1}{p}.
\]
For $I_1$ we consider separately the cases of ordinary smoothness and supersmoothness.


In the supersmooth case with parameter $\beta$, we note that the function $t^{\beta^{-1} - 1} / e^{\delta t}$ is monotonically decreasing for $t \ge p M^\beta$ when $\delta \ge (\beta^{-1} - 1)/(p M^\beta)$. Thus, for large $M$,
\[
I_1 \lesssim \int_{|\lambda| > M} e^{-p |\lambda|^\beta} \, d\lambda
= \frac{2}{\beta p^{\beta^{-1}}} \int_{t > p M^\beta} e^{-t} \, t^{\beta^{-1} - 1} \, dt
\le \frac{2}{\beta p^{\beta^{-1}}} \, \frac{(p M^\beta)^{\beta^{-1} - 1}}{e^{\delta p M^\beta}} \int_{t > p M^\beta} e^{-(1 - \delta) t} \, dt
= \frac{2}{1 - \delta} \, \frac{1}{\beta p} \, e^{-p M^\beta} M^{1 - \beta},
\]
where the bound is sharper if $\delta$ is smaller. Choosing the minimal value of $\delta$, we obtain
\[
I_1 \lesssim \frac{1}{1 - (\beta^{-1} - 1)/(p M^\beta)} \, \frac{1}{\beta p} \, e^{-p M^\beta} M^{1 - \beta} \lesssim M^{1 - \beta} e^{-p M^\beta},
\]
for $M$ sufficiently large. We next choose $M = 2 (\log(1/\varepsilon))^{1/\beta}$ in order to ensure that $I_1 \le \varepsilon^p$. Then $I_2 \lesssim \varepsilon^p$ if $k \ge 2 e a M$ and $2^{-kp} \le \varepsilon^p$. This is satisfied for $k = 2 (\log \varepsilon^{-1})^{\max(\beta^{-1}, 1)}$.

In the ordinary smooth case with smoothness parameter $\beta$, we have the bound
\[
I_1 \lesssim \int_{|\lambda| > M} |\lambda|^{-\beta p} \, d\lambda \lesssim \Big( \frac{1}{M} \Big)^{\beta p - 1}.
\]
We choose $M = (1/\varepsilon)^{(\beta - 1/p)^{-1}}$ to render the right side equal to $\varepsilon^p$. Then $I_2 \lesssim \varepsilon^p$ if $k = 2 \varepsilon^{-(\beta - 1/p)^{-1}}$.
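To see the lemma in action, take $G$ uniform on $[-a, a]$: the nodes and weights of an $N$-point Gauss-Legendre quadrature rule form a discrete $G'$ matching the moments of $G$ up to order $2N - 1$, exactly the kind of moment matching used in the proof. The sketch below is our own illustration (the grid-based $L_2$-norm and all numerical choices are ours).

```python
import numpy as np

a = 1.0
x = np.linspace(-25, 25, 20001)   # grid for the L2-norm
dx = x[1] - x[0]

def laplace(t):
    return 0.5 * np.exp(-np.abs(t))

def p_uniform(x, a):
    """Closed form of p_G = f * G for G = Uniform[-a, a] and Laplace f."""
    inside = (2.0 - 2.0 * np.exp(-a) * np.cosh(x)) / (4.0 * a)
    outside = np.exp(-np.abs(x)) * (np.exp(a) - np.exp(-a)) / (4.0 * a)
    return np.where(np.abs(x) <= a, inside, outside)

p_G = p_uniform(x, a)
for N in (2, 4, 8):
    nodes, wts = np.polynomial.legendre.leggauss(N)
    nodes, wts = a * nodes, wts / 2.0          # rescale to [-a, a]; weights sum to 1
    p_Gp = laplace(x[:, None] - nodes[None, :]) @ wts
    print(N, np.sqrt(np.sum((p_G - p_Gp) ** 2) * dx))   # L2-error shrinks with N
```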

The number of support points in the preceding lemma is increasing in $q$ and decreasing in $\beta$. For approximation in the $L_2$-norm ($q = 2$), the number of support points is of order $\varepsilon^{-1/(\beta - 1/2)}$, which reduces to $\varepsilon^{-2/3}$ for the Laplace kernel (ordinary smooth with $\beta = 2$). An interpretation of the exponent $\beta - 1/2$ is as the (almost) Sobolev smoothness of $p_G$, since, for $\alpha < \beta - 1/2$,
\[
\int (1 + |\lambda|^2)^\alpha |\tilde p_G(\lambda)|^2 \, d\lambda \lesssim \int (1 + |\lambda|^2)^\alpha |\tilde f(\lambda)|^2 \, d\lambda < \infty.
\]
We do not have a compelling intuition for this correspondence.

The Hellinger distance is more sensitive to areas where the densities are close to zero. As a consequence, the approach of the preceding lemma does not give sharp results for it. The following lemma does, but it is restricted to the Laplace kernel.

Lemma 1.4. For a probability measure $G$ supported on $[-a, a]$ there exists a discrete measure $G'$ with at most $N \asymp \varepsilon^{-2/3}$ support points such that, for $p_G = f * G$ and $f$ the Laplace density,
\[
h(p_G, p_{G'}) \le \varepsilon.
\]

Proof. Since $p_G(x) \ge f(|x| + a) = e^{-a} e^{-|x|}/2$, for every $x$ and every probability measure $G$ supported on $[-a, a]$, the Hellinger distance between Laplace mixtures satisfies
\[
h^2(p_G, p_{G'}) \le \int \frac{(p_G - p_{G'})^2}{p_G + p_{G'}}(x) \, dx \le e^a \int \big( p_G(x) - p_{G'}(x) \big)^2 e^{|x|} \, dx.
\]

If we write $q_G(x) = p_G(x) e^{|x|/2}$, and $\tilde q_G$ for the corresponding Fourier transform, then by Plancherel's theorem the integral on the right side is equal to
\[
\frac{1}{2\pi} \int |\tilde q_G - \tilde q_{G'}|^2(\lambda) \, d\lambda.
\]

By an explicit computation we obtain
\[
\tilde q_G(\lambda) = \frac12 \int\!\!\int e^{\imath \lambda x} e^{-|x - z| + |x|/2} \, dx \, dG(z) = \frac12 \int r(\lambda, z) \, dG(z),
\]
where $r(\lambda, z)$ is given by
\[
r(\lambda, z) = \frac{e^{-z}}{\imath \lambda + 1/2} + e^{-z} \, \frac{e^{(\imath \lambda + 3/2) z} - 1}{\imath \lambda + 3/2} - \frac{e^{(\imath \lambda + 1/2) z}}{\imath \lambda - 1/2}
= \frac{e^{-z}}{(\imath \lambda + 1/2)(\imath \lambda + 3/2)} - \frac{2 e^{\imath \lambda z} e^{z/2}}{(\imath \lambda + 3/2)(\imath \lambda - 1/2)}. \tag{1.4}
\]
Now let $G'$ be a discrete measure on $[-a, a]$ such that
\[
\int e^{-z} \, d(G' - G)(z) = 0, \qquad \int e^{z/2} z^j \, d(G' - G)(z) = 0, \quad \forall\, 0 \le j \le k - 1.
\]
By Lemma A.1 in [31], $G'$ can be chosen to have at most $k + 1$ support points.

By the choice of $G'$, the first term of $r(\lambda, z)$ gives no contribution to the difference $\int r(\lambda, z) \, d(G' - G)(z)$. As the second term of $r(\lambda, z)$ is, for large $|\lambda|$, bounded in absolute value by a multiple of $|\lambda|^{-2}$, it follows that
\[
I_2 := \int_{|\lambda| > M} \Big| \int r(\lambda, z) \, d(G' - G)(z) \Big|^2 \, d\lambda \lesssim \int_{\lambda > M} \lambda^{-4} \, d\lambda \asymp M^{-3}.
\]


Again by the choice of $G'$, the integral $\int r(\lambda, z) \, d(G' - G)(z)$ remains the same if we replace $e^{\imath \lambda z}$ by $e^{\imath \lambda z} - \sum_{j=0}^{k} (\imath \lambda z)^j / j!$ in the second term of $r(\lambda, z)$. It follows that
\[
I_1 := \int_{|\lambda| \le M} \Big| \int r(\lambda, z) \, d(G' - G)(z) \Big|^2 \, d\lambda
\le \int_{|\lambda| \le M} \Big| \frac{2}{(\imath \lambda + 1/2)(\imath \lambda + 3/2)} \Big|^2 \, \Big| \int e^{z/2} \Big[ e^{\imath \lambda z} - \sum_{j=0}^{k} \frac{(\imath \lambda z)^j}{j!} \Big] \, d(G' - G)(z) \Big|^2 \, d\lambda
\lesssim \int_0^M \frac{(a \lambda)^{2k}}{(k!)^2} \, d\lambda \lesssim \frac{(a e M)^{2k+1}}{k^{2k+1}}.
\]
It follows, by an argument similar to that in the proof of Lemma 1.3, that we can reduce both $I_1$ and $I_2$ to $\varepsilon^2$ by choosing $M \asymp \varepsilon^{-2/3}$ and $k = 2 a e M$.

1.5 entropy

We study the covering numbers of the class of mixtures $p_G = f * G$, where $G$ ranges over the collection $\mathcal{M}[-a, a]$ of all probability measures on $[-a, a]$. We present a bound for any $L_r$-norm and general kernels $f$, and a bound for the Hellinger distance that is specific to the Laplace kernel. Here $f^{(1)}$ denotes the first derivative of $f$.

Proposition 1.5. If both $\|f\|_r$ and $\|f^{(1)}\|_r$ are finite and $\tilde f$ has ordinary smoothness $\beta$, then, for $p_G = f * G$ and any $r \ge 2$,
\[
\log N\big( \varepsilon, \{p_G : G \in \mathcal{M}[-a, a]\}, \|\cdot\|_r \big) \lesssim \varepsilon^{-(\beta - 1 + 1/r)^{-1}} \log \varepsilon^{-1}. \tag{1.5}
\]

Proof. We construct an $\varepsilon$-net of $\mathcal{G}_a = \{p_G : G \in \mathcal{M}[-a, a]\}$ from the collection of all $p_G$ such that the mixing measure $G \in \mathcal{M}[-a, a]$ is discrete with at most $N \le D \varepsilon^{-(\beta - 1 + r^{-1})^{-1}}$ support points, for a suitable constant $D$.

In light of Lemma 1.3, the set of all mixtures $p_G$ with $G$ a discrete probability measure with $N \lesssim \varepsilon^{-(\beta - 1 + r^{-1})^{-1}}$ support points forms an $\varepsilon$-net over the set of all mixtures $p_G$ as in the lemma. It therefore suffices to construct an $\varepsilon$-net of the given cardinality over this set of discrete mixtures.

By Jensen's inequality and Fubini's theorem,
\[
\|f(\cdot - \theta) - f\|_r = \Big( \int \Big| \theta \int_0^1 f^{(1)}(x - \theta s) \, ds \Big|^r dx \Big)^{1/r} \le \|f^{(1)}\|_r \, \theta.
\]
Furthermore, for any probability vectors $p$ and $p'$ and locations $\theta_i$,
\[
\Big\| \sum_{i=1}^N p_i f(\cdot - \theta_i) - \sum_{i=1}^N p_i' f(\cdot - \theta_i) \Big\|_r \le \sum_{i=1}^N |p_i - p_i'| \, \|f(\cdot - \theta_i)\|_r = \|f\|_r \, \|p - p'\|_1.
\]
Combining these inequalities, we see that for two discrete probability measures $G = \sum_{i=1}^N p_i \delta_{\theta_i}$ and $G' = \sum_{i=1}^N p_i' \delta_{\theta_i'}$,
\[
\|p_G - p_{G'}\|_r \le \|f^{(1)}\|_r \max_i |\theta_i - \theta_i'| + \|f\|_r \, \|p - p'\|_1. \tag{1.6}
\]
Thus we can construct an $\varepsilon$-net over the discrete mixtures by relocating the support points $(\theta_i)_{i=1}^N$ to the nearest points $(\theta_i')_{i=1}^N$ in an $\varepsilon$-net on $[-a, a]$, and relocating the weight vector $p$ to the nearest point $p'$ in an $\varepsilon$-net for the $\ell_1$-norm over the $N$-dimensional $\ell_1$-unit simplex. This gives a set of at most
\[
\Big( \frac{2a}{\varepsilon} \Big)^N \Big( \frac{5}{\varepsilon} \Big)^N \sim \Big( \frac{10 a}{\varepsilon^2} \Big)^N
\]
measures $p_G$ (cf. Lemma A.4 of [33] for the entropy of the $\ell_1$-unit simplex). This gives the bound of the proposition.
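Taking logarithms makes the final counting step explicit (our elaboration): with $N \lesssim \varepsilon^{-(\beta - 1 + 1/r)^{-1}}$,
\[
\log N\big( \varepsilon, \mathcal{G}_a, \|\cdot\|_r \big) \lesssim N \log \frac{10 a}{\varepsilon^2} \lesssim \varepsilon^{-(\beta - 1 + 1/r)^{-1}} \log \varepsilon^{-1},
\]
which is exactly the bound (1.5).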

Proposition 1.6. For $f$ the Laplace kernel and $p_G = f * G$,
\[
\log N\big( \varepsilon, \{p_G : G \in \mathcal{M}[-a, a]\}, h \big) \lesssim \varepsilon^{-2/3} \log(\varepsilon^{-1}). \tag{1.7}
\]

Proof. Since the function $\sqrt f$ is absolutely continuous with derivative $x \mapsto -2^{-3/2} e^{-|x|/2} \operatorname{sgn}(x)$, we have by Jensen's inequality and Fubini's theorem that
\[
h^2\big( f, f(\cdot - \theta) \big) = \int \Big( \theta \int_0^1 -2^{-3/2} e^{-|x - \theta s|/2} \operatorname{sgn}(x - \theta s) \, ds \Big)^2 dx
\le \theta^2 \int_0^1 \int e^{-|x - \theta s|} \, dx \, ds = 2 \theta^2.
\]
It follows that $h(f, f(\cdot - \theta)) \lesssim \theta$.


By convexity of the map $(u, v) \mapsto (\sqrt u - \sqrt v)^2$, we have
\[
\Big( \sqrt{\sum_i p_i f(\cdot - \theta_i)} - \sqrt{\sum_i p_i f(\cdot - \theta_i')} \Big)^2 \le \sum_i p_i \Big[ \sqrt{f(\cdot - \theta_i)} - \sqrt{f(\cdot - \theta_i')} \Big]^2.
\]
By integrating this inequality we see that the densities $p_G$ and $p_{G'}$ with mixing distributions $G = \sum_{i=1}^N p_i \delta_{\theta_i}$ and $G' = \sum_{i=1}^N p_i \delta_{\theta_i'}$ satisfy
\[
h^2(p_G, p_{G'}) \lesssim \sum_i p_i |\theta_i - \theta_i'|^2 \le \|\theta - \theta'\|_\infty^2.
\]
For distributions $G = \sum_{i=1}^N p_i \delta_{\theta_i}$ and $G' = \sum_{i=1}^N p_i' \delta_{\theta_i}$ with the same support points but different weights, we have
\[
h^2(p_G, p_{G'}) \le \int \frac{\big( \sum_{i=1}^N (p_i - p_i') f(x - \theta_i) \big)^2}{\sum_{i=1}^N (p_i + p_i') f(x - \theta_i)} \, dx
\le \int \Big( \sum_{i=1}^N |p_i - p_i'| \Big)^2 \frac{f^2(|x| - a)}{2 f(|x| + a)} \, dx \lesssim \|p - p'\|_1^2.
\]
Therefore the bound follows by arguments similar to those in the proof of Proposition 1.5, where presently we use Lemma 1.4 to determine suitable finite approximations.

The map $G \mapsto p_G = f * G$ is one-to-one as soon as the characteristic function of $f$ is nowhere zero. Under this condition we can also view the Wasserstein distance on the mixing distributions as a distance on the mixtures. Obviously the covering numbers are then free of the kernel.

Proposition 1.7. For any $k \ge 1$ and any sufficiently small $\varepsilon > 0$,
\[
\log N\big( \varepsilon, \mathcal{M}[-a, a], W_k \big) \lesssim \varepsilon^{-1} \log \varepsilon^{-1}. \tag{1.8}
\]
The proposition is a consequence of Lemma 1.9 below, which applies to the set of all Borel probability measures on a general metric space $(\Theta, \rho)$ (cf. [61]).

Lemma 1.8. For any probability measure $G$ concentrated on countably many disjoint sets $\Theta_1, \Theta_2, \dots$ and probability measure $G'$ concentrated on disjoint sets $\Theta_1', \Theta_2', \dots$,
\[
W_k(G, G') \le \sup_i \sup_{\theta_i \in \Theta_i,\, \theta_i' \in \Theta_i'} \rho(\theta_i, \theta_i') + \operatorname{diam}(\Theta) \Big( \sum_i |G(\Theta_i) - G'(\Theta_i')| \Big)^{1/k}.
\]
In particular,
\[
W_k\Big( \sum_i p_i \delta_{\theta_i}, \sum_i p_i' \delta_{\theta_i'} \Big) \le \max_i \rho(\theta_i, \theta_i') + \operatorname{diam}(\Theta) \, \|p - p'\|_1^{1/k}.
\]

Proof. For $p_i = G(\Theta_i)$ and $p_i' = G'(\Theta_i')$, divide the interval $[0, \sum_i p_i \wedge p_i']$ into disjoint intervals $I_i$ of lengths $p_i \wedge p_i'$. We couple variables $\bar\theta$ and $\bar\theta'$ through an auxiliary uniform variable $U$. If $U \in I_i$, then generate $\bar\theta \sim G(\cdot \mid \Theta_i)$ and $\bar\theta' \sim G'(\cdot \mid \Theta_i')$. Divide the remaining interval $[\sum_i p_i \wedge p_i', 1]$ into intervals $J_i$ of lengths $p_i - p_i \wedge p_i'$ and, separately, into intervals $J_i'$ of lengths $p_i' - p_i \wedge p_i'$. If $U \in J_i$, then generate $\bar\theta \sim G(\cdot \mid \Theta_i)$, and if $U \in J_i'$, then generate $\bar\theta' \sim G'(\cdot \mid \Theta_i')$. Then $\bar\theta$ and $\bar\theta'$ have marginal distributions $G$ and $G'$, and
\[
\mathbb{E} \rho^k(\bar\theta, \bar\theta') \le \mathbb{E}\Big[ \rho^k(\bar\theta, \bar\theta') \, \mathbf{1}\Big\{ U \le \sum_i p_i \wedge p_i' \Big\} \Big] + \operatorname{diam}(\Theta)^k \, \mathbb{P}\Big( U > \sum_i p_i \wedge p_i' \Big).
\]
The first term is bounded by the $k$-th power of the first term of the lemma, while the probability in the second term equals $1 - \sum_i p_i \wedge p_i' = \sum_i |p_i - p_i'| / 2$.
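For $k = 1$ the bound of Lemma 1.8 is easy to compare with the exact Wasserstein distance. The sketch below is our own check (using SciPy's one-dimensional $W_1$ routine; the random example measures are ours).

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(1)
N, diam = 5, 2.0                                  # Theta = [-1, 1], diam(Theta) = 2
theta = np.sort(rng.uniform(-1, 1, N))
theta_p = theta + rng.uniform(-0.05, 0.05, N)     # perturbed support points
p, p_p = rng.dirichlet(np.ones(N)), rng.dirichlet(np.ones(N))

exact = wasserstein_distance(theta, theta_p, p, p_p)
bound = np.max(np.abs(theta - theta_p)) + diam * np.sum(np.abs(p - p_p))
print(exact, bound)    # the exact W_1 never exceeds the bound of Lemma 1.8
```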

Lemma 1.9. For the set $\mathcal{M}(\Theta)$ of all Borel probability measures on a metric space $(\Theta, \rho)$, any $k \ge 1$, and $0 < \varepsilon < \min\{2/3, \operatorname{diam}(\Theta)\}$,
\[
N\big( \varepsilon, \mathcal{M}(\Theta), W_k \big) \le \Big( \frac{4 \operatorname{diam}(\Theta)}{\varepsilon} \Big)^{k N(\varepsilon, \Theta, \rho)}.
\]

Proof. For a minimal $\varepsilon$-net over $\Theta$ of $N = N(\varepsilon, \Theta, \rho)$ points, let $\Theta = \cup_i \Theta_i$ be the partition obtained by assigning each $\theta$ to a closest point of the net. For any $G$, let $G_\varepsilon = \sum_i G(\Theta_i) \delta_{\theta_i}$, for arbitrary but fixed $\theta_i \in \Theta_i$. Since $W_k(G, G_\varepsilon) \le \varepsilon$ by Lemma 1.8, $N(2\varepsilon, \mathcal{M}(\Theta), W_k) \le N(\varepsilon, \mathcal{M}_\varepsilon, W_k)$ holds for $\mathcal{M}_\varepsilon$ the set of all $G_\varepsilon$. We next form the measures $G_{\varepsilon, p} = \sum_i p_i \delta_{\theta_i}$ for $(p_1, \dots, p_N)$ ranging over an $(\varepsilon / \operatorname{diam}(\Theta))^k$-net for the $\ell_1$-distance over the $N$-dimensional unit simplex. By Lemma 1.8, every $G_\varepsilon$ is within $W_k$-distance $\varepsilon$ of some $G_{\varepsilon, p}$. Therefore the proof is complete, because $N(\varepsilon, \mathcal{M}_\varepsilon, W_k)$ is bounded from above by the number of points $p$, which is bounded by $(4 \operatorname{diam}(\Theta)/\varepsilon)^{kN}$ (cf. Lemma A.4 in [31]).

1.6 prior mass

The main result of this section is the following proposition, which gives a lower bound on the prior mass that the prior (i)-(iv) puts in a neighbourhood of a mixture $p_{G_0}$.

Proposition 1.10. If $\Pi$ is the Dirichlet process $\mathrm{DP}(\alpha)$ with a base measure $\alpha$ that has a Lebesgue density bounded away from 0 and $\infty$ on its support $[-a, a]$, and $f$ is the Laplace kernel, then for every sufficiently small $\varepsilon > 0$ and every probability measure $G_0$ on $[-a, a]$,
\[
\log \Pi\big( G : K(p_G, p_{G_0}) \le \varepsilon^2, \, K_2(p_G, p_{G_0}) \le \varepsilon^2 \big) \gtrsim - \Big( \frac{1}{\varepsilon} \Big)^{2/3} \log\Big( \frac{1}{\varepsilon} \Big).
\]

Proof. By Lemma 1.4 there exists a discrete measure $G_1$ with $N \lesssim \varepsilon^{-2/3}$ support points such that $h(p_{G_0}, p_{G_1}) \le \varepsilon$. We may assume that the support points of $G_1$ are at least $2\varepsilon^2$-separated. If not, we take a maximal $2\varepsilon^2$-separated set in the support points of $G_1$, and replace $G_1$ by the discrete measure $G_1'$ obtained by relocating the masses of $G_1$ to the nearest points in the $2\varepsilon^2$-net. Then $h(p_{G_1}, p_{G_1'}) \lesssim \varepsilon^2$, as seen in the proof of Proposition 1.6.

Now by Lemmas 1.11 and 1.12, if $G_1 = \sum_{j=1}^N p_j \delta_{z_j}$, with the support points $z_j$ at least $2\varepsilon^2$-separated,
\[
\big\{ G : \max(K, K_2)(p_{G_0}, p_G) < d_1 \varepsilon^2 \big\}
\supset \big\{ G : h(p_{G_0}, p_G) \le 2\varepsilon \big\}
\supset \big\{ G : h(p_{G_1}, p_G) \le \varepsilon \big\}
\supset \big\{ G : \|p_G - p_{G_1}\|_1 \le d_2 \varepsilon^2 \big\}
\supset \Big\{ G : \sum_{j=1}^N \big| G[z_j - \varepsilon^2, z_j + \varepsilon^2] - p_j \big| \le \varepsilon^2 \Big\}.
\]

Since the base measure $\alpha$ has, by assumption, a density bounded away from zero and infinity on $[-a, a]$, Lemma A.2 of [31] gives
\[
\log \Pi\Big( G : \sum_{j=1}^N \big| G[z_j - \varepsilon^2, z_j + \varepsilon^2] - p_j \big| \le \varepsilon^2 \Big) \gtrsim -N \log \varepsilon^{-1}.
\]
The proposition follows upon combining the preceding displays.

Lemma 1.11. If $G' = \sum_{j=1}^N p_j \delta_{z_j}$ is a probability measure supported on points $z_1, \dots, z_N$ in $\mathbb{R}$ with $|z_j - z_k| > 2\varepsilon$ for $j \ne k$, then for any probability measure $G$ on $\mathbb{R}$ and any kernel $f$ with derivative $f^{(1)}$,
\[
\|p_G - p_{G'}\|_1 \le 2 \|f^{(1)}\|_1 \, \varepsilon + 2 \sum_{j=1}^N \big| G[z_j - \varepsilon, z_j + \varepsilon] - p_j \big|.
\]


Lemma 1.12. If $G$ and $G'$ are probability measures on $[-a, a]$, and $f$ is the Laplace kernel, then
\[
h^2(p_G, p_{G'}) \lesssim \|p_G - p_{G'}\|_2, \tag{1.9}
\]
\[
\max\big( K(p_G, p_{G'}), K_2(p_G, p_{G'}) \big) \lesssim h^2(p_G, p_{G'}). \tag{1.10}
\]

Proofs. The first lemma is a generalization of Lemma 4 in [33] from the normal to general kernels, and is proved in the same manner. We omit further details.

In view of the shape of the Laplace kernel, it is easy to see that, for $G$ compactly supported on $[-a, a]$,
\[
f(|x| + a) \le p_G(x) \le f(|x| - a).
\]
We bound the squared Hellinger distance as follows:
\[
h^2(p_G, p_{G'}) \le \int \frac{(p_G - p_{G'})^2}{p_G + p_{G'}} \, dx
\le \int_{|x| \le A} e^{A + a} (p_G - p_{G'})^2 \, dx + \int_{|x| > A} (p_G + p_{G'}) \, dx
\lesssim e^a \|p_G - p_{G'}\|_2^2 \, e^A + e^{-A}.
\]
By the elementary inequality $t + u/t \ge 2\sqrt{u}$, for $u, t > 0$, we obtain (1.9) upon choosing $A = \min\big( a, \log \|p_G - p_{G'}\|_2^{-1} - a/2 \big)$.

For the proof of the second assertion we first note that, if both $G$ and $G'$ are compactly supported on $[-a, a]$,
\[
\frac{p_G(x)}{p_{G'}(x)} \le \frac{f(|x| - a)}{f(|x| + a)} \le e^{2a}.
\]
Therefore $\|p_G / p_{G'}\|_\infty \le e^{2a}$, and (1.10) follows by Lemma 8 in [33].
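The envelope bounds used above follow from $|x| - a \le |x - z| \le |x| + a$ for $z \in [-a, a]$, and are easy to verify numerically. The sketch below is our own check (all numerical choices ours; the upper bound is read as $e^{-(|x| - a)}/2$, which coincides with $f(|x| - a)$ for $|x| \ge a$).

```python
import numpy as np

rng = np.random.default_rng(2)
a = 1.0
x = np.linspace(-10, 10, 2001)

atoms = rng.uniform(-a, a, 20)             # random discrete G on [-a, a]
w = rng.dirichlet(np.ones(20))
p_G = (0.5 * np.exp(-np.abs(x[:, None] - atoms[None, :]))) @ w

lower = 0.5 * np.exp(-(np.abs(x) + a))     # f(|x| + a)
upper = 0.5 * np.exp(-(np.abs(x) - a))     # e^{-(|x| - a)}/2
assert np.all(lower <= p_G + 1e-12) and np.all(p_G <= upper + 1e-12)
```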

1.7 proof of theorem 1.1

The basic theorem of [30] gives a posterior contraction rate in terms of a metric on densities that is bounded above by the Hellinger distance. In the present situation, for the proofs of Theorems 1.1 and 1.2, we would like to apply this result to a power smaller than one of the Wasserstein metric and a power smaller than one of the $L_q$-distance, respectively, neither of which is a metric.

Consider a general "discrepancy measure" $d$, i.e. a map $d : \mathcal{P} \times \mathcal{P} \to \mathbb{R}$ on the product of the set of densities on a given measurable space with itself, possessing the following properties, for some constant $C > 0$:

(a) $d(x, y) \ge 0$;

(b) $d(x, y) = 0$ if and only if $x = y$;

(c) $d(x, y) = d(y, x)$;

(d) $d(x, y) \le C \big( d(x, z) + d(y, z) \big)$.

Thus $d$ is a metric except that the triangle inequality is replaced by a weaker condition involving a constant $C$, possibly bigger than 1. Call a set of the form $\{x : d(x, y) < c\}$ a $d$-ball, and define covering numbers $N(\varepsilon, \mathcal{P}, d)$ relative to $d$ as usual.

Let $\Pi_n(\cdot \mid X_1, \dots, X_n)$ be the posterior distribution of $p$ given an i.i.d. sample $X_1, \dots, X_n$ from a density $p$ that is equipped with a prior probability distribution $\Pi$.

Theorem 1.13. Suppose $d$ has the properties given above, satisfies $d(p_0, p) \le h(p_0, p)$ for every $p \in \mathcal{P}$, and is such that the $d$-balls $\{p : d(p, p') < \delta\}$ are convex. Then $\Pi_n(d(p, p_0) > M \varepsilon_n \mid X_1, \dots, X_n) \to 0$ in $P_0^n$-probability, for any $\varepsilon_n$ such that $n \varepsilon_n^2 \to \infty$ and such that, for positive constants $c_1$, $c_2$ and sets $\mathcal{P}_n \subset \mathcal{P}$,
\[
\log N(\varepsilon_n, \mathcal{P}_n, d) \le c_1 n \varepsilon_n^2, \tag{1.11}
\]
\[
\Pi_n\big( p : K(p_0, p) < \varepsilon_n^2, \, K_2(p_0, p) < \varepsilon_n^2 \big) \ge e^{-c_2 n \varepsilon_n^2}, \tag{1.12}
\]
\[
\Pi_n(\mathcal{P} - \mathcal{P}_n) \le e^{-(c_2 + 4) n \varepsilon_n^2}. \tag{1.13}
\]
We defer the proof to Appendix A.
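To connect the theorem with the earlier bounds (our elaboration, not in the original text): Propositions 1.6 and 1.10 give entropy and prior-mass exponents of the same order, so conditions (1.11) and (1.12) balance when
\[
n \varepsilon_n^2 \asymp \varepsilon_n^{-2/3} \log \varepsilon_n^{-1} \quad \Longrightarrow \quad \varepsilon_n \asymp \Big( \frac{\log n}{n} \Big)^{3/8},
\]
which is the Hellinger rate of Theorem 1.2.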

The proof of Theorem 1.1 is based on the following comparison between the Wasserstein and Hellinger metrics. The lemma improves and generalizes Theorem 2 in [61]. We choose the constant $C_k$ carefully, to make sure that the map $\varepsilon \mapsto \varepsilon [\log(C_k/\varepsilon)]^{k+1/2}$ is monotone on $(0, 2]$.

Lemma 1.14. For probability measures $G$ and $G'$ supported on $[-a, a]$, and $p_G = f * G$ for a probability density $f$ with $\inf_\lambda (1 + |\lambda|^\beta) |\tilde f(\lambda)| > 0$, and any $k \ge 1$,
\[
W_k(G, G') \lesssim h(p_G, p_{G'})^{1/(k+\beta)} \Big( \log \frac{C_k}{h(p_G, p_{G'})} \Big)^{(k+1/2)/(k+\beta)}.
\]
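For the Laplace kernel ($\beta = 2$) and $k = 1$, the lemma specializes (our instantiation) to
\[
W_1(G, G') \lesssim h(p_G, p_{G'})^{1/3} \Big( \log \frac{C_1}{h(p_G, p_{G'})} \Big)^{1/2};
\]
combined with the Hellinger rate $(\log n / n)^{3/8}$ of Theorem 1.2, this yields the $W_1$-rate $n^{-1/8} (\log n)^{5/8}$ of Theorem 1.1.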

Proof. By Theorem 6.15 in [76] the Wasserstein distance $W_k(G, G')$ is bounded above by a multiple of the $k$-th root of $\int |x|^k \, d|G - G'|(x)$, where $|G - G'|$ is the total variation measure of the difference $G - G'$.
