
Nonparametric Density Estimation and

Monotone Rearrangement

Robben Riksen

July 16, 2014

Bachelor Thesis

Supervisor: dr. A.J. van Es

Korteweg-de Vries Institute for Mathematics


Abstract

In nonparametric density estimation, the method of kernel estimators is commonly used. However, if extra information about the monotonicity of the target density is available, kernel estimation does not take this into account and will in most cases not produce a monotone estimate. A recently introduced method in this field of research is monotone rearrangement, which uses this extra information to improve the density estimate and has some interesting properties. In this thesis, nonparametric density estimation by means of kernel estimators is discussed. The reflection method is introduced to improve kernel estimation if the target density equals zero on the negative real line and has a discontinuity at zero. In the case where information about the monotonicity of the target density is available, monotone rearrangement is used to monotonize the original estimate. The theoretical and asymptotic properties of the reflection method and monotone rearrangement are discussed briefly, after which simulations in Matlab are run to measure the performance, in terms of MISE, of both methods separately and applied together. It turns out that both the reflection method and monotone rearrangement greatly reduce the MISE of the kernel estimator, but when applied together, the MISE remains at the level of that of the reflection method alone.

Cover image: a plot of an exponential-1 density (dashed-dotted line), a kernel estimate based on a sample of size n = 100 (dashed line) and its rearranged estimate (solid line).

Title: Nonparametric Density Estimation and Monotone Rearrangement
Author: Robben Riksen, robben.riksen@student.uva.nl, 10188258
Supervisor: dr. A.J. van Es
Second grader: dr. I.B. van Vulpen
Date: July 16, 2014

Korteweg-de Vries Institute for Mathematics
University of Amsterdam
Science Park 904, 1098 XH Amsterdam
http://www.science.uva.nl/math


Contents

Introduction

1. Theory
   1.1. Kernel Estimation of Smooth Densities
        1.1.1. The Histogram Method
        1.1.2. Kernel Estimators
   1.2. Defining Errors
   1.3. Bandwidth Selection
   1.4. Decreasing Densities
        1.4.1. Reflection Method
   1.5. Monotone Rearrangement
        1.5.1. Introducing Monotone Rearrangement
        1.5.2. Properties of Monotone Rearrangement

2. Results
   2.1. Comparing Methods
   2.2. Simulations

3. Conclusion

4. Discussion and Review of the Progress

5. Popular Summary (Populaire Samenvatting)

Bibliography

A. Theorems and Proofs

B. MATLAB code
   B.1. Epanechnikov Kernel Estimator
   B.2. Gaussian Kernel Estimator
   B.3. Monotone Rearrangement
   B.4. Simulations

C. Samples
   C.1. Sample 1
   C.2. Sample 2


Introduction

When I returned from my semester abroad at the University of Edinburgh, I had made up my mind about the master's degree I wanted to pursue after graduating for my bachelor's degree: it had to be the master Stochastics and Financial Mathematics at the University of Amsterdam. For a longer time, my interest had gone out to statistics, so writing my bachelor thesis in this field of research was a logical step in the preparation for this master's programme. When approached, Dr. Van Es suggested a thesis about monotone rearrangement, a method of improving density estimates of monotone probability densities. Being such a recently developed method in this application, it sparked my interest and we decided it would be a suitable subject for a bachelor thesis.

A common challenge in statistics is the estimation of an unknown density function from a given data set. For example, this can be valuable to gain information about the occurrence of certain events in a population. Estimating the probability density from which such a data set is sampled can be done in a parametric and a nonparametric way. If we assume or know that a sample is drawn from, say, a N(µ, σ²) distribution, we will apply parametric density estimation to estimate µ and/or σ. There are many ways of doing this, but the produced estimate will be normally distributed and its accuracy can only be improved by varying the parameters. If the assumption was right, this parametric way of density estimation is satisfying enough. But if we do not know whether the density we want to estimate belongs to a known family of parameterized probability densities, the parametric approach obviously is not satisfying.

In nonparametric density estimation, these kinds of assumptions are not made. We only assume that the target density exists, and make no assumptions about a parametric family to which the target density might, or might not, belong. Again, there are many ways to obtain a density function from observed data in a nonparametric way, but the most common one is the method of kernel estimators, which we will discuss in this thesis.

The method of kernel estimation is designed to estimate smooth densities, but can also be applied to densities with discontinuities. We will take a closer look at densities that are equal to zero on the negative real line and therefore have a discontinuity at zero: the boundary. To improve the kernel estimator in this case, the commonly used reflection method is applied. This method, introduced by Schuster (1985) [1], 'reflects' the data set around zero, that is, it adds the negatives of the data points to the data set and applies the kernel estimation method to this new, bigger set of data points. This greatly improves the kernel estimate near the boundary.

Suppose now there is some extra information about the target density: we know the density we want to estimate is monotonically decreasing, or increasing.


The availability of such information occurs often. For example, in biometrics, age-height charts, charts in which the expected height for children is plotted against their age, should be monotonic in age. In econometrics, demand functions are monotonic in price (both examples from [2]). And there are many other examples in which a monotone density is expected. In these cases, kernel estimation still works fine. But the method of kernel estimators does not take this extra information into account, and will in most cases not produce a monotone estimate. Wanting to produce the best possible estimate from our data set, we would like to use all available information to minimize our estimation error.

To monotonize the density found by kernel estimation, the method of monotone rearrangement is introduced. This relatively simple method, essentially rearranging function values in decreasing or increasing order, first appeared in the context of variational analysis in 1952 in the book Inequalities by Hardy, Littlewood and Pólya [3], but was only very recently introduced in the context of density estimation. Monotone rearrangement, however, proved to have many interesting properties in this context. This was the motivation to look into this subject further in this thesis.

The goal of this thesis is to become familiar with the methods of kernel estimation and monotone rearrangement, and to investigate whether a combination of methods designed to improve kernel estimation improves the estimate of a target density even more. To this end, we will apply the reflection method and monotone rearrangement together to estimate a monotone probability density from a generated sample. This leads to the following research question: 'Does the combination of the reflection method and monotone rearrangement significantly improve the estimation of a monotonic density from a given data sample?' In an attempt to answer this question, these methods were implemented in Matlab and simulations were run with different sample sizes.

This thesis starts with an overview of the method of kernel estimators, introducing them in an intuitive way. Then, measures of how well an estimator estimates the target density are introduced and discussed, after which several problems with the original kernel estimators are pointed out for the case where they are used to estimate target densities with a discontinuity at zero. A solution is proposed in the form of the reflection method. After that, the method of monotone rearrangement is introduced and some of its properties are discussed. Then simulations are run to answer the main question of this thesis stated above. The results of the simulations are discussed and possible improvements to the simulation and estimation methods are proposed.


1. Theory

This chapter starts by introducing the concept of kernel estimation in an intuitive way. The reflection method [1] will be introduced as a method to improve kernel estimation for densities that have a bounded support. An example of such a density is the exponential density, which is zero for all points on the negative real axis and has a discontinuity at 0. Then monotone rearrangement will be introduced and applied, to take into account extra information about the monotonicity of the target density. Throughout this chapter, unless stated otherwise, the same data set of 100 points drawn from a N(µ, σ²) distribution, with µ = 2 and σ² = 1/2, will be used in the examples to illustrate the theory. The values of these data points can be found in Appendix C.1. Furthermore, in general descriptions and theory it will be assumed that a sample of n independent, identically distributed observations X_1, X_2, ..., X_n is given, for which the density f will be estimated.

1.1. Kernel Estimation of Smooth Densities

In this section we will introduce the method of kernel estimation, by starting with the histogram method as a density estimator. We then expand this idea to introduce kernel estimators and find a smooth estimate of the target density.

1.1.1. The Histogram Method

Probably the most intuitive way of representing the probability density from which a data set is sampled is by use of a histogram. To do this, the bin width h and starting point x_0 are chosen. The bins are defined as the half-open intervals [x_0 + (m − 1)h, x_0 + mh), for m ∈ Z. Then the histogram density approximation is defined as

\hat{f}(x) = \frac{1}{nh} \cdot \#\{X_i \text{ in the same bin as } x\}.

For the data set from Appendix C.1 this definition, with starting point x_0 = 0 and bin width h = 0.1708, results in the density \hat{f} depicted in figure 1.1.

Clearly, the histogram method for approximating the density is not at all satisfactory. The approximation is not a continuous function, but the major drawback is that the choice of the starting point x_0 has a big influence on the approximated density, as is shown in figure 1.2.


Figure 1.1.: Histogram for the data set from Appendix C.1, with starting point x_0 = 0 and bin width h = 0.1708.

Figure 1.2.: Histogram for the data set from Appendix C.1, with starting point x0 = −0.1


There, the starting point has been shifted to the left by 0.1 to x_0 = −0.1, while the bin width is kept the same. The bar that was separated from the rest of the bins in figure 1.1 has almost disappeared in figure 1.2.

If we now, instead of placing the data points in bins, place 'bins' around the data points, we get rid of the problem of the starting point. The only remaining choices then are the shape of the 'bin' and the bin width h. This idea leads to the definition of kernel estimators, which we will introduce below.

1.1.2. Kernel Estimators

A far more satisfactory way to estimate the probability density is given by a family of methods based on kernel estimators. First a kernel function K, a function that places weight on (and around) the observations, has to be introduced. This kernel function, or simply kernel, K has to integrate to 1 over the real line. Now the kernel estimator can be defined as follows.

Definition 1.1. Let X_1, X_2, ..., X_n be independent, identically distributed observations with density f. A kernel estimator \hat{f} of the target density f, with kernel K, is defined by

\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right),    (1.1)

where n is the number of data points and, from now on, h is called the bandwidth.

A kernel K is said to be of order m if

\int x^n K(x)\, dx = 0 \quad \text{for } 0 < n < m, \qquad \text{and} \qquad \int x^m K(x)\, dx \neq 0.    (1.2)

The most commonly used symmetric kernels are of order 2; kernels of higher order necessarily take negative values, so the resulting estimate can become negative in some areas, which is in most cases not desirable when estimating a probability density.

An example of a commonly used kernel function is the standard normal density. To visualize the working of a kernel estimator using the normal density as a kernel, imagine small normal densities placed over the data points X_i that add up to the estimated probability density. This is illustrated in figure 1.3, where we used the first 4 data points from sample 1 in Appendix C.1 and bandwidth h = 0.5 to construct the density estimate \hat{f}.
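A minimal Matlab sketch of the kernel estimator (1.1) with a standard normal kernel is given below (illustrative only; the estimators actually used in this thesis are listed in Appendix B).

function fhat = kernel_estimate(xgrid, X, h)
% Kernel density estimate (1.1) with a standard normal (Gaussian) kernel.
% xgrid: evaluation points, X: observations, h: bandwidth.
n = length(X);
fhat = zeros(size(xgrid));
for j = 1:length(xgrid)
    t = (xgrid(j) - X) / h;                            % scaled distances to all observations
    fhat(j) = sum(exp(-t.^2 / 2)) / (sqrt(2*pi) * n * h);
end
end

For instance, kernel_estimate(linspace(0, 4, 200), X, 0.5), with X containing the first four data points of sample 1, produces the kind of estimate shown in figure 1.3.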

Varying the bandwidth h has big effects on the density estimate \hat{f}. Choosing h too small will result in too much weight being given to single data points in the tails of the distribution, while choosing h too big will result in oversmoothing and might conceal structures in the target density. Therefore, the selected bandwidth largely determines the behaviour of the kernel estimator. This is illustrated by figure 1.4, where the bandwidth h = 0.3 is chosen. A non-existent, previously invisible structure around x = 3 has appeared because the bandwidth was chosen too small.


Figure 1.3.: Kernel estimator for the N (2, 0.5)-distribution from 4 data points with bandwidth h = 0.5, showing individual kernels.

Figure 1.4.: Kernel estimator for the N (2, 0.5)-distribution from 4 data points with bandwidth h = 0.3, showing individual kernels.


Another property of a kernel estimator that determines its accuracy is the choice of the kernel. A huge variation in the choice of the kernel used in kernel estimation is possible. A commonly used symmetric kernel is the Gaussian kernel we used in figure 1.3:

K(t) = \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2} t^2}.

Another commonly used kernel is the Epanechnikov kernel K_e:

K_e(t) = \begin{cases} \frac{3}{4\sqrt{5}} \left(1 - \frac{1}{5} t^2\right) & \text{if } -\sqrt{5} \le t \le \sqrt{5}, \\ 0 & \text{otherwise.} \end{cases}    (1.3)

This kernel, sometimes called the optimal kernel, was introduced by Epanechnikov (1969) [4] and is the most efficient kernel function in terms of minimizing the MISE, although the difference compared to other kernels is relatively small. Its drawback, on the other hand, is that it is not continuously differentiable. The kernels we just discussed are the same for all data points. Variable kernels can be introduced, in which the shape of the kernel depends on the position of the data point around which it is placed, or on its distance to other data points. Depending on the information known about the data set, this can greatly increase the accuracy of the estimate.
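For reference, the Epanechnikov kernel (1.3) translates directly into a small Matlab function (a sketch; the thesis' own Epanechnikov estimator is in Appendix B.1).

function y = epanechnikov(t)
% Epanechnikov kernel (1.3), with support [-sqrt(5), sqrt(5)].
y = (3 / (4*sqrt(5))) * (1 - t.^2 / 5) .* (abs(t) <= sqrt(5));
end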

1.2. Defining Errors

To compare different estimation methods and to draw conclusions about which method improves the estimation, it is necessary to define what measures of error will be used to compare different techniques. In the derivations in this section, the line of Density Estimation for Statistics and Data Analysis, by Silverman (1986) [5], will be followed.

For the error in the estimation in a single point, the ’mean square error’ is defined in the usual way.

Definition 1.2. The mean square error is given by

\mathrm{MSE}_x(\hat{f}) := E\left[\hat{f}(x) - f(x)\right]^2,    (1.4)

where \hat{f}(x) is the estimator of the density f.

It is easy to see, using the expression for the variance of \hat{f}, \mathrm{Var}\,\hat{f}(x) = E\hat{f}^2(x) - (E\hat{f}(x))^2, that the MSE can be written in the following way:

\mathrm{MSE}_x(\hat{f}) = \left(E\hat{f}(x) - f(x)\right)^2 + \mathrm{Var}\,\hat{f}(x).    (1.5)

We will now define the bias of ˆf .

Definition 1.3. Let \hat{f}(x) be the estimate of the density f. Then the bias of \hat{f} in the point x is defined as

b_{\hat{f}}(x) := E\hat{f}(x) - f(x).    (1.6)

So in a single point, the mean square error is given by a bias term, representing the systematic error of ˆf , and a variance term. To quantify the error on the whole domain of f , the ‘mean integrated square error’ is introduced.

Definition 1.4. The mean integrated square error is given by

\mathrm{MISE}(\hat{f}) := E \int \left[\hat{f}(x) - f(x)\right]^2 dx.    (1.7)

The MISE is a commonly used way of quantifying the error in an estimate. Now, because the integrand is non-negative, the expectation and the integral can be interchanged, and the MISE takes the following form [6].

\mathrm{MISE}(\hat{f}) = E \int \left[\hat{f}(x) - f(x)\right]^2 dx
 = \int \mathrm{MSE}_x(\hat{f})\, dx
 = \int \left[E\hat{f}(x) - f(x)\right]^2 dx + \int \mathrm{Var}\,\hat{f}(x)\, dx
 = \int b_{\hat{f}}^2(x)\, dx + \int \mathrm{Var}\,\hat{f}(x)\, dx.    (1.8)

Thus, the MISE is the sum of the integrated square bias and the integrated variance. If in this definition ˆf is taken as in definition 1.1, its expectation value is [7]

E\hat{f}(x) = \int \frac{1}{h} K\left(\frac{x - y}{h}\right) f(y)\, dy,    (1.9)

and its variance can be found as

\mathrm{Var}\,\hat{f}(x) = \mathrm{Var}\left[\frac{1}{n} \sum_{i=1}^{n} \frac{1}{h} K\left(\frac{x - X_i}{h}\right)\right]
 = \frac{1}{n^2} \sum_{i=1}^{n} \mathrm{Var}\left[\frac{1}{h} K\left(\frac{x - X_i}{h}\right)\right]
 = \frac{1}{n}\left( E\left[\frac{1}{h^2} K^2\left(\frac{x - X_i}{h}\right)\right] - \left( E\left[\frac{1}{h} K\left(\frac{x - X_i}{h}\right)\right] \right)^2 \right)
 = \frac{1}{n} \int \frac{1}{h^2} K^2\left(\frac{x - y}{h}\right) f(y)\, dy - \frac{1}{n}\left( \frac{1}{h} \int K\left(\frac{x - y}{h}\right) f(y)\, dy \right)^2.    (1.10)

Written in this form, it becomes clear that the bias term in the MISE does not directly depend on the sample size, thus this term will not be reduced just by taking larger samples.


From now on, it will be assumed that the unknown density has continuous derivatives of all required orders and that the kernel function K used to define \hat{f} is a symmetric function with the following properties:

\int K(x)\, dx = 1,    (1.11)
\int x K(x)\, dx = 0,    (1.12)
\int x^2 K(x)\, dx = k_2 \neq 0.    (1.13)

In other words, we are dealing with a symmetric kernel of order 2. These properties are met by most commonly used kernel functions, for in many cases the kernel K is a symmetric probability density with variance k_2. The kernel functions used in the simulations later on in this thesis will also satisfy these properties.

To calculate the MISE of an estimate, it is useful to derive a more applicable (approximate) form of the bias and the variance. For the rest of this section we will therefore take a heuristic approach. We can rewrite the bias in the following way:

b_{\hat{f}}(x) = E\hat{f}(x) - f(x) = \int \frac{1}{h} K\left(\frac{x - y}{h}\right) f(y)\, dy - f(x)
 \stackrel{*}{=} \int K(t) f(x - ht)\, dt - f(x)    (1.14)
 \stackrel{**}{=} \int K(t)\left[f(x - ht) - f(x)\right] dt,    (1.15)

where at ∗ a change of variable y = x−ht is applied and at ∗∗ it is used that K integrates to unity. If the Taylor expansion

f(x - ht) = f(x) - ht f'(x) + \frac{h^2 t^2}{2} f''(x) + \ldots    (1.16)

is substituted in expression (1.15), the following expression for the bias is found:

b_{\hat{f}}(x) = -h f'(x) \int t K(t)\, dt + \frac{1}{2} h^2 f''(x) \int t^2 K(t)\, dt + \ldots
 = \sum_{p=1}^{\infty} (-1)^p h^p \frac{f^{(p)}(x)}{p!} \int t^p K(t)\, dt
 = \sum_{p=1}^{\infty} (-1)^p h^p b^{(p)}(x),    (1.17)

where b^{(p)}(x) = \frac{f^{(p)}(x)}{p!} \int t^p K(t)\, dt. Using properties (1.12) and (1.13) of K, the bias can be approximated by

b_{\hat{f}}(x) = h^2 b^{(2)}(x) + O(h^4) \approx h^2 b^{(2)}(x).    (1.18)

So, the integrated square bias can be approximated by

\int b_{\hat{f}}^2(x)\, dx \approx \frac{1}{4} h^4 k_2^2 \int (f''(x))^2\, dx.    (1.19)

In a similar manner, an approximation of the variance of \hat{f} can be found. Substitution of expression (1.9) into (1.10) gives

\mathrm{Var}\,\hat{f}(x) = \frac{1}{n} \int \frac{1}{h^2} K^2\left(\frac{x - y}{h}\right) f(y)\, dy - \frac{1}{n}\left[f(x) + b_{\hat{f}}(x)\right]^2.

The substitution y = x − ht in the integral and approximation (1.18) for the bias lead to the following approximation for the variance:

\mathrm{Var}\,\hat{f}(x) \approx \frac{1}{nh} \int K^2(t) f(x - ht)\, dt - \frac{1}{n}\left[f(x) + O(h^2)\right]^2.

With the Taylor expansion (1.16) and the assumption that n is large and h is small, this can be rewritten as

\mathrm{Var}\,\hat{f}(x) \approx \frac{1}{nh} \int K^2(t)\left\{f(x) - ht f'(x) + \ldots\right\} dt + O(n^{-1})
 = \frac{1}{nh} f(x) \int K^2(t)\, dt + O(n^{-1})
 \approx \frac{1}{nh} f(x) \int K^2(t)\, dt.    (1.20)

Because f is a probability density, \int f(x)\, dx = 1, so the integral over the variance of \hat{f} can be approximated by

\int \mathrm{Var}\,\hat{f}(x)\, dx \approx \frac{1}{nh} \int K^2(t)\, dt.    (1.21)

This means we can approximate the MISE by [6]

\mathrm{MISE} = \int b_{\hat{f}}^2(x)\, dx + \int \mathrm{Var}\,\hat{f}(x)\, dx \approx \frac{1}{4} h^4 k_2^2 \int (f''(x))^2\, dx + \frac{1}{nh} \int K^2(t)\, dt.    (1.22)

The last expression in (1.22) is known as the Asymptotic Mean Integrated Square Error (AMISE); it is a useful approximation to the MISE when dealing with large samples and is much easier to calculate than the MISE in equation (1.8) [8].

It now becomes clear that decreasing the bandwidth h, whilst reducing the integrated square bias term in (1.8), increases the integrated variance term. This means that finding the optimal bandwidth will always be a trade-off between random and systematic error. This is known as the variance-bias trade-off. More about bandwidth selection will be said in section 1.3.
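The trade-off can be made explicit by minimizing the approximation (1.22) over h; this is a standard calculation and anticipates the optimal bandwidth stated in section 1.3. Setting the derivative with respect to h to zero gives

\frac{d}{dh}\,\mathrm{AMISE}(h) = h^3 k_2^2 \int (f''(x))^2\, dx - \frac{1}{nh^2} \int K^2(t)\, dt = 0,

so that

h^5 = \frac{\int K^2(t)\, dt}{n\, k_2^2 \int (f''(x))^2\, dx},

which is of order n^{-1/5} and agrees with the asymptotically optimal bandwidth (1.26) below.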

Besides the MISE as a measure of how well a density estimator approaches the target density, the asymptotic properties of the estimator are also important. Prakasa Rao (1983) [6] proved that there is no reasonable estimator \hat{f}_n(x) such that

E\left[\hat{f}_n(x)\right] = f(x),    (1.23)

so we are forced to look for asymptotically unbiased estimators. That is, a sequence of density estimators \hat{f}_n is asymptotically unbiased if, for every density f and every x,

\lim_{n \to \infty} E_f\left[\hat{f}_n(x)\right] = f(x).    (1.24)

We call a sequence of density estimators \hat{f}_n weakly consistent if

\hat{f}_n(x) \xrightarrow{p} f(x) \quad \text{as } n \to \infty.    (1.25)

Both definitions are valuable when looking for an appropriate density estimator.

1.3. Bandwidth Selection

As stated before, a good choice of the bandwidth h in a kernel estimator is very important, for the bandwidth largely determines the behaviour of the estimator. We will define the optimal bandwidth as the bandwidth which minimizes the AMISE (1.22). In this thesis we will not go into detail about different methods to determine the optimal bandwidth, but will just state the theoretical asymptotically optimal bandwidth.

Parzen (1962) [9] showed that the asymptotically optimal bandwidth for a twice continuously differentiable density f is equal to

h_n^{\mathrm{opt}} = k_2^{-2/5} \left(\int K(t)^2\, dt\right)^{1/5} \left(\int f''(x)^2\, dx\right)^{-1/5} n^{-1/5},    (1.26)

where k_2 is the same as in equation (1.13). In this thesis, however, we will be dealing with densities that are not smooth; for such densities an asymptotically optimal bandwidth is derived in [10]. Let D be the set of discontinuity points of f; then the asymptotically optimal bandwidth is given by

h_n^{\mathrm{opt}} = \left( \frac{\int K(t)^2\, dt}{\int_0^{\infty} \left( \int_t^{\infty} K(u)\, du \right)^2 dt} \right)^{1/2} \left( \sum_{d \in D} \left( f(d^+) - f(d^-) \right)^2 \right)^{-1/2} n^{-1/2},    (1.27)

where f(d^+) = \lim_{h \downarrow 0} f(d + h) and f(d^-) = \lim_{h \downarrow 0} f(d - h). So we conclude that for smooth densities the optimal bandwidth is of order n^{-1/5}, and for discontinuous densities the optimal bandwidth is of order n^{-1/2}. In practice, for finite samples, the optimal bandwidth has to be approximated. There are many methods of finding a suitable bandwidth, but in the simulations later on in this thesis we will use the Sheather and Jones bandwidth selection procedure [11].
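As a simple illustration of how (1.26) is turned into a practical bandwidth, the normal reference rule replaces the unknown integral of (f''(x))^2 by its value for a normal density with the sample standard deviation; for a Gaussian kernel this gives the well-known rule of thumb h ≈ 1.06 times the sample standard deviation times n^{-1/5}. The Matlab sketch below implements this rule; it is not the Sheather and Jones selector used in the simulations.

function h = normal_reference_bandwidth(X)
% Normal reference ("rule of thumb") bandwidth for a Gaussian kernel:
% plug a normal density with the sample standard deviation into (1.26).
n = length(X);
h = 1.06 * std(X) * n^(-1/5);
end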

Now that we have defined the measures by which we can quantify the 'quality' of a density estimator and know what the theoretical optimal bandwidths are, we can move on to the different methods of improving the estimator found by kernel estimation.

1.4. Decreasing Densities

When a probability density function is monotonically decreasing, the lower bound of the interval on which this density is positive has to exist; otherwise the function can never be a probability density, for it would not integrate to unity. So the density function takes positive (non-zero) values on an interval [a_l, ∞) and equals zero on the interval (−∞, a_l). This means that a monotonically decreasing density has a discontinuity at x = a_l; in the rest of this thesis we take a_l = 0, so the discontinuity lies at x = 0. However, most kernel estimators will not correct for this discontinuity and will take positive values on the interval where the target density equals zero. Especially at this boundary, the density estimate will be inaccurate and not even consistent. The problems induced by such a boundary are called boundary effects (or boundary constraints). Kernel estimators that do not take these boundary constraints into account will be called unconstrained estimators. In this section we will discuss the reflection method, a method that corrects the kernel density estimate for these boundary effects.

1.4.1. Reflection Method

In the previous sections, the kernel density method was introduced as a method to estimate a target density that satisfies certain smoothness criteria. For example, in the derivation of the AMISE (1.22) we assumed the target function f to have a continuous second derivative on the whole real line. In many cases densities are used that do not satisfy these conditions. A common example of such a density is the exponential-λ density, f(x) = \lambda e^{-\lambda x} \mathbf{1}\{x \ge 0\}, which has an obvious discontinuity at x = 0 and only positive values on the positive real line. From figure 1.5, in which an exponential-1 density is estimated by a kernel estimator with an Epanechnikov kernel from a sample of n = 100 observations (sample 2 in Appendix C.2), it becomes clear that the estimator gives too little weight to the points close to the boundary on the positive x-axis and too much (namely any) to points on the negative axis.

Figure 1.5.: Kernel density estimation with Epanechnikov kernel with bandwidth h = 0.7 (dashed line) of an exponential-1 density (solid line).

Such boundary effects influence the performance of the estimator near the boundary [8], [12]. To quantify this, suppose f is a density with f(x) = 0 for x < 0 and f(x) > 0 for x ≥ 0, which has a continuous second derivative away from x = 0. Let \hat{f} be a kernel estimator of f based on a kernel K with support [−1, 1] and bandwidth h. We express a point x as x = ch, with c ≥ 0. With the same change of variable as in equation (1.14) we find for the expectation of \hat{f}

E\hat{f}(x) = \int_{-1}^{c} K(t) f(x - ht)\, dt.    (1.28)

Now if c ≥ 1, that is x ≥ h, we saw in equation (1.18) that

E\hat{f}(x) = f(x) + h^2 b^{(2)}(x) + O(h^4),    (1.29)

for a kernel of order 2. If on the other hand 0 ≤ c < 1, we find after Taylor expansion of f(x − ht) around x

E\hat{f}(x) = f(x) \int_{-1}^{c} K(t)\, dt + o(h).    (1.30)

Because in general \int_{-1}^{c} K(t)\, dt \neq 1, \hat{f} is not consistent at points close to x = 0. By the assumption that the kernel K is symmetric, we find E\hat{f}(0) = \frac{1}{2} f(0) + o(h) at the boundary x = 0. Intuitively, this happens because the kernel estimate is a smooth function on the whole real line, while the target function f it estimates is not continuous at x = 0.

There exist multiple methods to create a consistent kernel estimator near the boundary. In this section we will take a closer look at the reflection method, as introduced by Schuster (1985) [1]. All observations are mirrored in x = 0, so to the set of observations X_1, ..., X_n the observations −X_1, ..., −X_n are added. Then the kernel estimator is applied to this new set of observations for x ∈ [0, ∞). This method can also be written as a kernel estimator on X_1, ..., X_n for x ∈ [0, ∞):

\hat{f}_R(x) = \frac{1}{nh} \sum_{i=1}^{n} \left[ K\left(\frac{x - X_i}{h}\right) + K\left(\frac{x + X_i}{h}\right) \right].    (1.31)

Note that \hat{f}_R(x) integrates to 1 on the positive real axis and is therefore a density. If we apply the reflection method to the same set of observations that was used in figure 1.5, this results in figure 1.6.
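A minimal Matlab sketch of the reflection estimator (1.31), here with a Gaussian kernel, is given below (illustrative only; the simulations use the code in Appendix B).

function fhat = reflection_estimate(xgrid, X, h)
% Reflection estimator (1.31): kernel estimate on [0, inf) based on the
% observations X and their mirror images -X.
K = @(t) exp(-t.^2 / 2) / sqrt(2*pi);                  % Gaussian kernel
n = length(X);
fhat = zeros(size(xgrid));
for j = 1:length(xgrid)
    fhat(j) = sum(K((xgrid(j) - X) / h) + K((xgrid(j) + X) / h)) / (n * h);
end
end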

At first sight \hat{f}_R(x) already is a better estimator of the target density f near the boundary, and it no longer gives any weight to negative values of x. This results in an asymptotically unbiased estimate of f. Following the same steps as before,

E\hat{f}_R(x) = E\left[ \frac{1}{nh} \sum_{i=1}^{n} \left( K\left(\frac{x - X_i}{h}\right) + K\left(\frac{x + X_i}{h}\right) \right) \right]
 = E\left[ \frac{1}{h} K\left(\frac{x - X}{h}\right) \right] + E\left[ \frac{1}{h} K\left(\frac{x + X}{h}\right) \right]
 = \int \frac{1}{h} K\left(\frac{x - y}{h}\right) f(y)\, dy + \int \frac{1}{h} K\left(\frac{x + y}{h}\right) f(y)\, dy
 = \int K(t) f(x - ht)\, dt + \int K(t) f(-x + ht)\, dt,    (1.32)

where in the first integral the change of variable y = x − ht is applied, and in the second integral y = −x + ht. If we again express a point x close to the boundary as x = ch, with c ∈ [0, 1), and consider that f(x) = 0 for x < 0 and K(t) = 0 for t < −1 and t > 1, we find

Figure 1.6.: Reflection method with Epanechnikov kernel with bandwidth h = 0.7 (dashed line) of an exponential-1 density (solid line).

E\hat{f}_R(x) = \int_{-1}^{c} K(t) f(x - ht)\, dt + \int_{c}^{1} K(t) f(-x + ht)\, dt
 = f(x) \int_{-1}^{c} K(t)\, dt + f(-x) \int_{c}^{1} K(t)\, dt + o(h),    (1.33)

using a Taylor expansion of f(x − ht) around x and of f(−x + ht) around −x. So for x = 0 the expectation becomes f(0) + o(h), and we see that the estimator \hat{f}_R(x) is asymptotically unbiased at the boundary. To achieve a smaller bias of order O(h^2) near the boundary, the generalized reflection method was introduced by Karunamuni and Alberts (2006) [13], but this is beyond the scope of this thesis.

1.5. Monotone Rearrangement

In some cases there is additional information about the target density. Suppose it is known that the probability density is a monotone function; it is then desirable to use this extra information to find a density estimate that matches the original density better. Clearly, the kernel estimation method discussed in the previous section does not take this information into account. In this section monotone rearrangement will be introduced, a method that adapts the approximate density obtained by kernel estimation so that it respects the given monotonicity of the target probability density.

In the remainder of this chapter, unless stated otherwise, f will be the target density and \hat{f} the not necessarily monotone estimate of f, for example generated by a kernel method as discussed in the previous sections.

1.5.1. Introducing Monotone Rearrangement

Let U be uniformly distributed on a compact interval A = [a_l, a_u] in R, and let g : A → R be a strictly increasing and differentiable function. Now the distribution function F_g(y) of g(U) is proportional to [14]

F_g(y) = \int_{a_l}^{a_u} \mathbf{1}\{g(u) \le y\}\, du + a_l = g^{-1}(y), \qquad y \in g([a_l, a_u]).    (1.34)

In the case where g is not a strictly increasing function, the distribution function is still given by the integral above, and we define the isotone increasing rearrangement as the generalized inverse of the distribution function:

g^*(x) = \inf\{y \in \mathbb{R} \mid F_g(y) \ge x\}.    (1.35)

In other words, g^* is the quantile function of F_g, and F_g(y) is the measure of the set on which g(u) \le y [3].

In a similar way the isotone decreasing rearrangement can be defined via the distribution function

F_g(y) = \int_{a_l}^{a_u} \mathbf{1}\{g(u) \ge y\}\, du + a_l, \qquad y \in g([a_l, a_u]),    (1.36)

and the decreasing rearrangement becomes

g_*(x) = \inf\{y \in \mathbb{R} \mid F_g(y) \le x\}.    (1.37)

This reasoning is valid for every compact interval A in R, so for reasons of convenience A = [0, 1] is used from now on.

In section 1.5.2 it is shown that the rearrangement of ˆf , as defined in (1.35) and (1.37), will always give a better approximation of the monotone target function, when applied to a non-monotonic estimation.

To illustrate the principle of monotone rearrangement, a simple example is studied. The method of monotone rearrangement as described above is applied to the function

g(x) = (2x - 1)^2, \qquad x \in [0, 1].    (1.38)

Clearly, this function is not monotone. Using definitions (1.34) and (1.35), the increasing monotone rearrangement of the function g is given by

g^*(x) := \inf\left\{ y \in \mathbb{R} \;\middle|\; \int_0^1 \mathbf{1}\{(2u - 1)^2 \le y\}\, du \ge x \right\}.    (1.39)

So the rearrangement sorts the values of g on [0, 1] in ascending order. The result is plotted in figure 1.7.

Figure 1.7.: Increasing monotone rearrangement (dashed line) applied to the function g(x) = (2x − 1)^2 (solid line).

In the same way, using definition (1.37), the decreasing rearrangement is given by

g_*(x) = \inf\left\{ y \in \mathbb{R} \;\middle|\; \int_0^1 \mathbf{1}\{(2u - 1)^2 \ge y\}\, du \le x \right\},    (1.40)

resulting in figure 1.8.
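In this example the rearrangements can also be computed in closed form, which serves as a check on figures 1.7 and 1.8. For y ∈ [0, 1], the set {u ∈ [0, 1] : (2u − 1)² ≤ y} is the interval [(1 − √y)/2, (1 + √y)/2] of length √y, so

F_g(y) = \int_0^1 \mathbf{1}\{(2u - 1)^2 \le y\}\, du = \sqrt{y},

and hence

g^*(x) = \inf\{y \mid \sqrt{y} \ge x\} = x^2, \qquad g_*(x) = \inf\{y \mid 1 - \sqrt{y} \le x\} = (1 - x)^2.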

In section 1.4.1 we used the reflection method to improve the kernel estimator. From figure 1.6 it is clear that the resulting estimate is not a monotone decreasing function, while the target density is. Applying the isotone decreasing rearrangement as defined in (1.37) to the reflected density, results in figure 1.9.

If we plot the reflection estimate and its monotone rearrangement together and zoom in at the part around x = 2 where the reflection method estimate is not monotone, we see how the monotone rearrangement method ’rearranges’ the function values to find a monotone estimate, as is shown in figure 1.10.

As we just saw, the monotone rearrangement method monotonizes the kernel density estimate, to achieve a better estimate of the monotone target density. However, the rearrangements obtained by equations (1.35) and (1.37) will not always be continuously differentiable functions. For example, if we take a look at the function f(x) = x + \frac{1}{4}\sin(4\pi x)

Figure 1.8.: Decreasing monotone rearrangement (dashed line) applied to the function g(x) = (2x − 1)^2 (solid line).

Figure 1.9.: Decreasing monotone rearrangement applied to reflection method with Epanechnikov kernel with bandwidth h = 0.7 (dashed line) of an exponential-1 density (solid line)

and its isotone rearrangement, as depicted in figure 1.11 below, we clearly see that the rearrangement is not everywhere continuously differentiable. In some cases however, it


Figure 1.10.: Decreasing monotone rearrangement (dotted line) applied to reflection method with Epanechnikov kernel (dashed line) of an exponential-1 density (solid line)

might be necessary to find a differentiable estimator of f. To find an everywhere differentiable increasing rearrangement, the indicator function in the distribution function (1.34) can be approximated by a kernel in the following way [14]. Let K_d be a positive kernel of order 2 with compact support [−1, 1] and h_d the corresponding bandwidth. Then

F_{g,h_d}(y) = \frac{1}{h_d} \int_0^1 \int_{-\infty}^{y} K_d\left(\frac{g(u) - v}{h_d}\right) dv\, du    (1.41)

is a smoothed version of the distribution function, and

g^*_{h_d}(x) = \inf\{y \in \mathbb{R} \mid F_{g,h_d}(y) \ge x\}    (1.42)

is called the smoothed increasing rearrangement of g. In the same way as before, the decreasing rearrangement can be defined via the smoothed distribution function

F_{g,h_d}(y) = \frac{1}{h_d} \int_0^1 \int_{y}^{\infty} K_d\left(\frac{g(u) - v}{h_d}\right) dv\, du,    (1.43)

and becomes

Figure 1.11.: Increasing isotone rearrangement (dashed line) applied to the function f(x) = x + \frac{1}{4}\sin(4\pi x) (solid line).

g_{*,h_d}(x) = \inf\{y \in \mathbb{R} \mid F_{g,h_d}(y) \le x\}.    (1.44)

For sufficiently small h_d, the smoothed rearrangements g^*_{h_d} and g_{*,h_d} of an unconstrained probability density are still probability densities, and they converge pointwise to the isotone rearrangements g^* and g_* respectively, as proved in Birke (2009) [14] (see Appendix A, theorem A.1 for the proof).

As pointed out in Neumeyer (2007) [15], the smoothed and the isotone rearrangement share the same rate of convergence. Depending on the properties required of the estimator, one of these methods can be chosen. The smoothed rearrangement will be preferred if one requires a smooth estimator; for the isotone rearrangement, however, there is no need to choose a bandwidth h_d, and flat parts of g will be better reflected by g^*. In the rest of this thesis and in the simulations, we will use the isotone rearrangement rather than the smoothed rearrangement.

1.5.2. Properties of Monotone Rearrangement

The method of monotone rearrangement has promising properties. Bennett and Sharpley (1988) [16] showed that the monotone rearrangement f^* of a function f has the same L^p-norm, p ∈ [1, ∞), as the function f: \|f^*\|_p = \|f\|_p. Chernozhukov et al. (2009) [2] proved that monotone rearrangement of an unconstrained estimate of the target function does not increase the estimation error in the L^p-norm, that is, \|\hat{f}^* - f\|_p \le \|\hat{f} - f\|_p (see Appendix A, theorem A.3 for a proof). Taking p = 2, this implies that the MISE of the rearranged estimate will be no larger than that of the original unconstrained estimate. Neumeyer (2007) [15] stated that the smoothed rearrangement of an estimator \hat{f} is pointwise consistent if \hat{f} is pointwise consistent. Furthermore, she showed that the asymptotic behavior of the rearrangement is the same as that of the original unconstrained estimator.
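On a finite grid this contraction property is easy to verify numerically. The small Matlab illustration below is not part of the thesis code; it relies on the discrete rearrangement inequality, which guarantees that sorting the estimate into the same (decreasing) order as the target can only decrease the Euclidean distance.

% Numerical check that || fstar - f ||_2 <= || fhat - f ||_2 on a grid,
% for a decreasing target f and a noisy, non-monotone "estimate" fhat.
x     = linspace(0, 5, 1000);
f     = exp(-x);                          % decreasing target (exponential-1 density)
fhat  = f + 0.05 * randn(size(x));        % perturbed, non-monotone estimate
fstar = sort(fhat, 'descend');            % decreasing rearrangement on the grid
err_original   = norm(fhat  - f);
err_rearranged = norm(fstar - f);         % never larger than err_original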

The above definitions of monotone rearrangement were only introduced for densities with bounded support. In many applications, as is the case with our example of the exponential-1 density, the support of the target density is unbounded. For decreasing densities it will often be of the form A = [a_l, ∞), and for increasing densities of the form A = (−∞, a_u]. Dette and Volgushev (2008) [17] showed that in these cases the rearrangements can be defined on an unbounded support and that the asymptotic behavior of the rearrangement is the same as described above. Hence, we can safely use monotone rearrangement in the estimation of the exponential-1 density.


2. Results

2.1. Comparing Methods

The main point of this thesis is to investigate whether improving the kernel estimator by the reflection method, as introduced in section 1.4.1, before applying the method of monotone rearrangement will result in a better estimate of a monotone target density. In the last section we saw that the asymptotic properties of the monotone rearrangement of an estimate are the same as those of the original estimate. So since the reflection method produces a consistent estimate of a density function with a discontinuity at x = 0, applying monotone rearrangement to this estimate will also produce a consistent estimate, and it will share the same asymptotic properties. In practice, however, one does not encounter samples of infinite size. Therefore it is valuable to look at the behaviour of the MISE of the methods for samples of smaller sizes. To measure the performance of the methods, simulations have been run in Matlab with different sample sizes. The density we will look at is the exponential-1 density we also used in section 1.4.1: f(x) = e^{-x} \mathbf{1}_{[0,\infty)}(x).

2.2. Simulations

In the simulations run for this thesis, we used a Sheather and Jones bandwidth selector. This selector was originally designed by Sheather and Jones (1991) [11] for smooth densities. In our case, we applied it to the exponential-1 density, which has a discontinuity at x = 0. Van Es and Hoogstrate (1997) [18] showed that the Sheather and Jones bandwidth selection method adapts slightly to the non-smoothness of the target density. The bandwidths produced by this method are of smaller order than n^{-1/5} (1.26), the optimal rate for smooth densities, but the optimal rate of n^{-1/2} (1.27) for densities with discontinuities is not reached. However, they state that the Sheather and Jones bandwidth still performs well in smaller sample size applications, which is why we chose this bandwidth selection method. We used a Matlab script published by Dynare [19], 'mh_optimal_bandwidth', which we adapted slightly to fit our needs (e.g. we removed options that were unnecessary in our application, to speed up the selection process and reduce the number of input variables). The kernel we used in the simulations is a Gaussian kernel, because the Sheather and Jones bandwidth selection method requires an at least six times differentiable kernel, and the Epanechnikov kernel is not continuously differentiable.

Simulations were run on samples of different sizes, n = 50, 100, 500, 1000. For each sample size, 1000 samples were generated, to which the kernel estimation methods were applied. The estimated densities were used to approximate the MSE and MISE of each method, where the average of the function values over the simulations was used to approximate the expectation of the estimator and the sample variance to approximate its variance. Furthermore, to calculate the MISE from the MSE, the composite Simpson's rule was used to approximate the integral from the numerical values.
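A sketch of this last step is given below (a hypothetical helper, not the script from Appendix B.4): given pointwise MSE values on an equidistant grid with an odd number of points, the composite Simpson's rule approximates the MISE as follows.

function mise = simpson_mise(mse, dx)
% Composite Simpson's rule applied to pointwise MSE values on an
% equidistant grid with spacing dx; length(mse) must be odd.
mse = mse(:)';                  % force a row vector
n = length(mse);
w = ones(1, n);
w(2:2:n-1) = 4;                 % weight 4 on even-indexed interior points
w(3:2:n-2) = 2;                 % weight 2 on odd-indexed interior points
mise = dx / 3 * sum(w .* mse);
end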

From each sample, the density was estimated by a 'pure' kernel estimator as in definition 1.1 with a Gaussian kernel (second column of table 2.1), the decreasing isotone monotone rearrangement of this kernel estimate (third column), the reflection method applied to the sample (fourth column) and the decreasing isotone monotone rearrangement applied to the reflection method (fifth column). After the calculation of the estimates, the MSE and the MISE of the estimators were calculated. Hereby we have to note that for the smallest sample size, n = 50, the values of the MISE can vary when the simulations are repeated. For the larger sample sizes the mean integrated square errors are stable under repetition of the simulation.

We tried several ways to implement the monotone rearrangement algorithm. An implementation literally following the definition in equation (1.37) turned out to be inaccurate if the simulations had to be run within reasonable calculation time. Another method of computing the rearranged estimate was proposed in Chernozhukov et al. (2009) [2], using quantiles of the set of numerical function values as the rearrangement. This method proved to be far more accurate in terms of MISE, and required significantly less calculation time. The most accurate and efficient method, however, also proposed in the same paper [2] and in Anevski and Fougères (2008) [20], was implementing the rearrangement as a sorting operator. If the function values of \hat{f}, the function we want to apply the rearrangement to, are calculated on a fine enough grid of equidistant points, the decreasing rearrangement can simply be computed by sorting these function values in decreasing order. This is therefore the method we used to calculate the mean integrated square errors for the rearrangements.
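A minimal sketch of this sorting implementation is shown below (the thesis' actual routine is in Appendix B.3; the function name here is illustrative).

function fstar = decreasing_rearrangement(fvalues)
% Decreasing monotone rearrangement of function values evaluated on an
% equidistant grid: sort the values in descending order.
fstar = sort(fvalues, 'descend');
end

On an equidistant grid this is essentially the quantile-based rearrangement of [2] evaluated at equally spaced probabilities.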

For the reproducibility of these results, one can find the full Matlab code used for the simulations in Appendix B.4.

n       MISE      MISE_rearr   MISE_refl   MISE_refl+rearr
50      0.1235    0.0813       0.0232      0.0231
100     0.1065    0.0655       0.0156      0.0156
500     0.0706    0.0334       0.0052      0.0052
1000    0.0585    0.0243       0.0032      0.0032

Table 2.1.: The MISE for the different methods of improving the kernel estimate, for samples of size n. Here 'MISE' refers to the unconstrained kernel estimator, 'rearr' to monotone rearrangement and 'refl' to the reflection method.

The mean square errors of the simulations for sample size n = 1000 are presented in figure 2.1, where it becomes clear that all methods have comparable performance away from the boundary x = 0. The difference in the performance of these methods lies, as we would expect, close to the boundary. A close-up of this area is shown in the second graph of figure 2.1.

Figure 2.1.: In the estimation of the exponential-1 density, the MSE of the 'pure' kernel estimator (solid line), its monotone rearrangement (dashed line), the reflection method (dotted line) and its monotone rearrangement (dotted-dashed line). Note that the dotted line and the dotted-dashed line are indistinguishable in these plots. The graph below is a close-up of the area close to the boundary x = 0.


3. Conclusion

As we saw from table 2.1, for each sample size n, applying decreasing rearrangement after applying the reflection method has no, or only a very small, effect. One could think the MISE is reduced in the simulation with sample size n = 50, but because the variation in the MISE for this sample size is relatively big, the small improvement is not significant. What does become clear from table 2.1 is that both the monotone rearrangement method and the reflection method significantly reduce the MISE. It also shows that in terms of the reduction of the MISE, the reflection method is preferred over monotone rearrangement. However, in terms of computation time the monotone rearrangement algorithm performs better than the reflection method. The time necessary for rearranging a given estimate is negligible compared to the computation time necessary for kernel estimation. For example, for a sample of size n = 1000, unconstrained kernel estimation and monotone rearrangement take roughly 0.5 seconds, while the reflection method takes somewhat more than 1 second to compute. Rearranging the reflected estimate takes negligible time, but on the other hand will not improve the estimate significantly. Therefore, for extremely large samples it might be preferable to use the monotone rearrangement algorithm rather than the reflection method.

The conclusion we can draw from the data in table 2.1 is that combining the method of monotone rearrangement and the reflection method does not significantly improve the estimate relative to just applying the reflection method.


4. Discussion and Review of the Progress

Several remarks can be made about the simulations run in this thesis. First of all, it should be stressed that the sample size n = 50 is too small to draw any conclusions from, for the MISE varies a lot when the estimation procedure is repeated on a different set of samples. Secondly, more monotone probability densities should be investigated to confirm the answer to the research question more firmly. And thirdly, the Sheather and Jones bandwidth selector we used to estimate the target density in this thesis is not optimal for discontinuous densities, as was stated in section 1.3. Therefore other bandwidth selection methods should be investigated to improve the density estimates. For example, for larger sample sizes, one could use the least squares cross-validation bandwidths, which are discussed in Stone (1984) [21]. These bandwidths are asymptotically equivalent to the optimal bandwidths, even in the non-smooth cases [18]. This might be the reason why the accuracy of the unconstrained kernel estimator is lower than that in [14], where a different bandwidth selector was used. Since the monotone rearranged estimator is based on this less accurate kernel estimator, the monotone rearrangement will also be less accurate, and thus the mean integrated square errors will not be comparable to those from [14].

In this thesis, we started out with the introduction of kernel estimators and some basic properties of this method. Then we pointed out the boundary effects that occur when a target density with a discontinuity at x = 0 is estimated. As a solution to the loss of consistency, the reflection method was discussed. Monotone rearrangement, a method rearranging function values in increasing or decreasing order, was proposed for the case where a monotonic density is to be estimated. Simulations in Matlab were run to compare the reflection method and the method of monotone rearrangement separately and applied together.

Review of the Progress

Taking on a project extending over such a long period of time was new for me, and I learned a lot in the process. Looking back on the last few months, there are a lot of things I could have done better, but also a lot of things that worked out well. The most important lesson I learned in these months is that I should not hesitate to approach my supervisor. While reading the articles I used for this thesis, there were moments where I was struggling a lot with minor details, which in some cases cost me days. However, a simple explanation by my supervisor, sometimes just a clarification of a definition, was enough to get me back on track.

After being introduced to the subject of monotone rearrangement by Dr. Van Es, I started familiarizing myself with the subject. There were a lot of books at hand with a proper introduction to kernel estimation to get acquainted with this principle. However, since the method of monotone rearrangement was only recently introduced into statistics, I had to get all the information on this subject out of articles that are not written for undergraduate students who are new to this field of research. This made it harder to fully understand the method and work with it. Therefore it took me more time than I had expected, especially because most of the time while I was working on this project I still had to follow courses, so I had to divide my attention. Only when the courses ended could this project finally get my full attention, and in those weeks the real progress was made. Furthermore, simulating in Matlab was relatively new to me, but fortunately it did not take me long to gain experience in the programming and I really enjoyed it.

Personally, I found the conclusion drawn rather surprising, for I had expected that monotone rearrangement would perform better in combination with the reflection method. If I had had more time for this project, I would have liked to find out why the combined performance of these methods is relatively poor. Also, I would have liked to take a look at another method of nonparametric density estimation, the method of nonparametric maximum likelihood estimators (NPMLE), and measure its performance relative to the kernel estimators. On top of that, I would have liked to learn more about asymptotic statistics to find out more about the asymptotic properties of different estimation methods. Luckily, I will get this chance next year during my master's.

To conclude, I would like to thank Dr. Van Es for introducing me to this subject and his support during the process of writing this thesis.


5. Popular Summary (Populaire Samenvatting)

The subject of this bachelor thesis is the estimation of probability densities and various methods to improve such an estimate. To fully understand this, it is of course necessary to know what a probability density is.

Simply put, a probability density is a function that assigns to each event the probability of its occurrence. In the case where there are finitely many possibilities, this is even the definition of a probability density. To clarify everything, we will explore the concepts further by means of an example. Suppose we would like to be able to predict the length of an eruption of a certain geyser. To do so, we stand next to the geyser with a stopwatch for a month, and every time it erupts we measure how many minutes the eruption lasts. At the end of the month we have made, say, 50 measurements (we will use the first half of the data set from Appendix C.2 to simulate these data). A commonly used way to represent such data is a histogram. We divide the time into bins, and if the length of an eruption falls within a bin, we make that bin slightly higher. This way we eventually obtain a histogram in which the height of each bin is the relative occurrence of the eruption length, as can be seen in the left part of figure 5.1.

Figure 5.1.: Left: histogram of the eruption lengths (first half of the data in Appendix C.2). Right: estimate of the probability density based on the histogram.

If we now imagine the bins away, take only their tops and regard this as a curve, as is done in the right part of figure 5.1, we could interpret this as an estimate of the probability density. The area above an interval then represents the probability that the length of the eruption falls within that interval.

Mathematicians, however, are for several reasons not entirely satisfied with such an estimate of the probability density. For instance, the fact that the density contains all kinds of jumps is not satisfying. We would rather obtain a smooth curve. A method to achieve this is the so-called kernel estimator method. Instead of placing the data points in a bin, we place a curve over each data point, which assigns weight to the point and the region around it. Adding up all these curves then gives us an estimate of the probability density. If we apply this to our data set, the result is figure 5.2. In the left part we made the estimate based on 4 measurements and plotted the individual curves for the data points; in the right figure the estimate is based on all measurements. We have also plotted the true probability density from which our data were sampled (the exponential-1 density), to see how well it is approximated. We see that this method, unlike the histogram, does produce a nice, smooth curve, which in all cases will lie closer to the true probability density than our histogram estimate. To optimize this estimate, much can still be gained in the choice of the width of the individual curves, but for now this estimate is illustrative enough.

Figure 5.2.: Left: the kernel estimate of the probability density based on 4 measurements. Right: the kernel estimate of the probability density based on all data (solid line), and the true probability density (dashed line).

Now suppose we have the extra information that the probability of short eruptions is always greater than the probability of long eruptions. If we then make an estimate of the probability density of the eruption lengths, we would of course like to use this information to improve our estimate. The relatively simple, yet only recently discovered method in this field is a method that essentially comes down to rearranging the function values of the right-hand part of figure 5.2. The method works as follows: at a large number of equidistant points on the x-axis we compute the function value of the kernel estimate. Then we sort those function values from high to low. The function obtained in this way is our new estimate. In our case this results in figure 5.3, where we have also plotted the true probability density.

Figure 5.3.: The monotone rearrangement of the estimate from the right-hand part of figure 5.2 (solid line), and the true probability density (dashed line).

At first sight this estimate already looks promising, and this method of monotone rearrangement turns out to have promising theoretical properties. In this bachelor thesis, besides this method, another method of improving an estimate obtained with kernel estimators is treated. It is then investigated whether combining the two methods significantly improves the estimate. After running some simulations, however, we conclude that this is not the case.


Bibliography

[1] E. F. Schuster, Incorporating support constraints into nonparametric estimators of densities, Communications in Statistics - Theory and Methods, 14, 5, 1123-1136, 1985.

[2] V. Chernozhukov, I. Fernández-Val and A. Galichon, Improving Point and Interval Estimators of Monotone Functions by Rearrangement, Biometrika, 96, 3, 559-575, 2009.

[3] G. H. Hardy, J. E. Littlewood and G. Pólya, Inequalities, Cambridge University Press, 1952.

[4] V. Epanechnikov, Nonparametric Estimation of a Multidimensional Probability Density, Theory of Probability and its Applications, 14, 153-158, 1969.

[5] B. W. Silverman, Density Estimation for Statistics and Data Analysis, Chapman and Hall, 1986.

[6] B. L. S. Prakasa Rao, Nonparametric Functional Estimation, New York: Academic Press, 1983.

[7] P. Whittle, On the Smoothing of Probability Density Functions, Journal of the Royal Statistical Society. Series B, 20, 2, 334-343, 1958.

[8] M.P. Wand, M.C. Jones, Kernel Smoothing, Chapman and Hall, 1995.

[9] E. Parzen, On Estimation of a Probability Density Function and Mode, The Annals of Mathematical Statistics, 33, 1065-1076, 1962.

[10] C. Van Eeden, Mean Integrated Squared Error of Kernel Estimators when the Density and its Derivatives are not Necessarily Continuous, Annals of the Institute of Statistical Mathematics, 37, A, 461-472, 1985.

[11] S. J. Sheather, M. C. Jones, A Reliable Data-Based Bandwidth Selection Method for Kernel Density Estimation, Journal of the Royal Statistical Society. Series B, 53, 683-690, 1991.

[12] I. Horová, J. Koláček, J. Zelinka, Kernel Smoothing in MATLAB, World Scientific, 2012.


[13] R. J. Karunamuni, T. Alberts, A Locally Adaptive Transformation Method of Boundary Correction in Kernel Density Estimation, Journal of Statistical Planning and Inference, 136, 5, 2936-2960, 2006.

[14] M. Birke, Shape Constrained Kernel Density Estimation, Journal of Statistical Planning and Inference, 139, 8, 2851-2862, 2009.

[15] N. Neumeyer, A Note on Uniform Consistency of Monotone Function Estimators, Statist. Probab. Lett., 77, 693-703, 2007.

[16] C. Bennett, R. C. Sharpley, Interpolation of Operators, Academic Press New York, 1988.

[17] H. Dette, S. Volgushev, Non-crossing nonparametric estimates of quantile curves, Journal of the Royal Statistical Society. Series B, 70, 3, 609-627, 2008.

[18] A. J. Van Es, A. J. Hoogstrate, How much do plug-in bandwidth selectors adapt to non-smoothness?, Journal of Nonparametric Statistics, 8, 2, 185-197, 1997.

[19] http://www.dynare.org/dynare-matlab-m2html/matlab/mh_optimal_bandwidth.html

[20] D. Anevski, A. Fougères, Limit Properties of the Monotone Rearrangement for Density and Regression Function Estimation, arXiv:0710.4617 [math.ST], preprint (2008).

[21] C. J. Stone, An Asymptotically Optimal Window Selection Rule for Kernel Density Estimates, Annals of Statistics, 12, 4, 1285-1297, 1984.

[22] D. Pollard, A User’s Guide to Measure Theoretic Probability, Cambridge University Press, 2002.


A. Theorems and Proofs

The following theorem, as stated in Birke (2009) [14], proves that the smoothed rearrangements $g^*_{h_d}$ and $g_{*,h_d}$ of an unconstrained probability density are still probability densities, and converge pointwise to the isotone rearrangements $g^*$ and $g_*$ respectively. We will here elaborate on the proof from Birke, to make it readable for a third-year bachelor mathematics student. The theorem below states the result for the increasing rearrangement. The proof for the decreasing case is similar, and therefore omitted.

Theorem A.1. Let $D$ be the set of all probability densities on $A$, a compact interval in $\mathbb{R}$. For any unconstrained continuous probability density $g \in D$, the smoothed rearrangement $g^*_{h_d}$ of $g$ converges pointwise to $g^* \in D$ as $h_d \to 0$.

Proof. We will first show that $F_{g,h_d}(t)$ converges to $F_g(t)$ as $h_d \to 0$. Without loss of generality, we choose $A = [0,1]$. Then, from the definition of $F_{g,h_d}(t)$,
$$F_{g,h_d}(t) := \frac{1}{h_d}\int_0^1\int_{-\infty}^{t} K_d\!\left(\frac{g(u)-v}{h_d}\right)dv\,du.$$
Because the kernel $K_d$ is $0$ outside $[-1,1]$,
$$K_d\!\left(\frac{g(u)-v}{h_d}\right) = 0 \quad \text{if } v \le g(u) - h_d,$$
we can change the lower bound in the second integral to $g(u) - h_d$, and because the second integral is zero if $g(u) > t + h_d$, we get
$$\frac{1}{h_d}\int_0^1\int_{-\infty}^{t} K_d\!\left(\frac{g(u)-v}{h_d}\right)dv\,du = \frac{1}{h_d}\int_0^1 \mathbf{1}\{g(u) \le t + h_d\}\int_{g(u)-h_d}^{t} K_d\!\left(\frac{g(u)-v}{h_d}\right)dv\,du$$
$$= \int_0^1 \mathbf{1}\{g(u) \le t + h_d\}\int_{(g(u)-t)/h_d}^{1} K_d(z)\,dz\,du, \tag{A.1}$$

by the change of variable $v = g(u) - h_d z$. Now, since $K_d$ integrates to unity, the inner integral equals one whenever $(g(u)-t)/h_d \le -1$, so we can write
$$\int_0^1 \mathbf{1}\{g(u) \le t + h_d\}\int_{(g(u)-t)/h_d}^{1} K_d(z)\,dz\,du = \int_0^1 \mathbf{1}\{g(u) \le t + h_d\}\,\mathbf{1}\{g(u) \le t - h_d\}\,du + \int_0^1 \mathbf{1}\{t - h_d \le g(u) \le t + h_d\}\int_{(g(u)-t)/h_d}^{1} K_d(z)\,dz\,du$$
$$= \int_0^1 \mathbf{1}\{g(u) \le t - h_d\}\,du + \int_0^1 \mathbf{1}\{t - h_d \le g(u) \le t + h_d\}\int_{(g(u)-t)/h_d}^{1} K_d(z)\,dz\,du. \tag{A.2}$$
Thus, by the definition of $F_g(t)$,
$$F_g(t) := \int_0^1 \mathbf{1}\{g(u) \le t\}\,du,$$
we get
$$|F_g(t) - F_{g,h_d}(t)| = \left|\int_0^1 \mathbf{1}\{t - h_d \le g(u) \le t\}\,du - \int_0^1 \mathbf{1}\{t - h_d \le g(u) \le t + h_d\}\int_{(g(u)-t)/h_d}^{1} K_d(z)\,dz\,du\right|$$
$$\le \int_0^1 \mathbf{1}\{t - h_d \le g(u) \le t\}\,du + \int_0^1 \mathbf{1}\{t - h_d \le g(u) \le t + h_d\}\,du \longrightarrow 0,$$
as $h_d \to 0$, using the uniform continuity of $g$. Since $g^*$ and $g^*_{h_d}$ are the generalized inverses of $F_g$ and $F_{g,h_d}$ respectively, the continuity of the functional that maps a function onto its inverse at a fixed point $x$ implies
$$g^*_{h_d}(x) \longrightarrow g^*(x) \quad \text{as } h_d \longrightarrow 0. \tag{A.3}$$
So we have shown the pointwise convergence. We now have to show that $g^*$ is still a density to conclude the proof. Since, by definition of monotone rearrangement, $g^*$ only rearranges the function values in increasing order and $g \ge 0$, we find $g^* \ge 0$. Furthermore, because monotone rearrangement of a function preserves the $L^p$-norm, as shown by Bennett and Sharpley (1988) [16], taking $p = 1$ we get
$$\int_A g^*(x)\,dx = \|g^*\|_1 = \|g\|_1 = \int_A g(x)\,dx = 1. \tag{A.4}$$
Thus we conclude that $g^* \in D$, which proves the theorem.
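As a small added illustration of the objects in this proof (not taken from Birke), consider the decreasing density $g(u) = 2(1-u)$ on $A = [0,1]$. Then
$$F_g(t) = \int_0^1 \mathbf{1}\{2(1-u) \le t\}\,du = \int_0^1 \mathbf{1}\{u \ge 1 - t/2\}\,du = \frac{t}{2}, \qquad t \in [0,2],$$
so the generalized inverse is $g^*(x) = 2x$: the increasing rearrangement of $g$ is its mirror image, and indeed $\int_0^1 g^*(x)\,dx = \int_0^1 g(x)\,dx = 1$, in line with (A.4).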


The following theorem, as described by Chernozhukov et al. (2009) [2], implies that the increasing rearrangement as defined in (1.35) will be a better approximation of the target function in the standard norm on $L^p$: for a measurable function $f : A \to K$, $\|f\|_p = \left(\int_A |f(x)|^p\,dx\right)^{1/p}$. To prove the theorem, we will first need the following definition.

Definition A.2. A function $L : \mathbb{R}^k \to \mathbb{R}$ is called submodular if
$$L(x \uparrow y) + L(x \downarrow y) \le L(x) + L(y), \tag{A.5}$$
for all $x, y \in \mathbb{R}^k$, where $x \uparrow y$ denotes the componentwise maximum and $x \downarrow y$ denotes the componentwise minimum of $x$ and $y$.
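The proofs below use that $L(v,t) = |v - t|^p$ with $p \in [1,\infty)$ is submodular; since this fact is only asserted there, here is a brief verification sketch. Take $x = (v,t)$ and $y = (v',t')$ with, say, $v \le v'$. If also $t \le t'$, then $x \uparrow y = y$ and $x \downarrow y = x$, so (A.5) holds with equality. If instead $t > t'$, then $x \uparrow y = (v', t)$ and $x \downarrow y = (v, t')$; writing $a = v - t$, $b = v' - t'$, $c = v' - t$ and $d = v - t'$, we have $a \le c \le b$, $a \le d \le b$ and $c + d = a + b$, so convexity of $s \mapsto |s|^p$ gives $|c|^p + |d|^p \le |a|^p + |b|^p$, which is exactly $L(x \uparrow y) + L(x \downarrow y) \le L(x) + L(y)$.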

Below, we formulate the theorem by Chernozhukov et al. and elaborate on the proof. The proof deals with the increasing rearrangement; for the decreasing case the proof is similar and therefore omitted.

Theorem A.3. Let $A$ be a compact interval in $\mathbb{R}$ and let the target function $f : A \to K$ be a weakly increasing measurable function in $x$, where $K$ is a bounded subset of $\mathbb{R}$. Let $\hat{f} : A \to K$ be a measurable estimate of the target function.

1. For any $p \in [1, \infty)$, applying monotone rearrangement to $\hat{f}$, to get $\hat{f}^*$, weakly reduces the estimation error:
$$\|\hat{f}^* - f\|_p \le \|\hat{f} - f\|_p. \tag{A.6}$$

2. Suppose there exist regions $A_0$ and $A_1$, each with measure greater than $\delta > 0$, such that for all $x \in A_0$ and $x' \in A_1$: (i) $x' > x$, (ii) $\hat{f}(x) > \hat{f}(x') + \epsilon$, and (iii) $f(x') > f(x) + \epsilon$, for some $\epsilon > 0$. Then, for any $p \in [1, \infty)$:
$$\|\hat{f}^* - f\|_p \le \left[\|\hat{f} - f\|_p^p - \delta \eta_p\right]^{1/p}, \tag{A.7}$$
where $\eta_p = \inf\{|v - t'|^p + |v' - t|^p - |v - t|^p - |v' - t'|^p\}$ and $\eta_p > 0$ for $p \in (1, \infty)$, with the infimum taken over all $v, v', t, t'$ in the set $K$ such that $v' \ge v + \epsilon$ and $t' \ge t + \epsilon$.

Proof. We will start by proving the first part of the theorem, considering simple functions first and using the dominated convergence theorem to extend to the general case. Suppose both the estimate $\hat{f}(\cdot)$ and the target function $f(\cdot)$ are simple functions that are constant on the intervals $\left(\frac{s-1}{r}, \frac{s}{r}\right]$, $s = 1, \ldots, r$. Each simple function $g(\cdot)$ of this form can be written as the step function $g(\cdot) = \sum_{s=1}^r g_s \mathbf{1}_{\left(\frac{s-1}{r}, \frac{s}{r}\right]}(\cdot)$. Now we can define the $r$-vector $g$ as the vector with value $g_s$, the value $g(\cdot)$ takes on the $s$th interval, on position $s$. From this definition we see that each $r$-vector corresponds to a step function.

We will now define the sorting operator $S$, acting on the vector $g$, as follows. Let $l$ be an integer in $1, \ldots, r$ such that $g_l > g_m$ for some $m > l$. If such an $l$ exists, set $S(g)$ to be the $r$-vector with $g_m$ on the $l$th position and $g_l$ on the $m$th position, and all other elements unchanged.

For any submodular function $L : \mathbb{R}^2 \to \mathbb{R}_+$, because $g_l \ge g_m$ and, by the fact that $f$ is weakly increasing, $f_m \ge f_l$, we get
$$L(g_m, f_l) + L(g_l, f_m) \le L(g_l, f_l) + L(g_m, f_m). \tag{A.8}$$
Therefore, taking into consideration that $L$ of simple functions is still a simple function,
$$\int_A L\big(S(\hat{f})(x), f(x)\big)\,dx = \frac{1}{r}\sum_{s=1}^r L\big(S(\hat{f})_s, f_s\big) \tag{A.9}$$
$$\le \frac{1}{r}\sum_{s=1}^r L\big(\hat{f}_s, f_s\big) = \int_A L\big(\hat{f}(x), f(x)\big)\,dx, \tag{A.10}$$
where the inequality follows from the submodularity of $L$. If we apply the sorting operator $S$ sufficiently many times to $\hat{f}$ (a finite number of applications suffices), we find the rearranged vector $\hat{f}^*$: a vector completely sorted in ascending order. Each application of the sorting operator $S$ can only decrease or leave unchanged the integral, and since we apply it a finite number of times we get
$$\int_A L\big(\hat{f}^*(x), f(x)\big)\,dx = \int_A L\big(S \circ \ldots \circ S(\hat{f})(x), f(x)\big)\,dx \le \int_A L\big(\hat{f}(x), f(x)\big)\,dx. \tag{A.11}$$
If we now notice that $L(x, y) = |x - y|^p$ is submodular for $p \in [1, \infty)$, we see that this implies that $\|\hat{f}^* - f\|_p \le \|\hat{f} - f\|_p$. Thus, for simple $\hat{f}$ and $f$ the rearranged estimate has an approximation error that is smaller than or equal to the approximation error of the original estimate. We will now extend this inequality to the general case.

Let $\hat{f}(\cdot)$ and $f(\cdot)$ be measurable functions mapping $A$ to $K$; then there exists a sequence of bounded simple functions $\hat{f}^{(r)}(\cdot)$ and $f^{(r)}(\cdot)$, taking values in $K$, converging to $\hat{f}(\cdot)$ and $f(\cdot)$ almost everywhere as $r \to \infty$ [22]. Because of the definition of the rearrangements as quantile functions, the almost everywhere convergence of $\hat{f}^{(r)}(\cdot)$ to $\hat{f}(\cdot)$ and of $f^{(r)}(\cdot)$ to $f(\cdot)$ implies the almost everywhere convergence of their rearrangements $\hat{f}^{*(r)}(\cdot)$ to the rearrangement of the limit $\hat{f}^*(\cdot)$, and of $f^{*(r)}(\cdot)$ to $f^*(\cdot) = f(\cdot)$ [23], [16]. Note that the last equality holds because for the weakly increasing function $f$, the rearrangement $f^*$ equals $f$. Since inequality (A.11) holds along the sequence, the dominated convergence theorem implies that it also holds in the general case. This concludes the proof of part 1 of the theorem.
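The inequality of part 1 is easy to check numerically. The following MATLAB fragment is an illustrative sketch (not part of the thesis code): it rearranges a noisy estimate of an increasing target on an equispaced grid and compares the empirical $L^p$ errors.

r = 200;                          % number of grid cells
u = ((1:r) - 0.5)/r;              % midpoints of the intervals ((s-1)/r, s/r]
f = u.^2;                         % a weakly increasing target function
fhat = f + 0.3*randn(1,r);        % a noisy, in general non-monotone, estimate
fstar = sort(fhat);               % increasing rearrangement of the estimate
p = 2;
err_orig = (sum(abs(fhat  - f).^p)/r)^(1/p);
err_rear = (sum(abs(fstar - f).^p)/r)^(1/p);
% err_rear <= err_orig holds for every realization, as in (A.6)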

We will now move on to part 2 of the theorem. We start out the same way as in the proof of part 1, with simple functions $\hat{f}$ and $f$. We take these functions to satisfy the conditions of the theorem, that is: there exist regions $A_0$ and $A_1$, each with measure greater than $\delta > 0$, such that for all $x \in A_0$ and $x' \in A_1$: (i) $x' > x$, (ii) $\hat{f}(x) > \hat{f}(x') + \epsilon$, and (iii) $f(x') > f(x) + \epsilon$, for some $\epsilon > 0$. Then, for any strictly submodular function $L : \mathbb{R}^2 \to \mathbb{R}_+$, we find
$$\eta = \inf\{L(v', t) + L(v, t') - L(v, t) - L(v', t')\} > 0, \tag{A.12}$$
with the infimum taken over all $v, v', t, t'$ in the set $K$ such that $v' \ge v + \epsilon$ and $t' \ge t + \epsilon$.

We will begin the sorting by exchanging the elements $\hat{f}(x)$, $x \in A_0$, of the $r$-vector $\hat{f}$ with the elements $\hat{f}(x')$, $x' \in A_1$, of $\hat{f}$, for as long as this is possible. Each point sorted this way induces a sorting gain of at least $\eta$, from the strict submodularity in equation (A.12), times $1/r$, the length of the interval where $\hat{f}(x)$ has the value exchanged. The total mass of points that can be sorted this way is at least $\delta$, since both regions have measure greater than $\delta$. So if we sort all these points in this way, the gain is at least $\delta\eta$. If we then continue to sort the remaining points with the sorting operator used in the proof of part 1, we get the inequality
$$\int_A L\big(\hat{f}^*(x), f(x)\big)\,dx \le \int_A L\big(\hat{f}(x), f(x)\big)\,dx - \delta\eta. \tag{A.13}$$
In exactly the same way as in the proof of part 1, we can extend this to the case of general measurable functions. This concludes the proof of the theorem.


B. MATLAB code

Our target density, the exponential-1 density function.

function [f] = exppdf(lambda,grid)
% Evaluate the exponential(lambda) density at every grid point;
% the density is zero on the negative half-line.
l = length(grid);
f = zeros(1,l);
for i = 1:l
    if grid(i) >= 0
        f(i) = lambda * exp(-lambda*grid(i));
    end
end
end

B.1. Epanechnikov Kernel Estimator

The implementation of the Epanechnikov kernel estimator.

function f = Epan(x)
% The Epanechnikov kernel
f = 0;
if abs(x) <= 1
    f = 0.75*(1 - x^2);
end
end

function y = EpKernel(datapoint,h,grid)
% Evaluate the kernel K((x - datapoint)/h) at every grid point x
n = length(grid);
y = zeros(1,n);
for i = 1:n
    y(i) = Epan((grid(i) - datapoint)/h);
end
end

function [estimate] = kernestEp(data,grid,h)
% Epanechnikov kernel density estimate on the grid, with bandwidth h
l = length(grid);
n = length(data);
estimate = zeros(1,l);
for k = 1:n
    estimate = estimate + (EpKernel(data(k),h,grid)/(n*h));
end
end
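In formula form, kernestEp evaluates the standard kernel density estimator
$$\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)$$
at every grid point, with $K$ the Epanechnikov kernel $K(u) = \tfrac{3}{4}(1-u^2)\mathbf{1}\{|u| \le 1\}$ implemented by Epan.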

The reflection method for the Epanechnikov kernel estimator.

function [estimate] = kernestmirrorEp(data,grid,h)
l = length(grid);
n = length(data);
estimate = zeros(1,l);
% reflection of the data around zero
datas = zeros(1,2*n);
for j = 1:n
    datas(j) = -data(j);
    datas(n + j) = data(j);
end
% kernel estimate based on the 2n reflected points, still divided by n*h
for k = 1:2*n
    estimate = estimate + (EpKernel(datas(k),h,grid)/(n*h));
end
% the estimate is set to zero on the negative half-line
for z = 1:l
    if grid(z) < 0
        estimate(z) = 0;
    else
        break
    end
end
end
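The function kernestmirrorEp (and kernestmirrorNor below) implements the reflection estimator: for $x \ge 0$,
$$\hat{f}_{\mathrm{refl}}(x) = \frac{1}{nh}\sum_{i=1}^{n}\left[K\!\left(\frac{x - X_i}{h}\right) + K\!\left(\frac{x + X_i}{h}\right)\right],$$
and $\hat{f}_{\mathrm{refl}}(x) = 0$ for $x < 0$; the loop over the $2n$ reflected data points, divided by $nh$, computes exactly this sum.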


B.2. Gaussian Kernel Estimator

The implementation of the Gaussian kernel estimator.

function f = npdf(x)
% The standard normal density
f = 1/(sqrt(2*pi))*exp(-(x.^2)/2);
end

function y = NorKernel(datapoint,h,grid)
% Evaluate the Gaussian kernel K((x - datapoint)/h) at every grid point x
l = length(grid);
y = zeros(1,l);
for i = 1:l
    y(i) = npdf((grid(i) - datapoint)/h);
end
end

function [estimate] = kernestNor(data,grid,h)
% Gaussian kernel density estimate on the grid, with bandwidth h
l = length(grid);
n = length(data);
estimate = zeros(1,l);
for k = 1:n
    estimate = estimate + (NorKernel(data(k),h,grid)/(n*h));
end
end

The reflection method for the Gaussian kernel estimator.

function [estimate] = kernestmirrorNor(data,grid,h)
l = length(grid);
n = length(data);
estimate = zeros(1,l);
% reflection of the data around zero
datas = zeros(1,2*n);
for j = 1:n
    datas(j) = -data(j);
    datas(n + j) = data(j);
end
% kernel estimate based on the 2n reflected points, still divided by n*h
for k = 1:2*n
    estimate = estimate + (NorKernel(datas(k),h,grid)/(n*h));
end
% the estimate is set to zero on the negative half-line
for z = 1:l
    if grid(z) < 0
        estimate(z) = 0;
    else
        break
    end
end
end

B.3. Monotone Rearrangement

The implementation of the decreasing monotone rearrangement: the function values are sorted in decreasing order.

function [rearr] = mon_rearr(fx)
% sort in increasing order after a sign flip, then flip back,
% which returns the values of fx in decreasing order
y = sort(-fx);
rearr = -y;
end
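A minimal usage sketch of the functions above (not one of the thesis scripts; the bandwidth h = 0.4 is hand-picked purely for illustration): estimate the exponential-1 density from a sample, apply the reflection method, and rearrange the estimate on the positive part of the grid.

grid = linspace(-2,7,601);                % evaluation grid
data = exprnd(1,1,100);                   % sample of size 100 from the exponential-1 density
h = 0.4;                                  % hand-picked bandwidth, for illustration only
f_kern = kernestNor(data,grid,h);         % plain Gaussian kernel estimate
f_refl = kernestmirrorNor(data,grid,h);   % reflection estimate, zero for x < 0
pos = grid >= 0;                          % rearrange only on the support [0, 7]
f_mono = zeros(size(grid));
f_mono(pos) = mon_rearr(f_refl(pos));     % decreasing rearrangement of the reflected estimate
plot(grid,f_kern,'--',grid,f_refl,':',grid,f_mono,'-');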

B.4. Simulations

Simulations and calculations of the MSE and MISE.

%set number of grid intervals
l = 600;
%sample size
n = 500;
%number of samples
p = 1000;
%sample parameter
lambda = 1;
%begin interval
a = -2;
%end interval
b = 7;
%setting grid
x = linspace(a,b,l+1);

%creating grid of the positive x
x_0 = [];
for i = 1:l+1
    if x(i) >= 0
        x_0 = [x_0, x(i)];
    end
end

%distance between points
d = (b-a)/l;

%target distribution
f = exppdf(lambda,x);

%pure kernel estimation
estimate = zeros(p,l+1);
%monotone rearrangement
estimate_mon = zeros(p,l+1);
%reflection method
estimate_ref = zeros(p,l+1);
%reflection and monotone rearrangement
estimate_ref_mon = zeros(p,l+1);

for i = 1:p
    %generating sample
    sample = exprnd(lambda,1,n);

    %bandwidth selection

    %pure kernel estimation
    f_i = kernestNor(sample,x,h);
    %storing estimate
    estimate(i,:) = f_i;

    %monotone rearrangement
    fm_i = mon_rearr(f_i);
    fm_i = fm_i(1:length(x_0));
    %storing estimate
    estimate_mon(i,l+1-length(x_0)+1:end) = fm_i;

    %reflection method
    fr_i = kernestmirrorNor(sample,x,h);
    %storing estimate
    estimate_ref(i,:) = fr_i;

    %reflection and monotone rearrangement
    frm_i = mon_rearr(fr_i);
    frm_i = frm_i(1:length(x_0));
    %storing estimate
    estimate_ref_mon(i,l+1-length(x_0)+1:end) = frm_i;
end

%summing rows
cumsum = sum(estimate,1);
cumsum_mon = sum(estimate_mon,1);
cumsum_ref = sum(estimate_ref,1);
cumsum_ref_mon = sum(estimate_ref_mon,1);
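The computation of the MSE and MISE from the stored estimates is not shown above. A minimal sketch of how the MISE of, for example, the pure kernel estimates could be approximated (an assumed continuation, not the author's script; it uses the variables estimate, f, d and p defined above):

%approximate the integrated squared error of each stored estimate by a
%Riemann sum over the grid, then average over the p samples
sq_err = (estimate - repmat(f,p,1)).^2;   %squared error per sample and grid point
ISE = sum(sq_err,2) * d;                  %integrated squared error per sample
MISE = mean(ISE);                         %Monte Carlo approximation of the MISE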
