
Bandwidth Selection by Method of Moments Estimation in Kernel- and Spline-based Smoothing



Bandwidth Selection by Method of Moments

Estimation in Kernel- and Spline-based Smoothing

by Henny Holtman BSc.

A thesis presented for the degree of

Master of Science in Econometrics

written under the supervision of prof. dr. P.A. Bekker

Faculty of Economics and Business, Department of Econometrics

University of Groningen, The Netherlands


Abstract

Due to advances in statistics and computing, methods for non-parametric regression have become readily available and in Green and Silverman (1993) these methods are considered serious alternatives to more traditional parametric regressions. In this paper we discuss both spline and kernel regression and show in detail how one can obtain both an estimated non-parametric regression function and the corresponding point-wise confidence interval. In these regressions a bandwidth parameter has to be specified, and since a trial and error method is both subjective and time consuming, one generally relies on a data-driven bandwidth selection method. One of the most well-known bandwidth selection methods is (generalized) cross-validation, where the objective is to minimize the expected prediction error. In this thesis we introduce a new bandwidth selection method of the method of moments (MM) type. We investigate the performance of both bandwidth selection methods in spline and kernel regression by means of a Monte Carlo study. We find that in simple bivariate regressions generalized cross-validation often outperforms the MM-estimator. However, we find that the MM-estimator shows very good results when considering higher dimensional problems, compared to the more traditional method of generalized cross-validation.


Preface

This Master's thesis is the result of research conducted between May 2010 and January 2011 and is written in order to obtain the Master's degree in Econometrics, Operations Research and Actuarial Studies at the University of Groningen, The Netherlands. Since the completion of the Master's program also marks the end of my studies here in Groningen, I would like to use this opportunity to express my gratitude towards a number of people.

First of all I want to thank professor Paul Bekker for his invaluable help throughout the entire project. His enthusiasm for the subject was very inspiring and his comments and ideas were of great help. Moreover, I enjoyed our discussions on the subject and above all I appreciated the thoroughness with which he reviewed my work.

Also I would like to thank the entire staff of the Econometrics department of the University of Groningen. I appreciated your effort in teaching me about all kinds of interesting subjects and greatly value the fact that many of you were willing to help during and even outside the lectures. Additionally I render thanks to Laura Spierdijk for being the co-assessor and I thank Arne Wolters for all his useful comments on this thesis.

Due to my fellow students and friends here in Groningen I had a great time and hereby I also would like to thank all of you for the great moments we have had. Finally I would like to thank my parents and my wife Gerieke who all have been very supportive during my studies. In successfully completing both the Bachelor and Master’s program of Econometrics in Groningen, they all have been an indispensable factor.

Henny Holtman

Groningen, January 2011


Contents

1 Introduction
2 Spline Regression
  2.1 Introduction
  2.2 First-order Spline Regression
  2.3 Second-order Spline Regression
  2.4 Point-wise Confidence Intervals
3 Kernel Regression
  3.1 Introduction
  3.2 Local Constant Kernel Regression
  3.3 Local Linear Kernel Regression
4 Bandwidth Selection
  4.1 Introduction
  4.2 Cross-validation and Generalized Cross-validation
  4.3 MM-Estimator
6.3 Results
7 Conclusions
  7.1 Summary and Conclusions
  7.2 Recommendations for Further Research
Bibliography


Chapter 1

Introduction

In many fields of economics the relationship between a pair of variables is of great interest, think of price and demand, education and income, etc. Although these bivariate relationships may not adequately describe all real-world processes, the mathematical and statistical tools that are used for analyzing them are the basic building blocks for more complicated problems. Bivariate relationships can be examined using "standard" parametric methods, but every parametric method restricts the estimated regression function to be some known function of prespecified functional form. In non-parametric regression the data is allowed to "speak for itself" and thereby determines the functional form of the regression function, which is only required to be smooth.

One of the most well-known non-parametric methods is kernel regression, where the observations are weighted using a so-called "kernel function". Apart from the choice of the kernel function one also has to specify a so-called "bandwidth parameter" which determines the smoothness of the estimated regression function. From the literature it is known that the choice of the bandwidth is far more important than the choice of the kernel function, see amongst others Racine (2008). Another non-parametric method is spline regression, where a continuous piecewise polynomial is fitted to the data.

The problem of choosing a suitable bandwidth parameter or "smoothing parameter" is inevitable in any type of non-parametric regression and various methods for doing so exist. One of the most well-known bandwidth selection methods is (generalized) cross-validation,


which focuses on the mean prediction error. However, it has been shown in Faraway (2005) that generalized cross-validation may perform badly in the presence of outliers. In this paper we introduce an alternative estimator for bandwidth selection in non-parametric regression which is of a method-of-moments (MM) type and focuses on the estimated residual variance. We answer the following research question:

How does the performance of the MM-estimator compare to that of the standard generalized cross-validation in both spline and kernel regression, particularly considering normally distributed and fat-tailed distributions for the error term with various signal-to-noise ratios?

The signal-to-noise ratio is a term that stems from science and engineering and it quantifies how much a signal has been corrupted by noise. We will loosely use this term to indicate the relative size of disturbances.

In order to answer the above research question we will start with a discussion of the non-parametric methods themselves, as we need to fully understand these methods in order to apply data-driven bandwidth selection methods. In chapter 2 we show how one can perform a first- and second-order spline regression and obtain point-wise confidence intervals. Then in chapter 3 we consider both local constant and local linear kernel regression.

The discussion on bandwidth selection starts in chapter 4, where we first consider (generalized) cross-validation and thereafter introduce our MM-estimator. The performance of both generalized cross-validation and the MM-estimator is compared in a Monte Carlo study which can be found in chapter 5. Here we consider test functions that have been used in the non-parametric literature before and study various signal-to-noise ratios. We will also consider more than one distribution for the disturbances and use various measures for performance evaluation. In the study we will use both generalized cross-validation and the MM-estimator in first- and second-order spline regression, local constant kernel regression and local linear kernel regression.


Chapter 2

Spline Regression

2.1 Introduction

Although this thesis is concerned with curve estimation using non-parametric techniques, we will start by briefly discussing ordinary least squares regression. This estimation method is well-known and the reader may find it helpful to get acquainted with the framework presented here, as it can easily be transferred to the non-parametric estimation methods which are discussed later on.

Consider now the simple bivariate model that will provide us with a convenient framework for the discussion of regression analysis. Suppose that we have n observations of a continuous variable y, taken at predetermined values of the independent variable x. Let {x_i, y_i} with i = 1, ..., n be the corresponding values of x and y and assume that they are related according to the following model

y_i = m(x_i) + \varepsilon_i, \quad i = 1, ..., n,   (2.1)

where m is called the "regression function" and the \varepsilon_i are uncorrelated random variables with zero mean and common variance \sigma^2. The purpose of regression analysis in general is to estimate the unknown function m.

One of the most well-known and powerful estimating principles emerged in the early years of the nineteenth century and is known as “least squares regression”. The interested reader


may want to consult Stigler (1986) for a comprehensive history of the subject. In ordinary least squares regression one assumes that m is a linear function and we estimate the following model

y = \alpha\iota + \beta x + \varepsilon,   (2.2)

where we now use vector notation. Here we define y = (y_1, ..., y_n)^T and we use a similar notation for the vectors x and \varepsilon. The vector \iota denotes the n-vector of ones and the parameters \alpha and \beta in (2.2) ought to be estimated.

Although linear regression is very popular, the reader should observe that we implicitly assume the true relationship between x and y to be a linear one. This linearity assumption is quite restrictive since the relationship between the variables x and y may be very non-linear. By means of higher order polynomials and logarithms the framework can be adapted to allow for some curvature, but in any case the researcher restricts the estimated regression function to be some known function of prespecified restricted functional form whose parameters have to be estimated, see also DiNardo and Tobias (2001).

2.1.1 Non-parametric Regression

Unlike least squares regression, in non-parametric regression we allow the data to "speak for itself", i.e. the data is given the flexibility to determine the functional form of the estimated regression function. The objective in non-parametric regression is thus to estimate the unknown function m in (2.1) directly, rather than to estimate some parameters. In most non-parametric methods it is assumed that m is a continuous and smooth function of the independent variable x.


2.2 First-order Spline Regression

Consider again the model in (2.1) and imagine what the result would be if we were to estimate this model by minimizing the least-squares criterion without any restriction on the functional form of the estimated regression function. In that case we minimize

\sum_{i=1}^{n} \left(y_i - h(x_i)\right)^2,

where h is the estimated regression function. It is immediate that the residual sum of squares, which is a standard measure for goodness of fit, can be minimized to zero by any function that interpolates the data. See for example figure 2.1 where we used simulated data from Härdle (1991) and linear interpolation.

Figure 2.1 – Artificial data example A from Härdle (1991), showing varying amounts of curvature, two extrema and an inflexion point. The dashed black line represents the known underlying function and the data points are indicated by crosses. Minimizing the residual sum of squares to zero leads to an interpolation of the data, as shown by the solid blue line.


Before we continue, we first need to quantify the notion of the roughness of a curve. Then, for a candidate estimator h, we can add this “roughness penalty” to the residual sum of squares and obtain an overall assessment of its quality, see amongst others Eubank (1999). This form of spline regression is due to Schoenberg (1964) and Reinsch (1967), but the basic idea of adding a roughness penalty to a goodness of fit measure was already formulated in Whittaker (1923) where this smoothing process was called a “graduation” or “adjustment” of the observations, see also Härdle (1994).

Following Green and Silverman (1993) we will now assume that the estimated regression function h is a continuous and differentiable function on the interval [a,b] and we define the roughness penalty as its integrated squared derivative, i.e.

\int_a^b \left(h'(x)\right)^2 dx.

The overall quality of a candidate estimator h is now measured by the so-called “penalized sum of squares” which is defined for any differentiable function h on [a,b] and is given by

S_\lambda(h) = \sum_{i=1}^{n} \left(y_i - h(x_i)\right)^2 + \lambda \int_a^b \left(h'(x)\right)^2 dx.   (2.3)

The parameter \lambda > 0 in (2.3) governs the trade-off between the goodness of fit, as measured by the sum of the squared deviations between y_i and h(x_i), and the smoothness of h, which is measured by the roughness penalty. The larger the value of \lambda, the more premium is placed on the smoothness of the curve estimate and potential estimators with too much rapid local variation are penalized; that is why we will refer to \lambda as the "smoothing parameter".


Definition 2.1 A sparse matrix whose non-zero entries are confined to some small number of diagonals is called a band matrix. We say that a matrix M has lower bandwidth k_1 if m_{ij} = 0 whenever i > j + k_1 and upper bandwidth k_2 if j > i + k_2 implies m_{ij} = 0, see also Golub and Van Loan (1996).

The number of non-zero diagonals in M is given by k_1 + k_2 + 1 and is called the bandwidth of the matrix; if M is symmetric then k_1 = k_2.

Without loss of generality we assume x_1 \le x_2 \le ... \le x_n and we define \Delta_i = \sqrt{x_{i+1} - x_i} for i = 1, ..., n - 1, the square root of the first differences of x, and we let \Delta be the diagonal matrix of size n - 1 with elements \Delta_1, ..., \Delta_{n-1}. We also define the band matrix F of size n \times (n - 1), with -1 on its main diagonal and +1 on the diagonal directly below it, as follows

F = \begin{pmatrix} -1 & 0 & \cdots & 0 \\ 1 & -1 & & \vdots \\ 0 & 1 & \ddots & 0 \\ \vdots & & \ddots & -1 \\ 0 & \cdots & 0 & 1 \end{pmatrix}.

Now we can define the band matrix K, which can easily be shown to have a bandwidth equal to 3, as follows

K = F \Delta^{-2} F^T,

where we define \Delta^{-2} to be the diagonal matrix of size n - 1 with elements \Delta_i^{-2}, i = 1, ..., n - 1. Then, following Green and Silverman (1993), we can use the following matrix representation of the penalized sum of squares in (2.3)

S_\lambda(h) = (y - h)^T (y - h) + \lambda h^T K h,   (2.4)

where h = (h(x_1), ..., h(x_n))^T, the vector of values of the estimated regression function h evaluated at x_1, ..., x_n. Differentiating (2.4) with respect to the vector h and equating the result to zero gives the following expression for the regression function

h = M_\lambda y,   (2.5)

where M_\lambda = (I + \lambda F \Delta^{-2} F^T)^{-1} and I denotes the identity matrix of size n. In the rest of this paper we will generally not report the size of identity matrices since this should follow immediately from the context.

The estimated regression function may now be obtained by using the Cholesky decomposition that we will use in case of second-order spline regression, which is to be discussed in section 2.3, but using the following theorem and the LU-decomposition may be a more efficient approach.

Theorem 2.1 If W is a square matrix of size n, X is an n \times m matrix, Y is a square matrix of size m and Z is an m \times n matrix, then if W and Y are nonsingular

(W + XYZ)^{-1} = W^{-1} - W^{-1} X \left(Z W^{-1} X + Y^{-1}\right)^{-1} Z W^{-1}.

A proof of theorem 2.1 may be found in appendix B. Using the above theorem and defining W as the identity matrix of size n, X = F, Z = F^T and Y = \lambda \Delta^{-2}, we find that the difference between y and the values of the estimated regression function h is given by

y - h = F \left(\lambda^{-1} \Delta^2 + F^T F\right)^{-1} F^T y.

A side result of applying Theorem 2.1 is that it provides us with a nice solution to the problem of identical values of the independent variable. Whereas in (2.5) identical values would lead to a division by zero, we have now removed the need to compute the reciprocal of the squared diagonal elements of \Delta. Therefore, in first-order spline regression, multiple values of the independent variable are allowed without further adaptation.

The unknown differences y - h can be obtained by using the LU-decomposition which is described in Atkinson (1989). Here we suppose that e and f are n-vectors with unknown elements e_i and known elements f_i respectively, and D is some known band matrix with 3 diagonals satisfying e = D^{-1} f. Then the LU-decomposition computes the unknown vector e as a function of f and the main diagonal of D as follows


LU-decomposition

Step 1: \beta_1 = d_1, \quad \beta_i = d_i - 1/\beta_{i-1} for i = 2, ..., n
Step 2: \gamma_1 = f_1/\beta_1, \quad \gamma_i = (f_i + \gamma_{i-1})/\beta_i for i = 2, ..., n
Step 3: e_n = \gamma_n, \quad e_i = \gamma_i + e_{i+1}/\beta_i for i = n - 1, ..., 1.

Applying the LU-decomposition and pre-multiplying the result by the matrix F gives us the differences between y and h. Subtracting these differences from y yields the estimated values of the first-order spline regression function h.
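As an illustration of the computations above, the following is a minimal numerical sketch in Python. It uses dense linear algebra instead of the banded recursion, and the function name and simulated data are ours, chosen purely for illustration.

```python
import numpy as np

def first_order_spline(x, y, lam):
    """First-order smoothing spline via y - h = F (lam^{-1} Delta^2 + F^T F)^{-1} F^T y.

    Dense solve for clarity; the banded LU recursion described in the text
    is the efficient alternative.  Assumes x is sorted in non-decreasing order.
    """
    n = len(x)
    d2 = np.diff(x)                      # Delta_i^2 = x_{i+1} - x_i
    F = np.zeros((n, n - 1))
    idx = np.arange(n - 1)
    F[idx, idx] = -1.0                   # -1 on the main diagonal
    F[idx + 1, idx] = 1.0                # +1 on the diagonal below it
    A = np.diag(d2) / lam + F.T @ F      # lam^{-1} Delta^2 + F^T F (tridiagonal)
    resid = F @ np.linalg.solve(A, F.T @ y)
    return y - resid                     # fitted values h

# toy usage on simulated data (illustrative only)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(100)
h = first_order_spline(x, y, lam=0.01)
```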

2.3 Second-order Spline Regression

In the previous section, where we discussed first-order spline regression, the roughness penalty was based on the first derivative of the estimated regression function h. In this section we will use the more common second-order or "cubic" spline regression, where the roughness penalty is defined by the integrated squared second derivative. For a twice differentiable function h on the interval [a, b] the roughness penalty is now defined as

\int_a^b \left(h''(x)\right)^2 dx.

In order to obtain an overall assessment of the quality of a candidate estimator h we now, similarly as in the case of first-order spline regression, add the roughness penalty to the residual sum of squares and obtain the penalized sum of squares, i.e.

S_\lambda(h) = \sum_{i=1}^{n} \left(y_i - h(x_i)\right)^2 + \lambda \int_a^b \left(h''(x)\right)^2 dx.   (2.6)


Definition 2.2 Let x_1 < x_2 < ... < x_n be a set of ordered points called "knots" contained in some bounded interval [a, b]. Note that we implicitly assume that all values of the independent variable are unique; this assumption will be relaxed later on. A cubic spline is a continuous function h such that (i) h is a cubic polynomial over each of the intervals (a, x_1), (x_1, x_2), ..., (x_{n-1}, x_n), (x_n, b) and (ii) these polynomial pieces fit together at the knots in such a way that h itself and its first and second derivatives are continuous at each x_i.

When moreover the second and third derivatives of h are zero at a and b, the cubic spline is said to be a natural cubic spline (NCS). These conditions are called the natural boundary conditions.

Many essentially equivalent representations of the NCS exist. One obvious way of specifying a cubic spline is to give all coefficients for each of the n + 1 cubic pieces, i.e. to specify

h(x) = \alpha_i + \beta_i (x - x_i) + \gamma_i (x - x_i)^2 + \delta_i (x - x_i)^3 \quad for x_i \le x \le x_{i+1},   (2.7)

with given constants \alpha_i, \beta_i, \gamma_i, \delta_i and i = 0, ..., n, where we define x_0 = a and x_{n+1} = b. Note that under the natural boundary conditions we have that \gamma_0, \delta_0, \gamma_n and \delta_n are all zero, so that h is linear on the intervals [a, x_1] and [x_n, b].

However, it turns out that the representation of a NCS in (2.7) is computationally not very convenient and therefore we will use the "value-second derivative representation" that is suggested in Green and Silverman (1993). Suppose h is a NCS with knots x_1 < ... < x_n and define h_i = h(x_i) and \theta_i = h''(x_i) for i = 1, ..., n. Under the natural boundary conditions we have that \theta_1 = \theta_n = 0. Next we define the vector h as (h_1, ..., h_n)^T and we define \theta to be the (n - 2)-vector (\theta_2, ..., \theta_{n-1})^T. The vector h and the second derivatives in \theta specify the estimated regression function h completely.


Let us define \Delta_i = x_{i+1} - x_i for i = 1, ..., n - 1 and let Q be the n \times (n - 2) matrix with entries q_{ij} for i = 1, ..., n and j = 2, ..., n - 1, where

q_{j-1,j} = \Delta_{j-1}^{-1}, \qquad q_{jj} = -\Delta_{j-1}^{-1} - \Delta_j^{-1}, \qquad q_{j+1,j} = \Delta_j^{-1}

for j = 2, ..., n - 1 and q_{ij} = 0 for |i - j| \ge 2. Note that, similar to the vector \theta, the columns of Q are numbered in a non-standard way, starting at j = 2, so that the top-left element of Q is given by q_{12}.

Next we define the symmetric matrix R to be of size (n - 2) \times (n - 2) with elements r_{ij} for i, j = 2, ..., n - 1, given by

r_{i,i} = (\Delta_{i-1} + \Delta_i)/3 \quad for i = 2, ..., n - 1,
r_{i,i+1} = r_{i+1,i} = \Delta_i/6 \quad for i = 2, ..., n - 2,

and r_{ij} = 0 for |i - j| \ge 2. The matrix R is strictly diagonally dominant and from standard arguments in numerical linear algebra it follows that R is strictly positive-definite, see e.g. Todd (1962). Therefore we can define the matrix K as follows

K = Q R^{-1} Q^T.

The necessary and sufficient condition for the vectors h and \theta to specify a true NCS is stated in the following theorem.

Theorem 2.2 The vectors h and \theta genuinely represent a natural cubic spline h if and only if the following condition holds:

Q^T h = R\theta.   (2.8)

In that case the roughness penalty will satisfy

\int_a^b \left(h''(x)\right)^2 dx = \theta^T R \theta = h^T K h.

The interested reader may want to consult Green and Silverman (1993) for a proof of the above theorem. Suppose now that the vectors h and \theta satisfy the condition in (2.8); then we can use the following matrix representation of the penalized sum of squares

S_\lambda(h) = (y - h)^T (y - h) + \lambda h^T K h,   (2.9)


where again y is defined as (y_1, ..., y_n)^T. We will temporarily assume that a value for \lambda has already been selected, so that the smoothing parameter is fixed; in chapter 4 we will discuss how one can select a suitable value for \lambda. We now would like to minimize the penalized sum of squares over all twice-differentiable functions on [a, b]. To that end we differentiate (2.9) with respect to the vector h and equate the result to zero. Using simple matrix algebra we then deduce that

y = (I + \lambda K) h = \left(I + \lambda Q R^{-1} Q^T\right) h.   (2.10)

Equivalently, the estimated regression function h is given by h = M_\lambda y, where the hat matrix is now defined as M_\lambda = (I + \lambda Q R^{-1} Q^T)^{-1}. Solving (2.10) for h and using the equality in (2.8) twice, we find that R\theta = Q^T y - \lambda Q^T Q \theta and therefore \theta is given by the solution of the following system of linear equations

\left(R + \lambda Q^T Q\right) \theta = Q^T y,   (2.11)

where the matrix R + \lambda Q^T Q and the vector Q^T y are known. When (2.11) is solved for \theta, the vector h is easily calculated as h = y - \lambda Q \theta and we have obtained the NCS for a given value of the smoothing parameter \lambda.
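As a concrete illustration, a dense Python sketch of this procedure is given below: it builds Q and R as defined above and solves (2.11) with a generic solver, which is adequate for moderate n. The function name is ours, and the sketch assumes distinct, sorted x-values.

```python
import numpy as np

def cubic_smoothing_spline(x, y, lam):
    """Natural cubic smoothing spline: solve (R + lam Q^T Q) theta = Q^T y,
    then h = y - lam Q theta.  Dense sketch; the banded Cholesky (Reinsch)
    route described next is the efficient version.  Assumes distinct, sorted x.
    """
    n = len(x)
    d = np.diff(x)                                   # Delta_i = x_{i+1} - x_i
    Q = np.zeros((n, n - 2))
    R = np.zeros((n - 2, n - 2))
    for j in range(n - 2):                           # column j corresponds to knot j+2 (1-based)
        Q[j, j] = 1.0 / d[j]
        Q[j + 1, j] = -1.0 / d[j] - 1.0 / d[j + 1]
        Q[j + 2, j] = 1.0 / d[j + 1]
        R[j, j] = (d[j] + d[j + 1]) / 3.0
        if j < n - 3:
            R[j, j + 1] = R[j + 1, j] = d[j + 1] / 6.0
    theta = np.linalg.solve(R + lam * Q.T @ Q, Q.T @ y)   # second derivatives at interior knots
    return y - lam * Q @ theta                            # fitted values h
```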

The system of linear equations in (2.11) is of the form A\theta = b, where A = R + \lambda Q^T Q and b = Q^T y. The corresponding solution is given by \theta = A^{-1} b. However, explicit computation of the inverse of a matrix is prone to serious numerical inaccuracies. Therefore a linear system of equations should not be solved by multiplying the inverse of the matrix A by the known right-hand-side vector b, see e.g. Gentle (1998). In the next subsection we will introduce a more efficient method to solve (2.11) for \theta that is due to Reinsch (1967).

2.3.1 Reinsch Algorithm


Consider now the square matrix A of size n - 2 and the vector b of length n - 2 as they were defined in the previous section, i.e. A = R + \lambda Q^T Q and b = Q^T y. One can easily verify that the matrix A is a band matrix with five non-zero diagonals and is moreover symmetric and strictly positive-definite. Therefore we can apply the Cholesky decomposition, see amongst others Gentle (1998) and Nash (1990). Using this decomposition we have that

A = L D L^T,

where D is a diagonal matrix with strictly positive diagonal elements and L is lower triangular with all diagonal elements equal to 1 and lower bandwidth 2. Both matrices L and D are square matrices of size n - 2.

Before we proceed let us first introduce a convenient notation. For a given matrix M the main diagonal will be denoted by [m_0]. The first element of [m_0] will be denoted by [m_0]_1 and this is the top-left element of M. Similarly the k-th upper and lower diagonals are denoted by [m_{+k}] and [m_{-k}] respectively. In case M is symmetric, the upper and lower diagonals are equal and we will omit the corresponding sign.

Following the approach described in Green and Silverman (1993) we can now compute the non-zero elements of D and L as follows

[d_0]_1 = [a_0]_1, \quad [l_{-1}]_1 = [a_1]_1 / [d_0]_1, \quad [d_0]_2 = [a_0]_2 - [l_{-1}]_1^2 [d_0]_1

and for i = 3, ..., n - 2 successively

[l_{-2}]_{i-2} = [a_2]_{i-2} / [d_0]_{i-2}
[l_{-1}]_{i-1} = \left([a_1]_{i-1} - [l_{-1}]_{i-2} [l_{-2}]_{i-2} [d_0]_{i-2}\right) / [d_0]_{i-1}
[d_0]_i = [a_0]_i - [l_{-1}]_{i-1}^2 [d_0]_{i-1} - [l_{-2}]_{i-2}^2 [d_0]_{i-2}.

Recall that in the previous section we encountered the problem of solving the system of linear equations A\theta = b for the unknown (n - 2)-vector \theta. The Cholesky decomposition of A above is useful for solving these equations very quickly. We now introduce the vectors u and v satisfying

L u = b, \qquad D v = u, \qquad L^T \theta = v,

where the components of u are found by computing u_1 = b_1, u_2 = b_2 - [l_{-1}]_1 u_1 and u_i = b_i - [l_{-1}]_{i-1} u_{i-1} - [l_{-2}]_{i-2} u_{i-2} for i = 3, ..., n - 2.


Next we define v_i = u_i / [d_0]_i for each i = 1, ..., n - 2 and compute the elements of \theta as follows: \theta_{n-2} = v_{n-2}, \theta_{n-3} = v_{n-3} - [l_{-1}]_{n-3} \theta_{n-2} and for i = 4, ..., n - 1

\theta_{n-i} = v_{n-i} - [l_{-1}]_{n-i} \theta_{n-i+1} - [l_{-2}]_{n-i} \theta_{n-i+2}.

We have now obtained the vector of second derivatives \theta, and one may recall from the previous section that we can now find the estimated second-order spline regression by computing h = y - \lambda Q \theta.
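The recursions of this subsection can be collected into a small banded solver. The sketch below assumes the three central diagonals of A are supplied as vectors; it is an illustration of the algorithm rather than an optimized implementation, and the function name is ours.

```python
import numpy as np

def solve_banded_ldlt(a0, a1, a2, b):
    """Solve A theta = b for a symmetric positive-definite band matrix A with
    main diagonal a0, first off-diagonal a1 and second off-diagonal a2,
    via the LDL^T recursions of this subsection.  No pivoting or checks."""
    n = len(a0)
    d = np.zeros(n)            # diagonal of D
    l1 = np.zeros(n - 1)       # first sub-diagonal of L
    l2 = np.zeros(n - 2)       # second sub-diagonal of L
    d[0] = a0[0]
    l1[0] = a1[0] / d[0]
    d[1] = a0[1] - l1[0] ** 2 * d[0]
    for i in range(2, n):
        l2[i - 2] = a2[i - 2] / d[i - 2]
        l1[i - 1] = (a1[i - 1] - l1[i - 2] * l2[i - 2] * d[i - 2]) / d[i - 1]
        d[i] = a0[i] - l1[i - 1] ** 2 * d[i - 1] - l2[i - 2] ** 2 * d[i - 2]
    u = np.zeros(n)            # forward substitution L u = b
    u[0] = b[0]
    u[1] = b[1] - l1[0] * u[0]
    for i in range(2, n):
        u[i] = b[i] - l1[i - 1] * u[i - 1] - l2[i - 2] * u[i - 2]
    v = u / d                  # D v = u
    theta = np.zeros(n)        # back substitution L^T theta = v
    theta[-1] = v[-1]
    theta[-2] = v[-2] - l1[-1] * theta[-1]
    for i in range(n - 3, -1, -1):
        theta[i] = v[i] - l1[i] * theta[i + 1] - l2[i] * theta[i + 2]
    return theta
```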

2.3.2 Duplicate Values for the Independent Variable

Until now we assumed that the knots of the NCS were ordered such that x_1 < x_2 < ... < x_n. In this section we will relax this assumption by allowing for duplicate values of the independent variable. If one or more values of x correspond to multiple values of y, i.e. if we lose their one-to-one correspondence, then we have to order the independent variable such that x_1 \le x_2 \le ... \le x_n with x_i = x_{i+1} for some value(s) of i. In that case the reciprocal of the first difference x_{i+1} - x_i is no longer defined and as a result the matrix Q is undefined. Therefore we will generally write

x = H z,

where x = (x_1, ..., x_n)^T and z is the ordered m-vector containing the unique x-values. The matrix H will be defined shortly and obviously we have that m \le n.

In the following we will assume that g is a NCS with knots at the z_i and we define the vector g = (g_1, ..., g_m)^T where g_i = g(z_i); the corresponding vector of second derivatives is again given by the (m - 2)-vector \theta. In this section we will show how to obtain the vector g using the data {z, H^T y}; the n-vector containing the estimated regression function corresponding to the original vector y can then be obtained by computing h = H g.

Consider now the n-vector w with w_1 = 1 and where w_i = 1 if x_i - x_{i-1} > 0 and zero otherwise, for i = 2, ..., n. Then the (n \times m)-matrix H is defined by

h_{ij} = 1 \text{ if } j = \sum_{k=1}^{i} w_k \quad \text{and} \quad h_{ij} = 0 \text{ otherwise},

for i = 1, ..., n and j = 1, ..., m, i.e. h_{ij} = 1 precisely when x_i = z_j. Note that when all x-values are unique, i.e. m = n, the matrix H is equal to the identity matrix of size n and the equations in this section simplify considerably.

The matrix H^T H is a diagonal matrix of size m with elements c_i, the number of times the element z_i is contained in x. Now we define \Delta_i = z_{i+1} - z_i for i = 1, ..., m - 1 and let the matrices Q and R be defined as before but now using these \Delta_i instead of the first differences of x. Doing so yields an m \times (m - 2) matrix Q and a square matrix R of size m - 2.

The penalized sum of squares can then be expressed as follows

S_\lambda(g) = (y - Hg)^T (y - Hg) + \lambda g^T K g,   (2.12)

where we again define K = Q R^{-1} Q^T. Differentiating (2.12) with respect to the vector g and setting the result equal to zero gives us the following equality

H^T y = \left(H^T H + \lambda Q R^{-1} Q^T\right) g.   (2.13)

We now would like to solve (2.13) for g. Similar to the equality in (2.8) we now have that Q^T g = R\theta. Solving (2.13) for g can be done similarly as before and we obtain that \theta is given by the solution of

\left(R + \lambda Q^T (H^T H)^{-1} Q\right) \theta = Q^T (H^T H)^{-1} H^T y,   (2.14)

which can be found by using the Cholesky decomposition from subsection 2.3.1. When we have calculated the vector of second derivatives \theta, the values of the NCS can be obtained by computing

h = H (H^T H)^{-1} H^T y - \lambda H (H^T H)^{-1} Q \theta.   (2.15)
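A possible construction of the matrix H is sketched below; the helper name is ours, and the unique values are obtained with a library routine rather than through the vector w, which yields the same matrix.

```python
import numpy as np

def incidence_matrix(x):
    """Build the n x m matrix H with x = H z, where z holds the sorted
    unique x-values; H^T H is then diagonal with the frequencies c_i."""
    x = np.asarray(x, dtype=float)
    z, inv = np.unique(x, return_inverse=True)   # z sorted and unique
    H = np.zeros((len(x), len(z)))
    H[np.arange(len(x)), inv] = 1.0              # H[i, j] = 1 iff x_i = z_j
    return z, H

# the spline is then fitted on the knots z with data H^T y, and the fit is
# mapped back to the original observations through h = H g
```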

2.4 Point-wise Confidence Intervals


We will use a Bayesian approach to variance estimation that was suggested in Wood (2006), so that confidence intervals with good coverage probabilities are easily obtained. Using the equalities in (2.14) and (2.15) we obtain that the estimated regression function h is given by h = M_\lambda y, where the hat matrix M_\lambda is given by

M_\lambda = H (H^T H)^{-1} \left(I - \lambda Q \left(R + \lambda Q^T (H^T H)^{-1} Q\right)^{-1} Q^T (H^T H)^{-1}\right) H^T.   (2.16)

Since M_\lambda is a non-stochastic matrix and y is a random vector, we obtain the following covariance matrix

Cov(h) = \sigma^2 M M^T,   (2.17)

where we temporarily omit the dependence on \lambda, and where \sigma^2 is to be estimated, see also Eubank (1984). In order to estimate \sigma^2 we will need a measure for the number of degrees of freedom. In Green and Silverman (1993) it is argued that in parametric regression the model degrees of freedom equal the trace of the hat matrix. Analogously, in non-parametric regression we can define the "equivalent degrees of freedom for noise" (EDF) as the trace of the matrix I - M_\lambda. In order to evaluate the trace of M_\lambda we will have to find the diagonal elements of the matrix A^{-1}, where A is now defined as

A = R + \lambda Q^T (H^T H)^{-1} Q.

Note that A is a symmetric band matrix of size m - 2 with five non-zero diagonals. As in subsection 2.3.1 we will use the Cholesky decomposition and write A = L D L^T. Computing the full inverse of A as A^{-1} = (L^T)^{-1} D^{-1} L^{-1} is computationally expensive, see Green and Silverman (1993). However, finding the central five diagonals of A^{-1}, which are of particular interest here, can be done in linear time by use of the algorithm proposed by Hutchinson and de Hoog (1985).


The five diagonals of A^{-1} can be found by computing all elements in the appropriate order, where [a^{-1}_0], [a^{-1}_1] and [a^{-1}_2] denote the main, first and second diagonals of A^{-1}. We start with the bottom-right elements of A^{-1}, i.e. we calculate

[a^{-1}_0]_{m-2} = 1/[d_0]_{m-2}
[a^{-1}_1]_{m-3} = -[l_{-1}]_{m-3} [a^{-1}_0]_{m-2}
[a^{-1}_0]_{m-3} = 1/[d_0]_{m-3} - [l_{-1}]_{m-3} [a^{-1}_1]_{m-3}.

For i = 4, ..., m - 2 we compute

[a^{-1}_2]_{m-i} = -[l_{-2}]_{m-i} [a^{-1}_0]_{m-i+2} - [l_{-1}]_{m-i} [a^{-1}_1]_{m-i+1}
[a^{-1}_1]_{m-i} = -[l_{-1}]_{m-i} [a^{-1}_0]_{m-i+1} - [l_{-2}]_{m-i} [a^{-1}_1]_{m-i+1}
[a^{-1}_0]_{m-i} = 1/[d_0]_{m-i} - [l_{-1}]_{m-i} [a^{-1}_1]_{m-i} - [l_{-2}]_{m-i} [a^{-1}_2]_{m-i}.

Then we can compute the remaining entries of the diagonals of A^{-1} by

[a^{-1}_2]_1 = -[l_{-2}]_1 [a^{-1}_0]_3 - [l_{-1}]_1 [a^{-1}_1]_2
[a^{-1}_1]_1 = -[l_{-1}]_1 [a^{-1}_0]_2 - [l_{-2}]_1 [a^{-1}_1]_2
[a^{-1}_0]_1 = 1/[d_0]_1 - [l_{-1}]_1 [a^{-1}_1]_1 - [l_{-2}]_1 [a^{-1}_2]_1.

Now we have obtained all values of the (m - 2)-vector [a^{-1}_0], i.e. we have found all values of the main diagonal of A^{-1}. Next we have to find the diagonal elements of the matrix Q A^{-1} Q^T; we temporarily denote this diagonal m-vector by s. Let [q_0] denote the main diagonal of the matrix Q and recall that the top-left element of Q is given by q_{12}. Similarly [q_1] and [q_2] denote the first and second lower diagonals of Q. The elements of the diagonal vector s can be computed from these diagonals of Q together with the central diagonals of A^{-1}; the last two diagonal elements, for instance, are given by

s_{m-1} = [q_2]_{m-3} \left([a^{-1}_1]_{m-3} [q_1]_{m-2} + [a^{-1}_0]_{m-3} [q_2]_{m-3}\right) + [q_1]_{m-2} \left([a^{-1}_0]_{m-2} [q_1]_{m-2} + [a^{-1}_1]_{m-3} [q_2]_{m-3}\right)
s_m = [q_2]_{m-2} \left([a^{-1}_0]_{m-2} [q_2]_{m-2}\right).

Now we have all diagonal elements of the matrix Q A^{-1} Q^T. The diagonal elements of the hat matrix M_\lambda in (2.16) can now be obtained easily. From subsection 2.3.2 one may recall that the matrix H^T H is a diagonal matrix with elements c_i, the frequencies of the unique x-values. All the unique diagonal elements of the hat matrix can be obtained by computing c_i^{-1} (1 - \lambda s_i / c_i) for i = 1, ..., m. In the main diagonal of M_\lambda each unique element is simply repeated c_i times, so that M_\lambda is a square n \times n matrix. The equivalent degrees of freedom for noise (EDF) is now defined as

df = n - tr(M_\lambda),

where tr(\cdot) is the trace operator. Note that the EDF depends on the value of the smoothing parameter: it increases from n - m when \lambda = 0 (i.e. interpolation) to n - 2 for \lambda = \infty. In the latter case the spline regression is similar to an ordinary least squares regression, since any value of the integrated squared second derivative is penalized infinitely. Based on the residual sum of squares we can define the estimate for \sigma^2, i.e.

\hat{\sigma}^2 = \frac{(y - h)^T (y - h)}{df}.   (2.18)

Based on (2.17) and the estimated value of \sigma^2 we can calculate the standard errors of the estimated smoothing spline at the design points. An approximate 95% point-wise confidence interval for the expected value of h(x_i) is then given by

I_{95} = h(x_i) \pm 1.96 \sqrt{\hat{\sigma}^2 \left(M M^T\right)_{ii}}, \quad i = 1, ..., n.   (2.19)
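Putting these pieces together, a dense sketch of the interval computation might look as follows; it forms the full hat matrix explicitly instead of using the banded quantities above, the function name is ours, and it assumes the residual variance estimate of (2.18).

```python
import numpy as np

def pointwise_ci(y, M, z=1.96):
    """Approximate point-wise confidence interval for a linear smoother
    h = M y, following (2.17)-(2.19).  Dense sketch using the full hat
    matrix M rather than the banded computations described in the text."""
    n = len(y)
    h = M @ y
    df = n - np.trace(M)                     # equivalent degrees of freedom for noise
    sigma2 = (y - h) @ (y - h) / df          # residual variance estimate (2.18)
    se = np.sqrt(sigma2 * np.diag(M @ M.T))  # standard errors at the design points
    return h - z * se, h + z * se            # lower and upper bounds (2.19)
```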


Chapter 3

Kernel Regression

3.1 Introduction

Whereas in spline regression the observations were weighted so as to minimize the penalized sum of squared residuals, in kernel regression the observations are weighted using a so-called "kernel function". This approach, which might be considered conceptually simpler, typically uses a continuous, bounded and symmetric real-valued function K, also known as the "kernel", which integrates to unity.

In order to explain the idea of kernel regression, let again {x_i, y_i} with i = 1, ..., n be the values of x and y. We again assume that they are related according to the model in (2.1), i.e.

y_i = m(x_i) + \varepsilon_i, \quad i = 1, ..., n.

We now would like to estimate the unknown underlying regression function m. In kernel regression we try to estimate m(x_t) by considering a weighted mean of the y-values in the vicinity of x_t. These weights are assigned by the kernel function and decrease with the distance between x_t and the corresponding data point.

A variety of possible kernel functions is available, but in Faraway (2005) it is argued that the kernel function is best chosen to be some smooth and compact real-valued function, since a smooth kernel function ensures that the resulting estimated regression function is also smooth, a feature which is often to be preferred.


Secondly, a kernel function that is not compact (e.g. Gaussian) will yield small but positive weights for distant observations. Computationally these small values are best avoided since they may cause a numerical underflow. Moreover, a non-compact kernel requires the computation of the contribution of every single observation, whereas a potentially large number of observations will have zero contribution when using a compact kernel function, see also Härdle (1994).

Considering the above we will opt for the commonly used Epanechnikov kernel function, which enjoys some optimality properties, see also Epanechnikov (1969) and Bartlett (1963). The Epanechnikov kernel function, which can be shown to have optimal efficiency and which will be used throughout this paper, is defined by

K(x) = \begin{cases} 0.75\,(1 - x^2) & \text{for } |x| \le 1 \\ 0 & \text{otherwise.} \end{cases}
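For reference, this kernel is a one-line function in code (the name is ours):

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel: K(u) = 0.75 (1 - u^2) for |u| <= 1, zero otherwise."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)
```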

We will use this kernel function in the upcoming two sections where we discuss local constant and local linear kernel regression. In Faraway (2005) it is argued that the particular choice of the kernel function is not very important, since any sensible choice of kernel function will yield an acceptable estimate of the regression function. The choice of the bandwidth parameter, on the other hand, which will be discussed in chapter 4, is critical to the performance of the estimator.

3.2 Local Constant Kernel Regression

In this section we will show how one can obtain an estimate of the underlying but unknown function m using the well-known local constant kernel estimator. This non-parametric method is also known as the "Nadaraya-Watson" estimator and a rich literature on this subject is available. A kernel method which has attracted attention more recently is local linear kernel regression; this method is discussed in section 3.3, see amongst others Fan (1993), Ruppert and Wand (1994) and Masry (1996).


Following Härdle (2004) we will formulate the local constant kernel estimate at the point x_t as the solution of the following minimization problem

\min_{\beta_0} \sum_{i=1}^{n} (y_i - \beta_0)^2 K_\lambda(x_t - x_i),   (3.1)

where we now use the notation K_\lambda(x) = K(\lambda^{-1} x). Note that in computing the local constant kernel estimate at the point x_t we use weights that decrease with the distance between the data point and x_t. The solution of the minimization problem in (3.1) is given by the well-known weighted least squares solution where the i-th observation receives the weight K_\lambda(x_t - x_i). Let X denote the (n \times 1)-vector of ones and let W be the diagonal weight matrix of size n whose i-th diagonal element is given by W_{ii} = K_\lambda(x_t - x_i). Then (3.1) is minimized by the usual formula for a weighted least squares estimator

\hat{\beta}_0(x_t) = \left(X^T W X\right)^{-1} X^T W y,

where again y = (y_1, ..., y_n)^T. The local constant kernel estimator of the regression function m at the point x_t is now given by

h(x_t) = \hat{\beta}_0(x_t).

The reader can easily verify that the approach in this section is equivalent to the previously mentioned Nadaraya-Watson estimator as it was proposed by Nadaraya (1964) and Watson (1964).

The entire curve h can now be obtained by repeating the above computations for all values of x_t, 1 \le t \le n. Let Z_t now denote the row vector (X^T W X)^{-1} X^T W, which is of size (1 \times n), and observe that this row vector depends on x_t through the weight matrix W. We can now define the hat matrix M_\lambda as follows

M_\lambda = \begin{pmatrix} Z_1 \\ Z_2 \\ \vdots \\ Z_n \end{pmatrix}.
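Since X is here a vector of ones, the row Z_t reduces to the kernel weights normalised to sum to one, so the whole hat matrix can be sketched compactly (function names are ours; the kernel is the Epanechnikov function defined above):

```python
import numpy as np

def epanechnikov(u):
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u ** 2), 0.0)

def local_constant_hat_matrix(x, lam):
    """Hat matrix of the local constant (Nadaraya-Watson) estimator: row t
    holds the weights K_lam(x_t - x_i), normalised to sum to one.  Assumes
    lam > 0, so that each point at least receives its own weight K(0)."""
    x = np.asarray(x, dtype=float)
    W = epanechnikov((x[:, None] - x[None, :]) / lam)   # W[t, i] = K_lam(x_t - x_i)
    return W / W.sum(axis=1, keepdims=True)             # Z_t = (X^T W X)^{-1} X^T W with X = iota

# the estimated regression function at the design points is h = M @ y
```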


To compute the point-wise confidence intervals discussed in section 2.4, one will need the diagonal elements of the hat matrix. These diagonal elements are also known as the "leverage values", since they indicate how much influence each element y_i has on its own prediction, see also Eubank (1984). Therefore the leverage value of element y_t is given by the t-th element of the row vector Z_t or, equivalently, the t-th diagonal element of M_\lambda. Once the estimated regression function and all leverage values have been obtained, the confidence intervals can be computed as described in section 2.4.

3.3 Local Linear Kernel Regression

Whereas in the previous section we used local constant kernel regression, in this section we will use the more general local linear kernel regression. The idea of local linear regression is to fit a straight line for every x_t, 1 \le t \le n, and weight all data points by how close they are to x_t. Luckily, the approach from the previous section can easily be generalized to local linear and higher order local polynomial kernel regressions.

Similarly to local constant kernel regression, we formulate the local linear kernel estimator as the solution of a minimization problem. Using the data {x_i, y_i} with i = 1, ..., n, the local linear kernel estimate at the point x_t is given by the solution of

\min_{\beta} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 (x_i - x_t)\right)^2 K_\lambda(x_t - x_i),   (3.2)

where \beta is the vector of coefficients, i.e. \beta = (\beta_0, \beta_1)^T. Again the matrix W is defined as the diagonal weight matrix of size n whose i-th diagonal element is given by W_{ii} = K_\lambda(x_t - x_i). The (n \times 2)-matrix X is now defined as follows

X = \begin{pmatrix} 1 & x_1 - x_t \\ 1 & x_2 - x_t \\ \vdots & \vdots \\ 1 & x_n - x_t \end{pmatrix}.

Then the minimization problem in (3.2) can be solved by the usual formula for a weighted least squares estimator, i.e.

\hat{\beta} = \left(X^T W X\right)^{-1} X^T W y,


which now yields the two-dimensional vector \hat{\beta} = (\hat{\beta}_0, \hat{\beta}_1)^T. Here \hat{\beta}_0 estimates the regression function at the point x_t and \hat{\beta}_1 is an estimator for the corresponding partial derivative.

For obtaining the hat matrix M_\lambda we now define Z_t as the row vector

Z_t = (1 \ \ 0) \left(X^T W X\right)^{-1} X^T W,

which is again of size (1 \times n). Similarly as before, the vector Z_t depends on the value of x_t; in this case, however, not only the weight matrix W but also the matrix X varies with x_t.
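A direct sketch of the local linear fit, looping over the design points, is given below; the function name is ours and no safeguard is included against bandwidths so small that X^T W X becomes singular.

```python
import numpy as np

def local_linear_fit(x, y, lam, kernel):
    """Local linear kernel estimate at every design point.  For each x_t the
    weighted least squares problem (3.2) is solved and beta_0 is kept."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    h = np.empty(n)
    for t in range(n):
        w = kernel((x[t] - x) / lam)                  # W_ii = K_lam(x_t - x_i)
        X = np.column_stack([np.ones(n), x - x[t]])
        XtW = X.T * w                                 # X^T W
        beta = np.linalg.solve(XtW @ X, XtW @ y)
        h[t] = beta[0]                                # beta_0 estimates m(x_t)
    return h

# e.g. h = local_linear_fit(x, y, lam=0.1, kernel=epanechnikov)
```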


Chapter 4

Bandwidth Selection

4.1 Introduction

To this point we have simply assumed that a value of the smoothing parameter \lambda was given and we have shown how one can perform a spline or kernel regression. However, although not made explicit in some types of methods, choosing a suitable value for the smoothing parameter is imperative in any type of curve estimation. In polynomial regression, for example, choosing the degree of the polynomial is essentially equivalent to the selection of the smoothing parameter. In spline and kernel regression the smoothing parameter \lambda is made explicit in the method. In this chapter we will discuss how one can select a suitable value for the smoothing parameter.

For the problem of selecting a suitable value for the smoothing parameter different types of methods exist. One of them is a trial and error method, where arbitrary values for \lambda are selected until one is found that visually yields a satisfactory fit. This method is both time consuming and very subjective. Especially if similar bivariate relationships are to be analyzed, like education and income for various countries, one may prefer some objective data-driven method for selecting \lambda. In this chapter we will present two such data-driven methods for estimating the smoothing parameter.


4.2 Cross-validation and Generalized Cross-validation

Assume again that we have data {x_i, y_i} with i = 1, ..., n and that the vectors x and y are related according to the model in (2.1). For a given value of the smoothing parameter we have shown in chapters 2 and 3 how to compute a spline and a kernel regression respectively. Let now h_i denote the value of the estimated regression function at the point x_i and similarly denote the true underlying function m evaluated at x_i by m_i. A natural measure to evaluate the quality of the estimated function is given by the "mean squared error" (MSE), which we can now define as

MSE(h) = \frac{1}{n} \sum_{i=1}^{n} (h_i - m_i)^2.

But instead of focusing on the squared deviations between the vectors h and m, we are interested in the expectation of the MSE if we were to use another set of data that was again drawn from the model in (2.1). The expected mean squared error of the estimator h can be decomposed into a bias part and a variance part as follows

E\{MSE(h)\} = \frac{1}{n} \sum_{i=1}^{n} var(h_i) + \frac{1}{n} \sum_{i=1}^{n} bias(h_i)^2.   (4.1)

Both the variance and the bias of the estimator are important, and in order to obtain a good estimate of the underlying regression function we would like to choose the smoothing parameter \lambda in such a way that the expected MSE in (4.1) is minimal. Therefore we are left with the problem of minimizing both the bias and the variance of the estimated regression function.


The leave-one-out cross-validation score, denoted CV, is defined as

CV(h) = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - h^{-i}(x_i)\right)^2,

where h^{-i} is the estimate of the regression function h based on all observations except {x_i, y_i}. The intuition for cross-validation is that it is a nearly unbiased estimate of the predictive risk, which is defined as the sum of the squared bias and the variance of the estimated regression function h. It might seem that computing the cross-validation score is very time consuming, since we apparently need to compute the estimator n times, dropping one observation at a time. Luckily there is a shortcut formula for computing the cross-validation score in the case of "linear smoothers". The formal definition of a linear smoother as it is stated in Wasserman (2006) is as follows

Definition 4.1 An estimator h of m is a linear smoother if, for each x_t, there exists a vector \ell(x_t) such that

h(x_t) = \sum_{i=1}^{n} \ell_i(x_t)\, y_i,

where \ell_i(x_t) denotes the i-th element of the weight vector \ell(x_t).

Note that both non-parametric estimators that are discussed in this paper, i.e. spline and kernel regression, are linear smoothers. The shortcut formula for a linear smoother, as it is defined in amongst others Wasserman (2006) and Green and Silverman (1993), states that we can write the leave-one-out cross-validation score as follows

CV(h) = \frac{1}{n} \sum_{i=1}^{n} \left(\frac{y_i - h_i}{1 - M_{ii}}\right)^2,   (4.2)

where M_{ii} denotes the i-th diagonal element of the hat matrix.
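For a linear smoother the shortcut (4.2) is a one-liner once the hat matrix is available; the helper name below is ours.

```python
import numpy as np

def cv_score(y, h, M):
    """Leave-one-out cross-validation score via the shortcut formula (4.2)
    for a linear smoother with hat matrix M and fitted values h = M y."""
    r = (y - h) / (1.0 - np.diag(M))
    return np.mean(r ** 2)
```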


In generalized cross-validation each diagonal element of the hat matrix in (4.2) is replaced by its average value, which is given by n^{-1} \sum_{i=1}^{n} M_{ii} = n^{-1} tr(M). Recall from section 2.4 that the equivalent degrees of freedom for noise was defined as n minus the trace of the hat matrix. In vector notation, when using generalized cross-validation we therefore minimize

GCV(h) = \frac{(y - h)^T (y - h)}{df^2},   (4.3)

where df denotes the equivalent degrees of freedom for noise as it was defined in section 2.4. In general the bandwidth parameter that minimizes the generalized cross-validation score is close to the bandwidth that will minimize the leave-one-out cross-validation score, see Wasserman (2006).

When using spline or kernel regression we can now select the bandwidth parameter in such a way that the generalized cross-validation score in (4.3) is minimal. An elegant and robust algorithm for finding a minimizer is the “golden section search” which was introduced by Kiefer (1953). We will use this search algorithm combined with parabolic interpolation in order to find the value of λ that minimizes the generalized cross-validation score, see also Brent (2002).
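A sketch of the resulting selection procedure is given below. It assumes a helper `smoother(x, lam)` that returns the hat matrix of the chosen linear smoother (such as the local constant sketch in section 3.2) and uses SciPy's bounded minimiser, which combines golden-section-type bracketing with parabolic interpolation; the search bounds are illustrative.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def gcv_score(lam, x, y, smoother):
    """GCV(h) = (y - h)^T (y - h) / df^2 for the linear smoother whose hat
    matrix is returned by smoother(x, lam)."""
    M = smoother(x, lam)
    h = M @ y
    df = len(y) - np.trace(M)
    return (y - h) @ (y - h) / df ** 2

# minimisation over log(lambda), e.g. with the local constant smoother:
# res = minimize_scalar(lambda u: gcv_score(np.exp(u), x, y, local_constant_hat_matrix),
#                       bounds=(-10.0, 5.0), method="bounded")
# lam_gcv = np.exp(res.x)
```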

Using data examples from Härdle (1991), we present in figure 4.1 the corresponding estimated regression functions using first- and second-order spline regression, where the bandwidth parameter \lambda was selected by minimizing the generalized cross-validation score.


Figure 4.1 – Data examples from Härdle (1991) with a first-order spline regression in panels (a) to (c) and a second-order spline regression in panels (d) to (f). Panels (a) and (d) show data example A, already known from figure 2.1. Panels (b) and (e) show data example B, which contains two outliers, and the Old Faithful data (eruption duration versus waiting time, both in minutes) in the remaining two panels illustrates the difficulties of smoothing real data. The dashed black lines for the data examples A and B represent the true and known function; the dashed blue lines show a 95% point-wise confidence interval.

4.3 MM-Estimator


Consider again the data {x_i, y_i} with i = 1, ..., n and assume that x and y are related according to the model in (2.1). We define y = (y_1, ..., y_n)^T, \varepsilon = (\varepsilon_1, ..., \varepsilon_n)^T and the vector m = (m_1, ..., m_n)^T, where m_i denotes the true underlying regression function m evaluated at the point x_i. Using vector notation we can now write the model in (2.1) as

y = m + \varepsilon,

where the elements of \varepsilon are uncorrelated random variables with zero mean and common variance \sigma^2. Recall that in both spline and kernel regression the estimated regression function is given by h_\lambda = M_\lambda y, where M_\lambda is the hat matrix of the corresponding linear smoother and the dependence on the bandwidth parameter \lambda is now made explicit.

Let us now focus on the variance of the elements of the vector \varepsilon. Since the \varepsilon_i are independently and identically distributed with zero mean, one can easily verify that

E\left\{\frac{\varepsilon^T (I - M_\lambda) \varepsilon}{df}\right\} = \sigma^2,

where df denotes the equivalent degrees of freedom for noise, which was defined in section 2.4 and is equal to n minus the trace of the matrix M_\lambda. In addition, from the weak law of large numbers, we may safely assume, although M_\lambda should satisfy some regularity conditions, that for n large enough

\frac{\varepsilon^T (I - M_\lambda) \varepsilon}{df} \approx \sigma^2.   (4.4)

Notice that the approximate relationship in (4.4) is invariant to changes in the bandwidth parameter \lambda. This fact will be exploited by the MM-estimator in order to obtain a suitable value for the smoothing parameter. Consider now the following equality

E\left\{\frac{(y - m)^T (I - M_\infty)(y - m)}{n - k} - \frac{(y - m)^T (I - M_\lambda)(y - m)}{df}\right\} = 0,   (4.5)


Since the vector m is unknown and has to be estimated by h_\lambda, the "sample analogue" of the population moment condition in (4.5) is given by

C(\lambda) = \frac{(y - h_\lambda)^T (I - M_\infty)(y - h_\lambda)}{n - k} - \frac{(y - h_\lambda)^T (I - M_\lambda)(y - h_\lambda)}{df} = \frac{(y - h_\lambda)^T (y - h_\lambda) - p}{n - k} - \frac{(y - h_\lambda)^T (I - M_\lambda)(y - h_\lambda)}{df},   (4.6)

where p = (y - h_\lambda)^T M_\infty (y - h_\lambda). In case of spline regression we will now prove that p = 0.

An algebraic proof of this equality would be quite involved and therefore we use a more elegant proof by contradiction.

In section 2.3 of Green and Silverman (1993) the authors discuss the existence, uniqueness and optimality of the natural cubic spline and summarize their work in the following theorem.

Theorem 4.1 Suppose that we have data {x_i, y_i} with i = 1, ..., n, n \ge 3, and that a < x_1 < ... < x_n < b. Given a strictly positive smoothing parameter \lambda, let h be the natural cubic spline with knots at the points x_1, ..., x_n for which h = (I + \lambda K)^{-1} y, see also (2.10). Then, for any function h^* on [a, b] with continuous first and second derivatives,

S_\lambda(h) \le S_\lambda(h^*)

with equality only if h^* and h are identical. Here the functional S_\lambda denotes the penalized sum of squares as it was defined in (2.6).


Theorem 4.2 Let {x_i, y_i} with i = 1, ..., n be the values of x and y and consider a strictly positive smoothing parameter \lambda. Let h_\lambda denote the first- or second-order estimated spline regression, which can be written as h_\lambda = M_\lambda y. Here the hat matrix equals either M_\lambda = (I + \lambda F \Delta^{-2} F^T)^{-1} in case of first-order spline regression or M_\lambda = (I + \lambda Q R^{-1} Q^T)^{-1} for second-order spline regression. Then we have that

M_\infty (y - h_\lambda) = 0,

where M_\infty is the corresponding hat matrix for an infinite smoothing parameter.

Proof. Suppose that M(y − hλ) , 0 and note that for λ → ∞ the estimated regression

func-tion hλ converges to either a horizontal line in case of first-order spline regression or a straight line for second-order spline regression. This is because the roughness, which is defined as the integrated squared first and second derivative respectively, is penalized in-finitely.

Effectively, for first-order spline regression the penalized sum of squares Sλ(h) in (2.3) is therefore minimized over all functions on [a, b] for which the first derivative is zero every-where, so that hλequals the mean of y everywhere. Similarly, in case of second-order spline regression the penalized sum of squares in (2.6) is minimized over all functions on [a, b] with zero second derivative everywhere, which yields the ordinary least square regression of y on x.

We assumed that M(y − hλ) , 0, subtracting these values from hλwill yield a lower sum of

squares while the roughness penalty remains unchanged. Therefore hλis not the minimizer of the penalized sum of squares Sλ(h), this contradicts the optimality stated in theorem 4.1 and we conclude that in first- and second-order spline regression we have that for any λ > 0 M(y − hλ) = 0.


In local constant kernel regression one can easily obtain that for \lambda \to \infty the hat matrix converges to

M_\infty = \begin{pmatrix} 1/n & 1/n & \cdots & 1/n \\ 1/n & 1/n & \cdots & 1/n \\ \vdots & \vdots & \ddots & \vdots \\ 1/n & 1/n & \cdots & 1/n \end{pmatrix}.

Therefore, in local constant kernel regression all elements of the vector M_\infty (y - h_\lambda) are equal to the mean of y - h_\lambda. Premultiplying this result by the row vector (y - h_\lambda)^T yields the correct value for p.

For local linear kernel regression a similar argument holds true. Here we have that for \lambda \to \infty the hat matrix converges to

M_\infty = X \left(X^T X\right)^{-1} X^T,

where X = (\iota \ \ x) and \iota denotes the n-vector of ones. So by means of a simple ordinary least squares regression we can obtain the value for p.

Given the data and a non-parametric estimate h_\lambda, we have shown how one can obtain the value for p in (4.6). The second term in (4.6) can be obtained by performing another non-parametric regression of h_\lambda on x. Then, similarly as for generalized cross-validation, we can use the golden section search algorithm and parabolic interpolation to find the value of \lambda such that (4.6) is as close to zero as possible. One can show that in the golden section search algorithm the width of the full search interval reduces at an optimal rate, see Press, Flannery, Teukolsky, and Vetterling (1992).
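The criterion (4.6) can be sketched as follows. The limiting hat matrix is taken to be that of an ordinary least squares line, as derived above for second-order splines and local linear regression, and k is set to tr(M_\infty) = 2; this value of k is our assumption for the sketch, as is the `smoother(x, lam)` interface.

```python
import numpy as np

def mm_criterion(lam, x, y, smoother):
    """Sample moment condition C(lambda) of (4.6) for a linear smoother whose
    hat matrix is returned by smoother(x, lam).  The lambda -> infinity limit
    is assumed to be an OLS line and k = tr(M_inf) = 2 (an assumption)."""
    n = len(y)
    M = smoother(x, lam)
    r = y - M @ y                                     # residuals y - h_lambda
    X = np.column_stack([np.ones(n), x])
    M_inf = X @ np.linalg.solve(X.T @ X, X.T)         # hat matrix for lambda = infinity
    p = r @ (M_inf @ r)                               # zero for spline smoothers (theorem 4.2)
    k = 2
    df = n - np.trace(M)
    return (r @ r - p) / (n - k) - r @ (r - M @ r) / df

# the MM bandwidth makes |C(lambda)| as small as possible over a suitable
# search range, e.g. with the same bounded Brent-type search used for GCV
```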


Figure 4.2 – Data examples from Härdle (1991) with a first-order spline regression in panels (a) to (c) and a second-order spline regression in panels (d) to (f). Here the bandwidth parameters are selected by the MM-estimator. Again the dashed black lines for the data examples A and B represent the true and known function; the dashed blue lines show a 95% point-wise confidence interval.

Note that the estimated regression functions in panels (b) and (e) are much closer to the underlying true function than the regression functions obtained when the bandwidth parameter was chosen by generalized cross-validation. Based on this result we expect the MM-estimator to be more robust to outliers.


Another problem regarding the MM-estimator is that if all x-values are unique we have that, for λ → 0 and λ → ∞, C(λ) in (4.6) tends to zero, whereas we generally like to avoid these extremes. Therefore selecting an appropriate search range for the minimization of the absolute value of C(λ) might be important.


Chapter 5

Monte Carlo Simulations

5.1 Introduction

In the preceding chapters we have discussed how one can perform a non-parametric regression, either by means of spline or kernel regression. Additionally we have discussed how a suitable value for the smoothing parameter \lambda can be selected, using either generalized cross-validation or the newly proposed MM-estimator.

This chapter is concerned with a simulation study that evaluates the performance of the MM-estimator compared to the more conventional generalized cross-validation. We will follow the approach of Hurvich, Simonoff, and Tsai (1998), in which the performance of an alternative bandwidth selection method is also evaluated.

We will evaluate the performance of both bandwidth selection methods for a variety of possible regression problems. The Monte Carlo simulations examine the performance of both selection methods as they relate to the true regression function, the standard deviation of the errors and the regression estimator that is used. Consider once more the model in (2.1), i.e. for data {x_i, y_i} with i = 1, ..., n we assume that the variables x and y are related according to the model

y_i = m(x_i) + \varepsilon_i, \quad i = 1, ..., n.   (5.1)


The purpose of regression analysis is now to estimate the true underlying regression function m. In this chapter we will generate various artificial data sets and measure the quality of both bandwidth selection methods by judging how well the true regression function m is estimated. In this simulation study the role of the true underlying regression function m will be played by each of the following four functions

m_1(x) = \begin{cases} \exp\{x - 1/3\} & \text{for } x < 1/3 \\ \exp\{-2(x - 1/3)\} & \text{for } x \ge 1/3 \end{cases}
m_2(x) = 0.3 \exp\{-64 (x - 0.25)^2\} + 0.7 \exp\{-256 (x - 0.75)^2\}
m_3(x) = 10 \exp\{-10x\}
m_4(x) = 1 - 48x + 218x^2 - 315x^3 + 145x^4.

All of these functions have been used in earlier Monte Carlo simulations, see Hurvich et al. (1998), Herrmann (1997) and Ruppert, Sheather, and Wand (1995). The predictor variables x_1, ..., x_4 are all equally spaced points on the unit interval and we draw n = 150 observations for each. The regression functions m_1, ..., m_4 all have different characteristics: m_1 has an undefined first derivative at x = 1/3 and thus violates the standard assumption of smoothness, m_2 shows a degree of curvature that varies strongly with the value of the predictor, m_3 is a function with a trend but no fine structure, and m_4 shows little fine structure and a trend. In Hurvich et al. (1998) the function m_4 is argued to be in some sense "typical" of many regression situations. In figure 5.1 we present the regression functions m_1, ..., m_4 graphically.


Figure 5.1 – A graphical representation of the four test functions m_1, ..., m_4 used in the Monte Carlo simulations, where they serve as the true underlying regression function which is to be estimated. Panels (a) to (d) show the functions m_1 to m_4 as defined above, all on the unit interval.
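For completeness, the four test functions and the equally spaced design translate directly into code (the names are ours):

```python
import numpy as np

def m1(x):
    return np.where(x < 1/3, np.exp(x - 1/3), np.exp(-2 * (x - 1/3)))

def m2(x):
    return 0.3 * np.exp(-64 * (x - 0.25) ** 2) + 0.7 * np.exp(-256 * (x - 0.75) ** 2)

def m3(x):
    return 10 * np.exp(-10 * x)

def m4(x):
    return 1 - 48 * x + 218 * x ** 2 - 315 * x ** 3 + 145 * x ** 4

# design: n = 150 equally spaced points on the unit interval
x = np.linspace(0.0, 1.0, 150)
```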


Performing the non-parametric regressions while varying all the above factors still leaves us with the problem of measuring the quality of the estimated regression function. A natural measure for the goodness of fit is the root mean squared error (RMSE) which quantifies the difference between the regression function h and the underlying function m. However, since generalized cross-validation also minimizes a squared distance, we will additionally measure the quality of the estimated regression function by the mean absolute deviation (MAD).

In the Monte Carlo simulations we will consider every possible combination of the above factors. In order to say something about the quality of the estimated regression functions with λ selected by generalized cross-validation or the MM-estimator, we are interested in the optimal scenario. The best possible fit, either in terms of the RMSE or the MAD, depends on the type of regression used. We will start by computing this optimal fit by minimizing the RMSE or MAD between the regression function h and the underlying function m, given the values for x and y. It is important to realize that this is a different estimation problem than simply regressing m on x. The resulting optimal RMSE or MAD is now the best possible value for that particular type of regression.

After we have determined the optimal values of the RMSE and MAD, we suppose that the true function m is unknown and perform the regressions where the smoothing parameter \lambda is selected by using generalized cross-validation or the MM-estimator. The corresponding RMSE and MAD are then divided by their corresponding optimal values, so that the bandwidth selection method that yields the ratio closest to 1 outperforms the other.
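The two quality measures and the ratio-to-optimum comparison are sketched below; `h_selected`, `h_optimal` and `m_true` are placeholder names for the data-driven fit, the best achievable fit and the true function respectively.

```python
import numpy as np

def rmse(h, m):
    return np.sqrt(np.mean((h - m) ** 2))

def mad(h, m):
    return np.mean(np.abs(h - m))

# performance of a bandwidth selector relative to the best achievable fit for
# the same estimator; values closer to 1 are better
# ratio = rmse(h_selected, m_true) / rmse(h_optimal, m_true)
```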
