Semiparametric Regression

Zoa de Wijn

June 27, 2016

Bachelor Project
Supervisor: dr. B.J.K. Kleijn

Korteweg-de Vries Institute for Mathematics
Faculty of Science


Abstract

The well-known linear regression model is commonly used for determining causal relationships between random variables. However, this frequently used method has some significant drawbacks: the errors are assumed to be normally distributed, measurement errors are taken into account in the dependent variables exclusively and only linear components occur in the regression analysis. In this thesis, we will treat three important semiparametric generalizations of the parametric linear regression model: linear regression with symmetric errors, errors-in-variables regression and partial linear regression. In the linear regression model with symmetric errors, we construct an adaptive estimator by combining score function estimates with Fisher information estimates. As a second generalization, we consider both the linear and general version of the functional and structural errors-in-variables model. In the linear generalization, we show that the maximum likelihood estimator is consistent in the functional version and even efficient in the structural version. For the general errors-in-variables model, we focus on the main idea behind the construction of consistent estimators without going into too much detail. Finally, we show that efficient estimation is also possible in the partial linear regression model. Hence, it appears that the parametric linear regression model can be extended to the linear regression model with symmetric errors without any asymptotic efficiency loss. On the other hand, generalizing parametric linear regression to either errors-in-variables regression or partial linear regression will induce an asymptotic efficiency loss. In general, the advantages of using semiparametric regression models are the decreased chance of model misspecification and improved flexibility. On the other hand, the main disadvantages are the asymptotic efficiency loss of efficient estimators and greater complexity when applying the semiparametric theory to concrete examples.

Title: Semiparametric Regression

Author: Zoa de Wijn, zoa.dewijn@student.uva.nl, 10639969
Supervisor: dr. B.J.K. Kleijn

Second grader: dr. P.J.C. Spreij
Date: June 27, 2016

Korteweg-de Vries Institute for Mathematics
University of Amsterdam

Science Park 904, 1098 XH Amsterdam
http://www.science.uva.nl/math


Contents

1. Introduction
   1.1. Frequentist Statistics and Asymptotics
   1.2. Parametric Linear Regression

2. Semiparametric Efficiency
   2.1. Maximum Likelihood Estimation
   2.2. Parametric Efficiency
   2.3. Semiparametric Efficiency
   2.4. Adaptivity
   2.5. Symmetric Location

3. Semiparametric Regression Models
   3.1. Generalizations of Parametric Linear Regression
   3.2. Linear Regression with Symmetric Errors
   3.3. Semiparametric Errors-in-Variables
   3.4. Partial Linear Regression

4. Conclusion

5. Popular Summary

A. Basic Theory
   A.1. Stochastic Convergence
   A.2. Theorems on Stochastic Convergence


1. Introduction

1.1. Frequentist Statistics and Asymptotics

The goal of statistical regression analysis is to describe, quantify and estimate aspects of the relationships between random variables. More specifically, statistical regression analysis helps researchers to understand how the values of dependent variables change when the explanatory variables are varied. Regression analysis is widely used for forecasting and prediction, but it is also used to understand the causal relationships between random variables.

Throughout this thesis, we will only consider frequentist inferential procedures, relying on the central assumption that the data has a definite underlying distribution P0 to which all the inference pertains. Moreover, we will often adopt the following customary assumption in frequentist statistics regarding the well-specifiedness of statistical models.

Definition 1.1 (Well-specified). A statistical model P is well-specified if it contains the true distribution P0 of the data.

A common way of presenting a statistical model is a description in terms of a parameterization. In this way, we can identify the model distributions Pθ ∈ P with parameters θ from the parameter space Θ.

Definition 1.2 (Parameterization). A model P is parameterized with parameter space Θ if there exists a surjective map Θ → P : θ 7→ Pθ.

If a model is parameterized, it is desirable that each parameter represents a unique probability distribution. Throughout this thesis, we will only consider estimation problems in so-called identifiable statistical models.

Definition 1.3 (Identifiability). A parameterization of the model P is identifiable if the map Θ → P : θ 7→ Pθ is injective.

In general, different statistical experiments may give rise to fundamentally different statistical models. The most simple and approachable statistical models are identifiable by a finite-dimensional parameter space Θ. Models for which this property holds are called parametric models.

Definition 1.4 (Parametric model). A model P is called a parametric model if there exists an identifiable parameterization f : Θ → P : θ 7→ Pθ, with Θ finite-dimensional. The set of parameters is usually denoted by Θ and the model is written as

P = {Pθ : θ ∈ Θ}.

Sometimes it is not possible to identify the model P with a finite-dimensional parameter space Θ. Models with this property are called non-parametric models.

Definition 1.5 (Non-parametric model). A model P is called a non-parametric model if there exists no finite-dimensional Θ that parameterizes P.

In general, large models have a better chance of being well-specified than small models. Therefore, the statistician might prefer to use non-parametric models rather than parametric models in order to avoid model misspecification. Throughout this thesis, we are primarily interested in a type of model that can be seen as a combination of a parametric and a non-parametric model, called the semiparametric model.

Definition 1.6 (Semiparametric model). A model P is called semiparametric if there exists a parameterization of P such that the parameters have both a finite-dimensional and an infinite-dimensional component. The set of parameters is usually denoted by Θ × H and the model is written as

P = {Pθ,η : θ ∈ Θ, η ∈ H}.

In particular, semiparametric models can be applied to situations where the researcher wants to answer a parametric statistical question in combination with a non-parametric statistical model. The semiparametric approach is especially useful when the researcher wants to use a model that is less restrictive than parametric models, but more informative than non-parametric models. Moreover, the semiparametric approach has the obvious advantage that there is less risk that the model is misspecified.

In this thesis, we will frequently deal with the theory of asymptotic statistics in order to assess the asymptotic properties of estimators. Because it is useful to get familiar with some basic concepts prior to treating the second chapter, we will first introduce the definitions of consistency and asymptotic normality in a parametric setting. Consider an i.i.d. sample X1, . . . , Xn, corresponding to the parametric model P = {Pθ : θ ∈ Θ}.

An estimation procedure prescribes a sequence of estimators ˆθn for the parameter of interest, indexed by the size n of the sample used to calculate it. Asymptotic statistical theory is used to describe the behaviour of such an estimation procedure when the sample size n goes to infinity. An intuitively reasonable requirement for any estimation procedure is a property known as consistency. This property states that the sequence ˆθn approaches the true parameter value as the sample size n approaches infinity.

Definition 1.7 (Asymptotic consistency). Let P = {Pθ : θ ∈ Θ} be a parametric model and let ˆθn be a sequence of estimators. If the sequence of estimators ˆθn converges to θ0 in Pθ0-probability, then ˆθn is said to be consistent for θ0.

Apart from the definition of asymptotic consistency, it is also important to introduce the definition of asymptotic normality. This property plays an important role in various theorems and the notion of asymptotic efficiency.

Definition 1.8 (Asymptotic normality). Let P = {Pθ : θ ∈ Θ} be a parametric model. A sequence of estimators ˆθn for the parameter θ0 is said to be asymptotically normal if

    \sqrt{n}(\hat{\theta}_n - \theta_0) \rightsquigarrow N(0, \Sigma_{\theta_0}),   (1.1)

where Σθ0 is often referred to as the asymptotic covariance matrix of ˆθn.

Clearly, we can compare the quality of estimator sequences ˆθn for which the property of asymptotic normality holds by comparing the corresponding covariance matrices. In particular, under certain restrictions, it will turn out that a sequence of estimators ˆθn is efficient if the asymptotic covariance matrix Σθ0 equals the Cramér-Rao lower bound. In the upcoming sections, we will treat the notion of efficiency in more detail and introduce the class of so-called regular estimators, over which asymptotic optimality is achieved. We do not wish to assume a priori that the estimators are asymptotically normal; that normal limits are best will actually be an interesting conclusion. After treating the theory of parametric efficiency, we will consider the problem of efficient estimation in semiparametric models as well.

Throughout this thesis, we will discuss several examples of semiparametric regression models, which can be seen as generalizations of the simple parametric linear regression model. Our main goal will be to find asymptotically optimal estimates for three semiparametric generalizations of the linear regression model. In order to get a clear understanding of those generalizations, we will use the next section to review some basic results regarding parametric linear regression, discuss various limitations of the standard approach and put forward three generalized semiparametric regression models that offer more flexibility.

1.2. Parametric Linear Regression

In this section we give a brief overview of some basic results regarding statistical inference in parametric linear regression models. We will first discuss the most common version of the parametric linear regression model, after which we mention various estimation procedures and highlight some limitations of the parametric linear regression model. Suppose the data consist of n different observations (Yi, Xi1, . . . , Xik), for i = 1, . . . , n.

We will refer to Yi as the dependent variables and to Xi = (Xi,1, . . . , Xi,k) as the regressors, covariates or explanatory variables. Moreover, we will refer to the differences Yi − XiTθ as the residuals of the observations. In statistics, the linear regression model is used for describing scalar relationships between a dependent variable Y and one or more explanatory variables X. If the random variable Y depends on a single explanatory variable, we will refer to this method as simple linear regression. On the other hand, if the random variable Y depends on more than one explanatory variable, then the process is called multiple linear regression. The linear relationship between the dependent and explanatory variables is modelled through a parameter θ ∈ Rk and some unobserved error variable ε, adding noise to the linear relationship between the observed variables.

There are several different frameworks in which the linear regression model is used and the specific choice of framework depends on the nature of the data and on the inference task which has to be performed. In the case of random design, the explanatory variables are treated as random variables. In the other interpretation, called fixed design, the explanatory variables are treated as known constants. In the standard linear regression model, it is assumed that the regressors are non-random, denoted by xi = (xi,1, . . . , xi,k). Hence, the model describing the relationship between n dependent variables Y1, . . . , Yn and k-dimensional explanatory variables (x1,1, . . . , x1,k), . . . , (xn,1, . . . , xn,k) takes the form

    Y_i = \sum_{j=1}^{k} \theta_j x_{i,j} + \varepsilon_i,   (1.2)

where ε1, . . . , εn are assumed to be i.i.d. random variables, representing the noise in the measurements. More specifically, in the standard linear regression model it is assumed that εi ∼ N(0, σ2), with finite variance σ2. Note that the observations Y = (Y1, . . . , Yn)T form a vector in Rn and the regression coefficients θi are the components of a k-dimensional vector θ = (θ1, . . . , θk)T. If we define the (n × k)-matrix X as the matrix containing the elements xi,j, then we can rewrite the model as

Y = Xθ + ε, (1.3)

with ε = (ε1, . . . , εn)T the error-vector. The statistical inference mainly focuses on the

parameter θ and in many situations a constant is included in the regression analysis by taking xi,1 = 1, for i = 1, . . . , n. The corresponding element of θ is commonly referred

to as the intercept.

A large number of estimation procedures has been developed for the parametric linear regression model and the most common procedures are ordinary least-squares and maximum likelihood estimation. When using the ordinary least-squares procedure, the goal is to minimize the sum of squared residuals Yi − xiTθ of the model. In particular, the ordinary least-squares estimator for the model (1.2) minimizes

    S(\theta) = \sum_{i=1}^{n} (Y_i - x_i^T\theta)^2 = (Y - X\theta)^T(Y - X\theta).   (1.4)

This results in the estimator

    \hat{\theta}_n = (X^TX)^{-1}X^TY.   (1.5)

Consequently, after we have estimated θ, we can define the predicted value Ỹpr for Y, given covariates X, to be equal to the conditional expectation

    \tilde{Y}_{pr} = E_{\hat{\theta}_n}[Y \mid X] = X\hat{\theta}_n.   (1.6)

More precisely, X ˆθn = X(XTX)−1XTY = PX Y, where PX represents the projection matrix onto the space spanned by the columns of X. In the simple linear regression model, the ordinary least-squares estimator is identical to the maximum likelihood estimator. The following theorem states under which conditions we can determine the maximum likelihood estimator in a straightforward manner.


Theorem 1.9. Assume that the design matrix X in the multiple linear regression model (1.2) has full rank and that the errors εi are normally distributed, with mean zero and variance σ2. Then, the maximum likelihood estimators for θ and σ2 are given by

    \hat{\theta}_n = (X^TX)^{-1}X^TY, \qquad \hat{\sigma}_n^2 = \frac{\|Y - X\hat{\theta}_n\|^2}{n}.   (1.7)

The maximum likelihood estimator for θ clearly corresponds to the ordinary least-squares estimator, which minimizes the sum of squared residuals. Namely, the maximum likelihood estimator for θ maximizes the log-likelihood

    \theta \mapsto \log \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(Y_i - x_i^T\theta)^2}{2\sigma^2}}   (1.8)
        = -\tfrac{n}{2}\log(2\pi) - \tfrac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - x_i^T\theta)^2,   (1.9)

which is equivalent to minimizing the sum of squared residuals (1.4). Hence, the maximum likelihood estimator for θ corresponds to the ordinary least-squares estimator. Consequently, from the properties of the maximum likelihood estimator, we can infer the behaviour of the ordinary least-squares estimator.
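To make the closed-form estimators of (1.5)-(1.7) concrete, the following Python snippet is a minimal numerical sketch (our own illustration, not part of the thesis): it simulates data from model (1.3) with Gaussian errors and computes the least-squares/maximum-likelihood estimates and the fitted values via the projection matrix PX.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate data from Y = X theta + eps with Gaussian errors (model (1.3)).
n, k = 200, 3
theta_true = np.array([1.0, -2.0, 0.5])
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])  # intercept column x_{i,1} = 1
eps = rng.normal(scale=1.5, size=n)
Y = X @ theta_true + eps

# Ordinary least-squares / maximum-likelihood estimate (1.5): (X^T X)^{-1} X^T Y.
theta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# Maximum-likelihood estimate of the error variance (1.7): ||Y - X theta_hat||^2 / n.
sigma2_hat = np.sum((Y - X @ theta_hat) ** 2) / n

# Fitted values (1.6) via the projection matrix P_X onto the column space of X.
P_X = X @ np.linalg.solve(X.T @ X, X.T)
Y_pred = P_X @ Y  # equals X @ theta_hat

print(theta_hat, sigma2_hat)
```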

Despite the broad range of applications of the linear regression model as described above, there are some obvious limitations. First of all, the assumption that the regressors are not contaminated with measurement errors is not realistic in many settings. Dropping this assumption leads to the significantly more complicated errors-in-variables model. In fact, estimation based on the standard assumptions leads to inconsistent estimates, which means that the estimates do not converge to the true value of the parameter for large sample sizes.

Another obvious limitation of the standard approach is the assumption that the errors are normally distributed. Whenever the researcher believes that the errors are not normally distributed, the simple linear regression approach leads to model misspecification. In order to resolve this issue, it makes sense to extend the set of possible distributions H for ε in such a way that it contains a wider range of probability distributions on R that are symmetric about the origin. Not surprisingly, this approach gives rise to a semiparametric extension of the simple linear regression model as well.

In some cases, the researcher might wish to add an additive non-linear influence from another random variable of largely unknown form. In particular, suppose that we observe a random sample from the distribution of X = (V, W, Y) such that Y = θV + η(W) + ε, for some unobservable error ε independent of (V, W). This approach has the obvious advantage that the dependent variable Y may depend in a non-linear manner on V and W through η. This approach is called partial linear regression and it is apparent that letting η range over a broad range of nuisance functions leads to another semiparametric generalization of simple linear regression.


In this thesis, we discuss the suggested generalized semiparametric regression models. Our main goal is to find asymptotically optimal estimates for three semiparametric generalizations of the linear regression model. For this purpose, the second chapter formulates a theory of the asymptotic optimality of estimators in differentiable statistical models. In the upcoming chapter, we will treat some important theorems on both parametric and semiparametric efficiency. Furthermore, we discuss the concept of adaptivity and treat the well-known symmetric location model as an introductory example. In the last chapter, we discuss optimal estimation procedures in three semiparametric generalizations of the linear regression model: linear regression with symmetric errors, errors-in-variables regression and partial linear regression.


2. Semiparametric Efficiency

In this chapter we deal with the problem of semiparametric estimation from a frequentist perspective. Our purpose is to compare the asymptotic performance of estimators in smooth semiparametric models. Before considering the problem of efficient estimation in semiparametric models, we give a brief review of the central ideas regarding efficient estimation in parametric models, based on Van der Vaart (1998). We will discuss asymptotic lower bounds for locally asymptotically normal models and show that the maximum likelihood estimator is in some sense asymptotically efficient.

2.1. Maximum Likelihood Estimation

Before arriving at a theory of asymptotic efficiency, we review some important results on the consistency and asymptotic normality of M -estimators. Furthermore, we will treat maximum likelihood estimation as a special case of M -estimation and show that, under certain mild conditions, maximum likelihood estimation is efficient.

In statistics, a popular method for finding an estimator ˆθn is to maximize a criterion function Mn(θ). We will refer to such estimators as M-estimators.

Definition 2.1 (M-estimator). Let X1, . . . , Xn be i.i.d. observations and let mθ be known functions. Furthermore, let Mn be defined as

    \theta \mapsto M_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} m_\theta(X_i).   (2.1)

If ˆθn maximizes Mn(θ) over Θ, then ˆθn is called an M-estimator.

In many problems, the maximizing value is found by setting a system of partial derivatives equal to zero. Estimators satisfying this property are called Z-estimators.

Definition 2.2 (Z-estimator). Let X1, . . . , Xn be i.i.d. observations and let ψθ be known functions. Consider a system of equations of the type

    \Psi_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \psi_\theta(X_i) = 0.   (2.2)

If ˆθn satisfies Ψn(ˆθn) = 0, then ˆθn is called a Z-estimator.

For simplicity, we will use the term M -estimators for equations of type (2.1) as well as type (2.2). It is apparent that maximum likelihood estimation is a special kind of M -estimation. In particular, maximum likelihood estimators satisfy the following criterion.


Example 2.3 (Maximum likelihood estimation). Let X1, . . . , Xn be i.i.d. observations with common density pθ. Then the maximum likelihood estimator ˆθn maximizes the likelihood

    \theta \mapsto \prod_{i=1}^{n} p_\theta(X_i).   (2.3)

If the maximum likelihood estimator ˆθn maximizes the likelihood, then ˆθn also maximizes the log-likelihood criterion function

    \theta \mapsto \sum_{i=1}^{n} \log p_\theta(X_i),   (2.4)

or equivalently, the function

    M_n(\theta) = \frac{1}{n}\sum_{i=1}^{n} \log \frac{p_\theta}{p_{\theta_0}}(X_i).   (2.5)

Consequently, if the expectation of the summands in (2.5) under Pθ0 is finite, then the law of large numbers implies that the corresponding asymptotic criterion function equals

    M(\theta) = E_{\theta_0} \log \frac{p_\theta}{p_{\theta_0}}(X).   (2.6)

The number M(θ) is, up to a sign, the Kullback-Leibler divergence of pθ and pθ0.

Under the natural assumption that θ0 is identifiable, namely Pθ ≠ Pθ0 for every θ ≠ θ0, it can be shown that M attains a unique maximum.

Theorem 2.4 (Unique maximum). Let pθ be a collection of probability densities such that Pθ ≠ Pθ0 for every θ ≠ θ0. Then M(θ) = Eθ0 log pθ/pθ0 attains a unique maximum at θ0.

If the estimator ˆθn is used for estimating θ, then it is clearly desirable that the estimator ˆθn converges to θ in probability when the sample size n goes to infinity. If this is the case for every θ ∈ Θ, then the estimator sequence ˆθn is called asymptotically consistent.

The following theorem states conditions under which M -estimators are consistent for the parameter of interest. Before stating this theorem, it is convenient to introduce the definition of near maximizers.

Definition 2.5 (Near maximizer). Given a random function θ 7→ Mn(θ), we say that ˆθn nearly maximizes Mn if

    M_n(\hat{\theta}_n) \geq \sup_{\theta \in \Theta} M_n(\theta) - o_P(1).   (2.7)

Consequently, consider the following theorem on the asymptotic consistency of M -estimation, under the assumption that ˆθn nearly maximizes Mn.


Theorem 2.6. Let Mn be random functions and let M be a fixed function such that, for every ε > 0,

    \sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)| \xrightarrow{P} 0;   (2.8)

    \sup_{\{\theta : d(\theta, \theta_0) \geq \varepsilon\}} M(\theta) < M(\theta_0).   (2.9)

If ˆθn nearly maximizes Mn, then ˆθn converges in probability to θ0.

Proof. By definition 2.5, we directly derive that Mn(ˆθn) ≥ Mn(θ0) − oP(1). Moreover, assumption (2.8) implies that Mn(θ0) − M(θ0) → 0 in probability. It immediately follows that Mn(ˆθn) ≥ M(θ0) − oP(1).

By subtracting M(ˆθn) from both sides and using assumption (2.8) it follows that

    M(\theta_0) - M(\hat{\theta}_n) \leq M_n(\hat{\theta}_n) - M(\hat{\theta}_n) + o_P(1)   (2.10)
        \leq \sup_{\theta \in \Theta} |M_n(\theta) - M(\theta)| + o_P(1) \xrightarrow{P} 0.   (2.11)

Assumption (2.9) implies that, for every ε > 0, there exists a number δ > 0 such that M(θ) < M(θ0) − δ for every θ with d(θ, θ0) ≥ ε. This results in the inclusion

    \{d(\hat{\theta}_n, \theta_0) \geq \varepsilon\} \subseteq \{M(\hat{\theta}_n) < M(\theta_0) - \delta\}.

This inclusion in combination with (2.10) and (2.11) yields that

    P\big(d(\hat{\theta}_n, \theta_0) \geq \varepsilon\big) \leq P\big(M(\hat{\theta}_n) < M(\theta_0) - \delta\big) = P\big(M(\theta_0) - M(\hat{\theta}_n) > \delta\big) \to 0,

showing that ˆθn converges to θ0 in probability.
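As a toy numerical illustration of this consistency result (our own sketch, under the arbitrary choice of a normal location model, not taken from the thesis), the following Python code maximizes the criterion Mn of definition 2.1 with mθ(x) = log pθ(x) and shows the maximizer approaching θ0 as the sample size grows.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)

def m_hat(sample):
    """Maximize the M-criterion (2.1) with m_theta(x) = log p_theta(x)
    for the N(theta, 1) model, i.e. compute the maximum likelihood estimator."""
    def neg_Mn(theta):
        # Up to an additive constant, -M_n(theta) for the normal location model.
        return 0.5 * np.mean((sample - theta) ** 2)
    return minimize_scalar(neg_Mn).x

theta0 = 2.0
for n in (10, 100, 10_000):
    sample = rng.normal(loc=theta0, scale=1.0, size=n)
    print(n, m_hat(sample))  # the M-estimator approaches theta0 as n grows
```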

There exist several other theorems that provide sufficient conditions for the consistency of M-estimators, but we will not treat those results as they fall outside the scope of this thesis. The next question of interest concerns the rate at which the discrepancy ˆθn − θ converges to zero, for consistent estimates ˆθn. The two upcoming theorems enable the statistician to compare the asymptotic performance of M-estimators and construct approximate confidence sets.

Theorem 2.7. Assume that θ 7→ ψθ(x) is twice continuously differentiable in a neighbourhood U of θ0, for every fixed x. Suppose that Pψθ0 = 0, that P‖ψθ0‖2 < ∞ and that the matrix Pψ̇θ0 exists and is non-singular. Furthermore, assume that the second-order partial derivatives are dominated by some integrable function ψ̈(x) for every θ ∈ U. Then every consistent estimator ˆθn such that Ψn(ˆθn) = 0 satisfies

    \sqrt{n}(\hat{\theta}_n - \theta_0) \rightsquigarrow N\big(0,\ (P\dot{\psi}_{\theta_0})^{-1}\, P\psi_{\theta_0}\psi_{\theta_0}^T\, (P\dot{\psi}_{\theta_0})^{-1}\big).   (2.12)


In some situations, the conditions stated in the previous theorem are difficult to verify. In particular, the existence of the dominating integrable function ψ̈ can be problematic. The fact that various simple but important examples cannot be treated by the preceding approach has motivated the derivation of asymptotic normality of M-estimators by more refined methods. The following theorem offers an alternative way of showing that M-estimators are asymptotically normal, assuming a Lipschitz condition instead of two derivatives.

Theorem 2.8. For each θ in an open subset U of Rk, let x 7→ ψθ(x) be a vector-valued measurable function such that, for every θ1 and θ2 in a neighbourhood of θ0,

    \|\psi_{\theta_1}(x) - \psi_{\theta_2}(x)\| \leq \dot{\psi}(x)\, \|\theta_1 - \theta_2\|,   (2.13)

for some ψ̇ ∈ L2(P). Furthermore, assume that P‖ψθ0‖2 < ∞ and that θ 7→ Pψθ is differentiable at a zero θ0, with non-singular derivative matrix Vθ0. If Ψn(ˆθn) = Ψn(θ0) − oP(n−1) and ˆθn converges in probability to θ0, then

    \sqrt{n}(\hat{\theta}_n - \theta_0) \rightsquigarrow N\big(0,\ V_{\theta_0}^{-1}\, P\psi_{\theta_0}\psi_{\theta_0}^T\, (V_{\theta_0}^{-1})^T\big).   (2.14)

Under the assumption of identifiability and mild regularity conditions, the maximum likelihood estimator is consistent for the true parameter θ0. Moreover, in view of the previous results, we expect that the sequence √n(ˆθn − θ0) is asymptotically normal under Pθ0 with mean zero and covariance matrix

    \big(P_{\theta_0}\ddot{\ell}_{\theta_0}\big)^{-1}\, P_{\theta_0}\dot{\ell}_{\theta_0}\dot{\ell}_{\theta_0}^T\, \big(P_{\theta_0}\ddot{\ell}_{\theta_0}^T\big)^{-1}.   (2.15)

Under regularity conditions, it can be shown that this matrix reduces to the inverse of the Fisher information matrix Iθ0 = Pθ0 ℓ̇θ0 ℓ̇θ0T. Consequently, we conclude that

    \sqrt{n}(\hat{\theta}_n - \theta_0) \rightsquigarrow N(0, I_{\theta_0}^{-1}),   (2.16)

under Pθ0. In the next section, it will become clear that this is a very important result and that the maximum likelihood estimator is asymptotically optimal in the class of so-called regular estimators. However, the conditions under which the optimality of the maximum likelihood estimator is achieved are not entirely trivial. Therefore, we will dedicate the next sections to arriving at a sound theory of asymptotic efficiency.

2.2. Parametric Efficiency

In this section, we will consider efficient estimation in smooth parametric models. If we compare the results from this section with M -estimation, it will turn out that the maximum likelihood estimator has various asymptotically optimal properties. We first introduce two new definitions that formalize the notion of smooth parametric models and regularity.

In order to arrive at a theory of efficiency of estimation in smooth parametric models, we will have to introduce two new concepts. The first concept characterizes a class of estimators over which asymptotic optimality is achieved.


Definition 2.9 (Regularity). Let Θ ⊆ Rk be open. A sequence of estimators ˆθn for the parameter θ is said to be regular at θ if, for all h ∈ Rk,

    \sqrt{n}\Big(\hat{\theta}_n - \big(\theta + \tfrac{h}{\sqrt{n}}\big)\Big) \underset{\theta + h/\sqrt{n}}{\rightsquigarrow} L_\theta,   (2.17)

independent of h.

In other words, regularity describes the property that the convergence of the estimates ˆθn to a limit distribution is insensitive to n-dependent parameter perturbations of size h/√n. It is important to realize that some estimators of interest, such as shrinkage estimators, are not regular. In the following example, we discuss the asymptotic properties of Hodges' estimator, which does not belong to the class of regular estimators.

Example 2.10 (Hodges' estimator). Let P = {N(θ, 1) : θ ∈ R} be a parametric model, consisting of normal distributions with expectation θ and known variance equal to one. A natural estimator ˆθn for the parameter of interest θ is the sample mean. Hodges' estimator Hn is defined as

    H_n = \begin{cases} \hat{\theta}_n & \text{if } |\hat{\theta}_n| \geq n^{-1/4}, \\ 0 & \text{if } |\hat{\theta}_n| < n^{-1/4}. \end{cases}   (2.18)

We will employ the properties of ˆθn to learn more about the asymptotic behaviour of Hn. First of all, denote Zn = √n(ˆθn − θ) and consider θ ≠ 0. It follows that

    P\big(|\hat{\theta}_n| < n^{-1/4}\big) = P\big(-n^{-1/4} < \hat{\theta}_n < n^{-1/4}\big)
        = P\big(\sqrt{n}(-n^{-1/4} - \theta) < Z_n < \sqrt{n}(n^{-1/4} - \theta)\big).

For both θ > 0 and θ < 0 this probability converges to 0 as n → ∞. Hence, the limit behaviour of Hn is the same as that of ˆθn for every θ ≠ 0. On the other hand, for θ = 0 it follows that

    P\big(|\hat{\theta}_n| \geq n^{-1/4}\big) = P\big(\{\hat{\theta}_n \geq n^{-1/4}\} \cup \{\hat{\theta}_n \leq -n^{-1/4}\}\big)
        = P\big(\{Z_n \geq n^{1/4}\} \cup \{Z_n \leq -n^{1/4}\}\big)
        \leq P\big(Z_n \geq n^{1/4}\big) + P\big(Z_n \leq -n^{1/4}\big).

By definition of Zn, it is apparent that the last expression converges to zero as n → ∞. Hence, the preceding displays give an accurate representation of the limiting behaviour of Hn under various θ ∈ Θ. In particular, we conclude that

    \sqrt{n}(H_n - \theta) \underset{\theta}{\rightsquigarrow} N(0, 1) \quad \text{if } \theta \neq 0;   (2.19)

    \sqrt{n}(H_n - \theta) \underset{0}{\rightsquigarrow} 0 \quad \text{if } \theta = 0.   (2.20)

At first glance, Hodges' estimator Hn seems to be an improvement on the maximum likelihood estimator ˆθn. However, if we consider the risk function θ 7→ Eθ(Hn − θ)2, arbitrarily large peaks appear close to 0 as n → ∞. Hence, we deduce that Hn has better asymptotic performance at 0 at the expense of worse behaviour for n-dependent points close to zero. It can also be shown that Hodges' estimator is not regular and is therefore not contained in the class of regular estimators. This example shows that a definition of what constitutes asymptotic efficiency is apparently not as straightforward as it might seem.

Hodges' estimator is the first well-known example of a so-called superefficient estimator, i.e., an estimator that attains a smaller asymptotic variance than efficient regular estimators at some parameter points.
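The peaks in the risk function near zero can be visualized numerically. The following Python sketch (our own illustration, not from the thesis) approximates the rescaled risk n·Eθ(Hn − θ)² by Monte Carlo; it stays near 1 at fixed θ away from the origin and vanishes at θ = 0, but blows up at n-dependent points of order n^(-1/4).

```python
import numpy as np

rng = np.random.default_rng(2)

def hodges_risk(theta, n, reps=20_000):
    """Monte Carlo estimate of the rescaled risk n * E_theta (H_n - theta)^2."""
    # The sample mean has exact law N(theta, 1/n), so we sample it directly.
    theta_hat = rng.normal(loc=theta, scale=1.0 / np.sqrt(n), size=reps)
    H_n = np.where(np.abs(theta_hat) >= n ** (-0.25), theta_hat, 0.0)
    return n * np.mean((H_n - theta) ** 2)

for n in (100, 1_000, 10_000):
    # rescaled risk: ~0 at theta = 0, ~1 at a fixed theta, but growing with n
    # at the n-dependent point theta = n^{-1/4}.
    print(n, hodges_risk(0.0, n), hodges_risk(0.5, n), hodges_risk(n ** (-0.25), n))
```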

Apart from regularity conditions on estimators, we will impose important smoothness conditions on the parametric models as well. More precisely, we will require models to be locally asymptotically normal in the following sense.

Definition 2.11 (Local asymptotic normality). Let Θ ⊆ Rk be open and assume that the parametric model P = {Pθ : θ ∈ Θ} is dominated by a σ-finite measure µ with corresponding densities pθ. The model is said to be locally asymptotically normal at θ0 if, for any converging sequence hn → h,

    \log \prod_{i=1}^{n} \frac{p_{\theta_0 + h_n/\sqrt{n}}}{p_{\theta_0}}(X_i) = h^T \Gamma_{n,\theta_0} - \tfrac{1}{2} h^T I_{\theta_0} h + o_{P_{\theta_0}}(1),   (2.21)

for random vectors Γn,θ0 such that \Gamma_{n,\theta_0} \underset{\theta_0}{\rightsquigarrow} N_k(0, I_{\theta_0}).

In Van der Vaart (1998), it is shown that local asymptotic normality can be achieved under another condition, namely differentiability in quadratic mean.

Definition 2.12 (Differentiable in quadratic mean). Let Θ ⊆ Rk be open and assume that the parametric model P = {Pθ : θ ∈ Θ} is dominated by a σ-finite measure µ with corresponding densities pθ. The model is said to be differentiable in quadratic mean at θ0 if there exists a measurable score function ℓ̇θ0 such that

    \int \Big( \sqrt{p_{\theta_0 + h}} - \sqrt{p_{\theta_0}} - \tfrac{1}{2} h^T \dot{\ell}_{\theta_0} \sqrt{p_{\theta_0}} \Big)^2 d\mu = o(\|h\|^2),   (2.22)

as h → 0.

Moreover, in some models, differentiability in quadratic mean can be achieved under conditions that are easier to verify. The next lemma will prove to be particularly useful for the so-called symmetric location model.

Lemma 2.13. Let Θ ⊆ Rk be open and let pθ be a µ-probability density. Assume that the map θ 7→ √pθ(x) is continuously differentiable for every x. Furthermore, if the elements of the matrix Iθ = ∫ (ṗθ/pθ)(ṗθ/pθ)T pθ dµ(x) are well defined and continuous in θ, then the model is differentiable in quadratic mean at θ, with score function ℓ̇θ = ṗθ/pθ.

In fact, throughout this thesis we will only need the one-dimensional version of the preceding lemma. The properties of regularity and local asymptotic normality come together through the following theorem, which serves as a foundation for the convolution theorem.

Theorem 2.14. Let Θ ⊆ Rk be open and let P = {Pθ : θ ∈ Θ} be locally asymptotically normal at θ0 with non-singular Fisher information Iθ0. Furthermore, let ˆθn be regular estimates in the models {Pθ0+h/√n : h ∈ Rk}. Then there exists a randomized statistic T in the normal location model {Nk(h, Iθ0−1) : h ∈ Rk} such that T − h ∼ Lθ0, for all h ∈ Rk.

In fact, this theorem provides regular estimates with a limit in the form of a statistic in a model in which the only parameter is the location of a normal distribution. In other words, the weak limit distribution that describes the local asymptotic behaviour of ˆθn under Pθ0+h/√n equals the distribution of T under h, for every h ∈ Rk. In Van der Vaart (1998), it is shown that this theorem also gives rise to Hájek's convolution theorem.

Theorem 2.15 (Convolution theorem). Let Θ ⊆ Rk be open and let P = {Pθ : θ ∈ Θ} be locally asymptotically normal at θ0 with non-singular Fisher information Iθ0. Furthermore, let ˆθn be regular estimates with limit distribution Lθ0. Then there exists a probability distribution Mθ0 such that

    L_{\theta_0} = N_k(0, I_{\theta_0}^{-1}) \star M_{\theta_0}.   (2.23)

In particular, if Lθ0 has covariance matrix Σθ0, then this matrix satisfies Σθ0 ≥ Iθ0−1.

Consequently, the notion of asymptotic optimality among regular estimators reduces to minimality of the covariance matrix Σθ0. This yields the following definition of so-called best-regular estimates for θ0.

Definition 2.16 (Best-regularity). A regular estimator is called best-regular if the covariance matrix Σθ0 equals Iθ0−1. Regular estimators for which this property holds are also called (asymptotically) efficient.

Finally, we state a theorem that characterizes the efficiency of regular estimates ˆθn in terms of a weakly converging sequence and asymptotic linearity in the score function.

Theorem 2.17 (Asymptotic linearity). Let Θ ⊆ Rk be open and let P = {Pθ : θ ∈ Θ} be locally asymptotically normal. Then the estimators ˆθn for θ are best-regular if and only if, for all θ ∈ Θ,

    \sqrt{n}(\hat{\theta}_n - \theta) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} I_\theta^{-1} \dot{\ell}_\theta(X_i) + o_{P_\theta}(1).   (2.24)
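As a concrete illustration of this characterization (our own worked example, not part of the thesis), consider the normal location model P = {N(θ, 1) : θ ∈ R} of example 2.10. There pθ(x) = φ(x − θ), so the score and Fisher information are

    \dot{\ell}_\theta(x) = x - \theta, \qquad I_\theta = E_\theta (X - \theta)^2 = 1,

and the sample mean ˆθn = X̄n satisfies (2.24) exactly, with zero remainder:

    \sqrt{n}(\bar{X}_n - \theta) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} (X_i - \theta) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} I_\theta^{-1} \dot{\ell}_\theta(X_i).

Hence the sample mean is best-regular in this model, whereas Hodges' estimator of example 2.10 violates (2.24) at θ = 0.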

In the upcoming section we extend the parametric notion of efficiency to semiparametric models. The strategy for finding efficient estimates for θ is mainly based on the theory of parametric efficiency.


2.3. Semiparametric Efficiency

The desire to answer essentially parametric statistical questions without concessions that might lead to model misspecification is what motivates the use of semiparametric statistical models. Although it is possible to formulate more general models, we choose to parameterize the model distributions in terms of both a finite-dimensional parameter of interest θ ∈ Θ, for Θ ⊆ Rk open, and an infinite-dimensional nuisance parameter η ∈ H. Semiparametric efficiency bounds are of great importance, because such bounds quantify the efficiency loss that can result from a semiparametric rather than a parametric approach. Throughout this section, we will extend the theory of parametric efficiency to the theory of semiparametric efficiency. As usual, we assume that the true distribution P0 is contained in the identifiable model, implying the existence of a unique (θ0, η0) ∈ Θ × H such that P0 = Pθ0,η0. Furthermore, we assume that the model is dominated by a σ-finite measure µ with corresponding densities pθ,η. Throughout this section, we reflect on the arguments given in Van der Vaart (1998) in a slightly simplified manner. It is important to realize that in some semiparametric problems, the constructions put forward in this section are difficult to apply. However, the applicability of the constructions is greatly simplified if the existence of so-called least-favourable submodels can be verified; such submodels do not exist in general.

The strategy for finding efficient estimators in semiparametric models builds on the theory of parametric efficiency: the idea is to take infima over information bounds of smooth parametric submodels. In this section, we will ensure that this collection of submodels is rich enough to capture the sharp information bound for regular semiparametric estimates.

Definition 2.18 (Parametric submodel). Let P = {Pθ,η : θ ∈ Θ, η ∈ H} denote a semiparametric model and let U be an open neighbourhood of θ0. Consider maps γ : U → P : τ 7→ Pτ, such that Pτ = Pθτ,ητ for some (θτ, ητ) ∈ Θ × H and Pτ=θ0 = P0. Then, we call the set P0 = {γ(τ) : τ ∈ U} a parametric submodel of P.

By definition of the map γ : U → P : τ 7→ Pτ, it is apparent that P0 is

finite-dimensional and contained in the semiparametric model P. Since our main goal is to obtain true, sharp information bounds for estimation problems in semiparametric models as infima over information bounds of smooth parametric submodels, it makes sense to introduce the following definition.

Definition 2.19 (Smooth parametric submodel). Let P0 be a parametric submodel of the dominated semiparametric model P. Suppose that there exists a measurable score function ℓ̇ with finite Fisher information Iθ0 and P0 ℓ̇ = 0, such that

    \log \prod_{i=1}^{n} \frac{p_{\theta_0 + h_n/\sqrt{n}}}{p_0}(X_i) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} h^T \dot{\ell}(X_i) - \tfrac{1}{2} h^T I_{\theta_0} h + o_{P_0}(1),   (2.25)

for every converging sequence hn → h. Then P0 is called a smooth parametric submodel of P.


For simplicity, we temporarily restrict ourselves to one-dimensional submodels. Note that smooth parametric submodels, as defined above, are in fact locally asymptotically normal at θ0 (see definition 2.11). In this way, we obtain a collection P of smooth parametric submodels, all of which pass through the point Pθ0,η0. Consequently, we can introduce the following set consisting of all score functions corresponding to the collection P of smooth parametric submodels, namely

    \{ \dot{\ell} \in L_2(P_0) : P_0 \in P \}.   (2.26)

Because P0 ℓ̇2 is automatically finite, the set in the preceding display can be identified with a subset of the space L2(P0). This set may not be closed, but it is possible to show that there exists a unique ℓ̃P in the L2(P0)-closure such that

    \tilde{I}_P = P_0 \tilde{\ell}_P^2 = \inf_{\{\dot{\ell} : P_0 \in P\}} P_0 \dot{\ell}^2.   (2.27)

We will refer to ĨP and ℓ̃P as the P-efficient Fisher information and the P-efficient score function for θ0 at Pθ0,η0 respectively, relative to P. In the upcoming theorems, it will become clear that the nomenclature regarding ĨP and ℓ̃P is justified and that ĨP indeed forms a lower bound in the semiparametric version of the convolution theorem if the collection P is rich enough. Before stating the semiparametric convolution theorem, we have to extend the concept of regular estimates to a semiparametric setting. For this purpose, let P denote a collection of smooth parametric submodels and consider the following definition.

Definition 2.20 (Semiparametric regularity). Let P be a collection of smooth submodels and let ˆθn be a sequence of estimates for θ0. We call ˆθn regular with respect to P if ˆθn is regular for θ0 in all smooth submodels P0 ∈ P (cf. definition 2.9).

As a consequence of this definition, we can find Fisher information bounds in every smooth submodel P0 ∈ P separately. This implies that, for a sequence ˆθn regular with respect to P, a convolution theorem can be formulated in terms of the P-efficient Fisher information ĨP.

Theorem 2.21 (Semiparametric convolution theorem). Let P = {Pθ,η : θ ∈ Θ, η ∈ H} be a semiparametric model and define the tangent set as {c ℓ̇ : c ∈ R≥0, P0 ∈ P} ⊆ L2(P0). Then, for any ˆθn regular for θ0 with respect to P, the asymptotic covariance matrix is bounded below by ĨP−1. Moreover, if the tangent set is a convex cone, then every limit distribution Lθ0 can be written as Lθ0 = N(0, ĨP−1) ⋆ M, for some probability measure M.

In order to obtain the true infimal Fisher information, it is important to require that the collection P of smooth parametric submodels is rich enough. If this is not the case, the resulting P-efficient Fisher information matrix may not be truly infimal and may not give rise to a sharp bound on the asymptotic variance of semiparametric regular estimators. This problem does not arise if we extend P to contain all possible smooth parametric submodels containing P0. Hence, this important observation gives rise to the following definition.


Definition 2.22 (Efficient Fisher information). Let P contain all smooth parametric submodels. In this case, the P-efficient Fisher information and the P-efficient score function relative to P are referred to as the efficient Fisher information Ĩθ0,η0 and the efficient score function ℓ̃θ0,η0 respectively.

In practice, it is often difficult to handle the maximal collection of all smooth parametric submodels, but in some situations it is possible to find a single smooth parametric submodel P0∗ for which the Fisher information equals the efficient Fisher information. If such a submodel P0∗ exists, it is called a least-favourable or hardest submodel. However, in practice such least-favourable submodels rarely exist and in many semiparametric estimation problems it is not possible to construct them. In order to extend the parametric notion of best-regularity to smooth semiparametric models, we introduce the following definition for best-regularity in a semiparametric setting.

Definition 2.23 (Best-regularity). Let P be a collection of smooth submodels for P0 and let ĨP denote the corresponding P-efficient Fisher information. If the sequence ˆθn is asymptotically normal with asymptotic covariance matrix ĨP−1, with Ĩθ0,η0 = ĨP, then ˆθn is called best-regular. Regular estimators for which this property holds are also called (asymptotically) efficient.

Finally, we will state a theorem that characterizes the best-regularity of semiparametric estimation procedures in terms of a weakly converging sequence and asymptotic linearity in the efficient score function. Recall that a similar result holds in the theory of parametric asymptotic efficiency, see theorem 2.17.

Theorem 2.24 (Semiparametric asymptotic linearity). Let P be a semiparametric model and let ˆθn be a sequence of regular estimates with respect to P. The estimates ˆθn for θ0 are best-regular if and only if

    \sqrt{n}(\hat{\theta}_n - \theta_0) = \frac{1}{\sqrt{n}} \sum_{i=1}^{n} \tilde{I}_{\theta_0,\eta_0}^{-1} \tilde{\ell}_{\theta_0,\eta_0}(X_i) + o_{P_0}(1).   (2.28)

In the upcoming section we will treat a special case of asymptotic efficiency, namely adaptivity. In adaptive models it is possible to construct semiparametric inference procedures that have the same asymptotic properties as the optimal parametric inference procedures.

2.4. Adaptivity

In general, enlarging a parametric model to a semiparametric model induces an efficiency loss. On the other hand, the semiparametric approach has the obvious advantage that it is less likely that the model is misspecified. In adaptive models it is possible to construct semiparametric inference procedures that have the same asymptotic properties as the efficient parametric procedures. In this respect, extending a parametric model to a semiparametric model can be achieved at no asymptotic efficiency loss. Therefore, from an asymptotic point of view, the estimation problems in the semiparametric and the parametric model are equally difficult.

Prior to treating the construction of adaptive estimates, we impose some regularity conditions on our semiparametric models, based on Bickel (1982). As usual, the models for which we discuss adaptivity will be described by a Euclidean parameter θ of interest, and a nuisance parameter η, i.e.,

P = {Pθ,η : θ ∈ Θ, η ∈ H}. (2.29)

We assume that the map (θ, η) 7→ Pθ,η is known and, for each ν ∈ H separately, we

define the parametric model

Pν = {Pθ,ν : θ ∈ Θ}. (2.30)

Throughout this section on adaptive estimation, we assume that Pν satisfies condition

UR1, for each ν ∈ H separately.

Condition UR1. Consider the parametric model P = {Pθ : θ ∈ Θ}, with Θ an open subset of Rk and densities pθ with respect to some σ-finite measure µ. We say condition UR1 holds if, for all θ ∈ Θ:

(i) The function `θ = log pθ is differentiable in the components of θ a.e. Pθ, with

derivative denoted by ˙`θ = (∂`/∂θ1, · · · , ∂`/∂θk);

(ii) The Fisher information matrix I(θ) exists and is finite;

(iii) The model is differentiable in quadratic mean, uniformly in some neighbourhood U of θ, and

    Pθ+h({pθ(Xi) = 0}) = o(‖h‖2),   (2.31)

as h → 0.

(iv) There exist estimates ˜θn of θ such that √n(˜θn − θ) = OPθ(1).

The following definition simplifies the notion of non-singularity of the Fisher information.

Definition 2.25 (Regularity). Let P = {Pθ : θ ∈ Θ} be a parametric model. We call a point θ ∈ Θ regular if I(θ) is non-singular and I(·) is continuous at θ.

In particular, we call a point (θ, ν) regular if θ is regular in the parametric model Pν.

Moreover, we write I−1(θ, ν) for the inverse of the Fisher information matrix in the parametric model Pν at θ. Consequently, we can state the following important definition of adaptivity.

Definition 2.26 (Adaptivity). A sequence of estimates ˆθn is adaptive if and only if, for every regular (θ, η),

    \sqrt{n}(\hat{\theta}_n - \theta_n) \rightsquigarrow N(0, I^{-1}(\theta, \eta)),   (2.32)

under Pθn,η, whenever √n(θn − θ) stays bounded.


In other words, in adaptive models it is possible to construct semiparametric procedures that have the same properties as efficient parametric procedures. It is also apparent that adaptive estimates, if they exist in the first place, are efficient for every parametric model Pν separately. For the explicit construction of adaptive estimators we will use so-called discretized estimators, introduced by Le Cam (1960).

Definition 2.27 (Discretized estimator). Assume that the parameter space Θ is k-dimensional and let ˆθn be √n-consistent estimates of θ, i.e., such that √n(ˆθn − θ) = OPθ(1). If ˆθn only takes values in the grid Gc = {cn−1/2 l : l ∈ Zk} ∩ Θ for some positive c ∈ R, then we call ˆθn a √n-consistent discretized estimator.

We will frequently use ordinary √n-consistent estimates ˜θn to construct discretized estimates, by taking the point in the grid Gc closest to ˜θn. In order to avoid confusion, we will from now on refer to this estimator as the discretized estimator and denote it by ¯θn. One of the main advantages of using √n-consistent discretized estimates is the property that they can be treated as non-stochastic sequences in various proofs.

For simplicity, we will also introduce the following set ℋ, containing the possible score functions:

    ℋ = {h | h : Rk × Θ → Rk and ∫ h(x, θ) Pθ,η(dx) = 0, for all θ ∈ Θ, η ∈ H}.

For convenience, we also introduce the notation

    λ̃(x; θ, η) = ℓ̇(x; θ, η) I−1(θ, η).   (2.33)

Note that the inverse I−1(θ, η) is well-defined, since we only need λ̃ for θ such that I(θ, η) is non-singular. Besides condition UR1, two further conditions, E and S, will be important; they are defined as follows.

Condition E. For every regular (θ, η), there exist consistent estimates ˜θn such that √n(˜θn − θ) = OPθ,η(1).

Condition S. There exists a sequence of maps ˆ˜λn : (Rk)n → ℋ such that, for every regular point (θ, η),

    ∫ |ˆ˜λn(x, θn; X1, . . . , Xn) − λ̃(x; θn, η)|2 Pθn,η(dx) = oPθ,η(1),

whenever √n(θn − θ) stays bounded.

In Bickel (1982), it is motivated that these conditions are sufficient for the existence of adaptive estimates. We are now ready to prove the existence of adaptive estimates under conditions E and S.


Theorem 2.28 (Adaptivity under S). If condition E and condition S hold, then θ is adaptively estimated by

    \hat{\theta}_n = \bar{\theta}_n + \frac{1}{n - m} \sum_{i=m+1}^{n} \hat{\tilde{\lambda}}(X_i, \bar{\theta}_n; X_1, \ldots, X_m),

with ¯θn as in definition 2.27.

It is also possible to verify the existence of adaptive estimates under slightly different assumptions. More specifically, we will ensure that condition S* holds and show the equivalence with the original conditions.

Condition S*. There exists a sequence of maps ˆ˙`n : (Rk)n → ℋ such that, for every regular point (θ, η),

    ∫ |ˆ˙`n(x, θn; X1, . . . , Xn) − ˙`(x; θn, η)|2 Pθn,η(dx) = oPθ,η(1),

whenever √n(θn − θ) stays bounded. Furthermore, there exist estimates ˆIn(X1, . . . , Xn) of I(θ, η) that are consistent for all regular (θ, η).

These alternative assumptions in combination with theorem 2.28 yield the following important theorem on adaptive estimation.

Theorem 2.29 (Adaptivity under S*). If condition E and condition S* hold, then θ is adaptively estimated by

    \hat{\theta}_n = \bar{\theta}_n + \frac{1}{n - m} \sum_{i=m+1}^{n} \hat{\dot{\ell}}(X_i, \bar{\theta}_n; X_1, \ldots, X_m)\, \hat{I}_n^{-},

with ¯θn as in definition 2.27.
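The structure of the estimator in theorems 2.28 and 2.29 (discretize a √n-consistent preliminary estimate, then apply a one-step correction on held-out observations using an estimated score and estimated Fisher information) can be sketched in a few lines of Python. The sketch below is our own illustration for the symmetric location model; for brevity the nonparametric score estimate ˆ˙`n is replaced by a plug-in of the true logistic score, so it demonstrates only the one-step update, not the full adaptive construction.

```python
import numpy as np

rng = np.random.default_rng(3)

def one_step_adaptive(sample, score_estimate, c=1.0, m=None):
    """One-step construction in the spirit of theorem 2.29: discretize a
    sqrt(n)-consistent preliminary estimate, then correct it with an estimated
    score and estimated Fisher information, using sample splitting."""
    n = len(sample)
    m = m or n // 2
    # sqrt(n)-consistent preliminary estimate (here: the sample median),
    # discretized on the grid G_c = {c * l / sqrt(n) : l integer}.
    theta_prelim = np.median(sample)
    theta_bar = np.round(theta_prelim * np.sqrt(n) / c) * c / np.sqrt(n)
    # In the full construction the score estimate is built from sample[:m];
    # here it is simply given. Score and information from the second half.
    scores = score_estimate(sample[m:] - theta_bar)
    I_hat = np.mean(scores ** 2)
    # One-step (Newton-type) update based on the remaining n - m observations.
    return theta_bar + np.mean(scores) / I_hat

# Toy run: symmetric logistic errors; the location score of the logistic
# density is tanh(x/2), with Fisher information 1/3.
logistic_score = lambda x: np.tanh(x / 2.0)
sample = 1.5 + rng.logistic(size=5_000)
print(one_step_adaptive(sample, logistic_score))
```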

Once suitable score function estimates ˆ˙`n are found, it is relatively easy to construct consistent estimates ˆIn for I(θ, η). The following theorem gives mild conditions under which I(θ, η) can be estimated consistently for all regular (θ, η).

Theorem 2.30 (Consistent estimates of the Fisher information). Let ˆ˙` be estimates for ˙` satisfying condition S* and assume that condition E holds. Furthermore, assume that for all regular (θ, η) we have

    \frac{1}{n} \sum_{i=1}^{n} \dot{\ell}^T\dot{\ell}(X_i; \theta_n, \eta) - I(\theta, \eta) = o_{P_{\theta,\eta}}(1),   (2.34)

whenever √n(θn − θ) stays bounded. Then, I(θ, η) is consistently estimated by

    \hat{I}_n = \frac{1}{n - m} \sum_{i=m+1}^{n} \hat{\dot{\ell}}^T\hat{\dot{\ell}}(X_i, \bar{\theta}_n; X_1, \ldots, X_m),

with ¯θn as in definition 2.27.


Proof. For this proof, we use a trick of Le Cam and take ¯θn to be deterministic, with √n(θn − θ) = O(1). By condition S*, it follows that

    E_{\theta_n,\eta}\Big[\, \Big| \frac{1}{n-m} \sum_{i=m+1}^{n} \hat{\dot{\ell}}^T\hat{\dot{\ell}}(X_i, \theta_n; X_1, \ldots, X_m) - \dot{\ell}^T\dot{\ell}(X_i; \theta_n, \eta) \Big| \;\Big|\; X_1, \ldots, X_m \Big]
        \leq E_{\theta_n}\Big[\, \big| \hat{\dot{\ell}}^T\hat{\dot{\ell}}(X_{m+1}, \theta_n; X_1, \ldots, X_m) - \dot{\ell}^T\dot{\ell}(X_{m+1}; \theta_n, \eta) \big| \;\Big|\; X_1, \ldots, X_m \Big]
        = o_{P_{\theta_n,\eta}}(1).

This directly implies that

    \frac{1}{n-m} \sum_{i=m+1}^{n} \hat{\dot{\ell}}^T\hat{\dot{\ell}}(X_i, \theta_n; X_1, \ldots, X_n) - \dot{\ell}^T\dot{\ell}(X_i; \theta_n, \eta) = o_{P_{\theta_n,\eta}}(1).

First of all, note that, by local asymptotic normality and lemma A.17, the measures Pθn,η and Pθ,η are mutually contiguous. By contiguity and condition (2.34), it is apparent that ˆIn is consistent in Pθn,η-probability. Once again by contiguity, it follows that ˆIn is also consistent for estimating I(θ, η) under Pθ,η. Therefore, the estimates ˆIn consistently estimate I(θ, η), for every regular (θ, η).

In this thesis, we will apply the preceding results to the linear regression model with symmetric errors. If one wants to deal with more general models in which identifiability issues occur, an extension of our theory is needed. However, we will deliberately not treat this generalization of adaptive estimation, because these results are not applicable to the examples we discuss.

2.5. Symmetric Location

The best known model under which adaptivity occurs is undoubtedly the symmetric location model. The symmetric location model is often used when the researcher is only interested in estimating the center of symmetry of some unknown symmetric distribution. It is often sensible not to make any further assumptions about the unknown symmetric distribution, in order to avoid model misspecification. We briefly discuss the main properties of the semiparametric symmetric location model and review some important results. In the next chapter, we give an extensive derivation of adaptive estimates for the more general linear regression model with symmetric errors.

Assume that the symmetric location model P = {Pθ,η : θ ∈ Θ, η ∈ H} is dominated by some σ-finite measure µ. The center of symmetry θ is the parameter of interest and the parameter space Θ equals R. For the nuisance space H, consider all probability distributions η on R with continuously differentiable µ-densities gη, symmetric about zero with finite Fisher information for θ, i.e.,

    I(\eta) = E\Big( \frac{\partial}{\partial\theta} \log g_\eta(X - \theta) \Big)^2 = \int \Big( \frac{g_\eta'(x)}{g_\eta(x)} \Big)^2 g_\eta(x)\, dx < \infty.


Moreover, we will assume that the square root √gη is continuously differentiable for all x ∈ R. Note that H is an infinite-dimensional set and that P is a semiparametric model in a strict sense. As usual, write Pθ,η for the unique probability measure corresponding to the pair (θ, η). It is apparent that Pθ,η corresponds to the unique µ-probability density x 7→ gη(x − θ) as well, which we will denote by pθ,η.
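As a concrete example of an admissible nuisance parameter (ours, not taken from the thesis), take for gη the standard logistic density g(x) = e^{−x}/(1 + e^{−x})^2, which is symmetric about zero and continuously differentiable. Writing F(x) = (1 + e^{−x})^{−1} for its distribution function, g′(x)/g(x) = 1 − 2F(x), so that

    I(η) = ∫ (1 − 2F(x))^2 g(x) dx = ∫_0^1 (1 − 2u)^2 du = 1/3 < ∞,

since F(X) is uniformly distributed on [0, 1]. For logistic errors an adaptive estimator of the center of symmetry therefore attains asymptotic variance I(η)^{−1} = 3, whether or not the logistic form of the error density is known.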

We have deliberately restricted ourselves to nuisance probability measures η for which the µ-density gη is continuously differentiable, with I(η) < ∞. Namely, continuous differentiability implies absolute continuity and Bickel (1982) states that this ensures the regularity of all (θ, η) in the parameter space Θ × H. In fact, Hájek (1962) has shown that (θ, η) is regular if and only if I(η) < ∞. In order to motivate that the symmetric location model is adaptive, we commence with the verification of the conditions stated in UR1. As usual, let Pν = {Pθ,ν : θ ∈ Θ} denote the parametric model consisting of all probability distributions for fixed ν ∈ H. In order to ensure adaptivity, it is required that each parametric model Pν ⊂ P satisfies condition UR1 separately.

First of all, by definition of H, it is apparent that log pθ,ν is differentiable in θ, a.e. Pθ,ν. The second condition is also easily verified by assuming that I(ν) < ∞, for all ν ∈ H. In order to verify the third condition, note that Θ is clearly an open subset of R. Apart from this, also note that Iθ is constant, hence well defined and continuous in θ. These two statements, in combination with the assumption that the map θ 7→ √pθ,ν(x) is continuously differentiable for every x, imply, by lemma 2.13, that

    \int \Big( \sqrt{p_{\theta+h,\nu}} - \sqrt{p_{\theta,\nu}} - \tfrac{1}{2} h\, \dot{\ell}_{\theta,\nu} \sqrt{p_{\theta,\nu}} \Big)^2 d\mu = o(\|h\|^2),   (2.35)

This result, in combination with

Pθ+h,ν({pθ,ν(Xi) = 0}) =

Z

{pθ,ν=0}

pθ+h,νdµ(x) = o(khk2), (2.36)

ensures that the third condition of UR1 is also satisfied. The fourth condition of UR1 is also relatively easy to verify. In fact, in Bickel (1982) it is claimed that Le Cam shows that the identifiability of θ and non-singularity of I(θ) imply the fourth condition. Consequently, for the symmetric location model all regularity conditions UR1 are satisfied. We will not derive an explicit expression of the adaptive estimator in the symmetric location model, but a derivation was obtained by Stone (1975). In fact, the symmetric location model can also be seen as a special case of the general linear regression model with symmetric errors, for which we will construct adaptive estimates in the upcoming chapter. Consequently, the symmetric location model is also adaptive and the point of symmetry of a symmetric distribution can be estimated equally well asymptotically, regardless of whether knowledge about the nuisance parameter is available or not.


3. Semiparametric Regression Models

Linear models, generalized linear models and non-linear models are all examples of parametric regression models. When using any of these models, we believe that the relationship between the response and the explanatory variables is fully known up to some finite-dimensional parameter. However, in many situations little about this relationship is known and the researcher might want to incorporate additional unknown, flexible and non-linear relationships between variables into the statistical regression analysis. In such cases, it is beneficial to base statistical inference on a semiparametric regression model instead. In this chapter, we will consider efficient estimating procedures for three popular generalizations of the parametric linear regression model.

3.1. Generalizations of Parametric Linear Regression

Recall that we have discussed the basic properties of the parametric linear regression model in the introduction. We have also pointed out various flaws of this simplistic approach and suggested linear regression with symmetric errors, errors-in-variables regression and partial linear regression as suitable generalizations. In each of those semiparametric regression models, the semiparametric component is presented in a different fashion: in the linear regression model with symmetric errors we extend the set H of error distributions, in the errors-in-variables model we introduce the semiparametric component in terms of the distribution of some unobserved variable Z and in the partial linear regression model the semiparametric component occurs as a smoothing component η instead. We will now take a closer look at the suggested generalizations of parametric linear regression.

Linear Regression with Symmetric Errors

In the parametric linear regression model, it is assumed that the errors εi are normally distributed. This assumption is a serious limitation and implies model misspecification in most applications. For this reason, we extend the collection H of possible error distributions for εi such that H contains all probability measures η for which the µ-density g is symmetric about zero. Moreover, we will impose the condition that g is absolutely continuous with finite Fisher information for θ, i.e.,

    I(\eta) = E\Big( \frac{\partial}{\partial\theta} \log g(X - \theta) \Big)^2 = \int \Big( \frac{g'(x)}{g(x)} \Big)^2 g(x)\, dx < \infty.

In this way we ensure that all parameters in the parameter space Θ × H are regular, cf. definition 2.25. Surprisingly, it will turn out that this semiparametric generalization of the parametric linear regression model is in fact adaptive. This means that the parameter of interest θ can be estimated equally well asymptotically, regardless of whether knowledge about the nuisance parameter is available or not.

Errors-in-Variables

In standard linear regression procedures, it is assumed that the explanatory variables are measured without error and that the errors inherent in the model are only associated with the dependent variables. However, the widely used linear random-design regression models suffer from a phenomenon called attenuation bias. Even though noise in the dependent variable Y is accounted for by the error, noise in the explanatory variable X biases the estimated slope towards zero: an exaggerated amount of noise in the explanatory variable X will blur any linear relationship between X and Y. For this reason, generalizations of the standard linear regression model have been suggested. An appropriate extension is to assume that the independent variables are also observed with errors, giving rise to the errors-in-variables model. In particular, errors-in-variables models are appropriate when all variables are experimentally observed. The use of the errors-in-variables model has proved to be fruitful in areas ranging from astrophysics to econometrics.
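The attenuation effect is easy to reproduce numerically. The following Python sketch (our own illustration; the true slope and noise levels are arbitrary choices) fits ordinary least-squares once on the noise-free regressor Z and once on the error-contaminated regressor X = Z + ε, showing the slope estimate shrink towards zero in the second case.

```python
import numpy as np

rng = np.random.default_rng(4)

n, slope = 50_000, 2.0
Z = rng.normal(size=n)                      # latent, noise-free regressor
Y = slope * Z + rng.normal(scale=0.5, size=n)
X = Z + rng.normal(scale=1.0, size=n)       # observed regressor with measurement error

def ols_slope(x, y):
    """Simple least-squares slope on mean-centred data."""
    x = x - x.mean()
    y = y - y.mean()
    return np.dot(x, y) / np.dot(x, x)

print(ols_slope(Z, Y))  # close to 2.0
# Regressing on the noisy X attenuates the slope by roughly Var(Z)/(Var(Z)+Var(eps)) = 0.5.
print(ols_slope(X, Y))  # close to 1.0
```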

In the errors-in-variables model, we consider observations (X_i, Y_i), which satisfy the relationship

X_i = Z_i + ε_i,
Y_i = f_θ(Z_i) + ξ_i,    (3.1)

where the Z_i are viewed either as unknown constants or independent identically distributed random variables and f_θ is a function indexed by some finite-dimensional parameter θ ∈ Θ. Furthermore, the i.i.d. errors (ε_i, ξ_i) are assumed to be independent of Z_i with mean zero. If we do not make any further assumptions about the distributions of Z and (ε, ξ), then this model is unidentifiable. In fact, in Kendall and Stuart (1979) it is even shown that the model is unidentifiable when we assume that Z is Gaussian and suppose that ε, ξ are also independent Gaussian random variables with unknown variances. However, under certain circumstances and the assumption that Z and ε, ξ are Gaussian random variables, it can be shown that the parameters are identifiable whenever:

(i) Either Var(ε) = c0 or Var(ξ) = c1, for known c0, c1 ∈ R;

(ii) The ratio Var(ε)/Var(ξ) = c, for known c ∈ R.

Furthermore, the model is also identifiable when the covariance matrix Σ of (ε, ξ) equals Σ = σ²Σ_0, where the matrix Σ_0 is known. In Reiersøl (1950), it is even shown that putting no restriction on Σ but requiring Z to be non-Gaussian ensures the identifiability of the errors-in-variables model. In particular, requiring Z ∈ [0, 1] and assuming that the errors are normally distributed also ensures that the model is identifiable. In the same paper, various statements are made about the identifiability when (ε, ξ) have no Gaussian component. However, we will only require that Σ = σ²Σ_0, for known Σ_0, or that Var(ε) is known, since we will be dealing with Gaussian errors exclusively. It will become clear that the errors-in-variables model is fundamentally different from standard regression procedures and that estimation problems require a different approach.
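
In the simple linear case f_θ(z) = θ_0 + θ_1 z with Gaussian errors and a known ratio Var(ξ)/Var(ε), as in condition (ii), the maximum likelihood slope estimate has a well-known closed form, often called Deming or orthogonal regression. The sketch below is our own illustration under exactly these assumptions, with hypothetical helper names; it is not the estimator derived later in this chapter:

    import numpy as np

    def deming_slope(x, y, delta):
        # slope MLE in the linear errors-in-variables model
        #   X = Z + eps,  Y = b0 + b1 * Z + xi,
        # with Gaussian errors and known ratio delta = Var(xi) / Var(eps)
        xbar, ybar = x.mean(), y.mean()
        sxx = np.mean((x - xbar) ** 2)
        syy = np.mean((y - ybar) ** 2)
        sxy = np.mean((x - xbar) * (y - ybar))
        d = syy - delta * sxx
        return (d + np.sqrt(d * d + 4.0 * delta * sxy * sxy)) / (2.0 * sxy)

    rng = np.random.default_rng(1)
    n, b1 = 10_000, 2.0
    Z = rng.normal(0.0, 1.0, n)
    X = Z + rng.normal(0.0, 1.0, n)        # Var(eps) = 1
    Y = b1 * Z + rng.normal(0.0, 1.0, n)   # Var(xi) = 1, so delta = 1

    print(deming_slope(X, Y, delta=1.0))   # close to 2.0, unlike naive OLS (about 1.0)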

Partial Linear Regression

The partial linear regression model offers more flexibility than the standard approach by introducing an additive influence from a random variable of largely unknown form. In particular, suppose that we observe a random sample from the distribution of X = (V, W, Y ) satisfying

Y = θV + η(W) + ε,    (3.2)

with some unobserved error ε, independent of (V, W). Thus, the dependent variable Y is a regression on (V, W) that is linear in V but may depend on W in a non-linear manner through η. It is often assumed that ε is normally distributed with mean zero. As usual, the parameter θ ∈ R is of interest, while the nuisance parameter η belongs to some infinite-dimensional function space H. In section 3.4 we impose additional smoothness conditions in order to obtain estimates for η that are sufficiently smooth and ensure that the model is identifiable.
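
To see how θ can be separated from the unknown component η, note that taking conditional expectations given W in (3.2) yields Y − E[Y | W] = θ(V − E[V | W]) + ε, so regressing the W-residuals of Y on those of V recovers θ. The sketch below is our own illustration of this double-residual idea with a Nadaraya-Watson smoother; the smoother, bandwidth, helper names and data-generating choices are all assumptions for the example and not the estimator constructed in section 3.4:

    import numpy as np

    def nw_smooth(w_eval, w, target, h):
        # Nadaraya-Watson estimate of E[target | W = w_eval] with a Gaussian kernel
        K = np.exp(-0.5 * ((w_eval[:, None] - w[None, :]) / h) ** 2)
        return (K * target[None, :]).sum(axis=1) / K.sum(axis=1)

    rng = np.random.default_rng(2)
    n, theta = 2_000, 1.5
    W = rng.uniform(0.0, 1.0, n)
    V = np.sin(2 * np.pi * W) + rng.normal(0.0, 1.0, n)              # V depends on W
    Y = theta * V + np.cos(2 * np.pi * W) + rng.normal(0.0, 1.0, n)  # eta(w) = cos(2 pi w)

    h = 0.05                                   # bandwidth, chosen ad hoc
    Y_res = Y - nw_smooth(W, W, Y, h)          # Y - E[Y | W]
    V_res = V - nw_smooth(W, W, V, h)          # V - E[V | W]

    theta_hat = np.sum(V_res * Y_res) / np.sum(V_res * V_res)
    print(theta_hat)                           # close to 1.5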

3.2. Linear Regression with Symmetric Errors

In this section, we will construct an adaptive estimator for the more general linear regression model with symmetric errors. Furthermore, we will compare our results to two other, more general, unidentifiable generalizations of the parametric linear regression model. For the results in this section, we will closely follow the notation and ideas put forward in Bickel (1982).

Linear Regression with Symmetric Errors

The linear regression model with symmetric errors is a generalization of both the parametric linear regression model and the symmetric location model. We will describe this model structurally in terms of the random vector Z = (X, Y), satisfying

Y = Xθ^T + ε,    (3.3)

with some unobserved error ε, independent of X. Consequently, our semiparametric model is of the form P = {P_{θ,η} : θ ∈ Θ, η ∈ H}, with Θ equal to R^k and the set H consisting of all distributions η symmetric about 0. We will restrict ourselves to nuisance probability measures η for which the µ-density g is absolutely continuous, with 0 < I(η) < ∞. Namely, in Bickel (1982) it is shown that (θ, η) is regular if and only if the µ-density g is absolutely continuous, with I(η) < ∞. We will also assume that ε and X are independent and that ε has distribution η, with η ∈ H. Furthermore, we will assume that E[X^T X] is non-singular. It is not necessary to specify the distribution of X because our construction holds for any distribution such that E[X^T X] is non-singular. It turns out that these conditions are sufficient for ensuring the existence of adaptive estimators. Recall that we will have to verify condition E and condition S* to ensure the existence of adaptive estimators. For this purpose, we will now proceed through the following steps:

(i) Find a suitable subset H_0 ⊂ H, consisting of possible score functions;

(ii) Construct √n-consistent estimates θ̃_n, satisfying condition E;

(iii) Construct suitable score function estimates ℓ̂˙_n, taking values in H_0 and satisfying condition S*;

(iv) Construct consistent estimates Î_n for I(θ, η), satisfying condition S*;

(v) Conclude that all conditions for adaptive estimation are satisfied and use theorem 2.29 to construct the adaptive estimator θ̂_n.

Step (i) Assume that the probability distribution of X is dominated by some σ-finite measure ν and let p be the density of X with respect to the dominating measure ν. Similarly, assume that η has density g with respect to a σ-finite measure µ. We can write the simultaneous density function of the vector Z = (X, Y) with respect to the product measure ν × µ, given (θ, η), as follows:

f(x, y; θ, η) = p(x) g(y − xθ^T).

Then the score function ℓ̇ in θ is equal to

ℓ̇(x, y; θ, η) = ∂/∂θ log[p(x) g(y − xθ^T)] = ∂/∂θ log p(x) + ∂/∂θ log g(y − xθ^T) = x (g'/g)(y − xθ^T).

By symmetry of the measure η, anti-symmetry of the function g'/g and the independence of X and ε we directly see that

E_{θ,η}[ℓ̇(X, Y; θ, η)] = E_{θ,η}[X g'(ε)/g(ε)] = E[X] E_η[g'(ε)/g(ε)] = 0.

Clearly, this means that the score function for θ lies in H_0 ⊂ H, consisting of functions h : R^n × Θ → R^n such that

h(x, y, θ) = x ψ(y − xθ^T),    (3.4)

with ψ a bounded and anti-symmetric function. It follows immediately that these functions form a subset of the set H according to the section on adaptivity. This subset H_0 will be essential for constructing score function estimates.
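
As a quick numerical illustration of why such functions are natural candidates (this check is ours and not part of the argument), any h of the form x ψ(y − xθ^T) with bounded anti-symmetric ψ has mean zero under a symmetric error distribution, here verified by Monte Carlo with ψ = tanh and Laplace errors:

    import numpy as np

    rng = np.random.default_rng(6)
    n, theta = 200_000, np.array([1.0, -2.0])

    X = rng.normal(size=(n, 2))              # design, independent of the errors
    eps = rng.laplace(0.0, 1.0, n)           # symmetric, non-Gaussian error
    Y = X @ theta + eps                      # model (3.3)

    psi = np.tanh                            # bounded and anti-symmetric
    h = X * psi(Y - X @ theta)[:, None]      # h(X, Y, theta) = X psi(Y - X theta^T)

    print(h.mean(axis=0))                    # both coordinates are close to zero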

Step (ii) In this step we show that condition E holds by constructing estimates θ̃_n such that √n(θ̃_n − θ) = O_{P_{θ,η}}(1), at all regular points (θ, η). Consider a function ψ : R → R that is twice continuously differentiable with bounded derivatives. Furthermore, assume that ψ is strictly increasing and anti-symmetric. Define θ̃_n as the Z-estimates that solve

Σ_{i=1}^n X_{ij} ψ(Y_i − X_i θ̃_n^T) = 0,    (3.5)

uniquely for j = 1, . . . , k. As suggested in Bickel (1982), by the arguments of Huber (1973), the estimates θ̃_n are √n-consistent for all regular (θ, η) and meet condition E.

Step (iii) We have deliberately restricted ourselves to nuisance probability measures for which the µ-density g is absolutely continuous, with I(η) < ∞. Namely, in Bickel (1982) it is stated that (θ, η) is regular if and only if the density g is absolutely continuous with derivative g', with I(η) < ∞. In other words, all points (θ, η) in the parameter space are regular. By adopting the same notation as was used in the preceding chapter, we calculate

λ̃(x, y; θ, η) = x (g'/g)(y − xθ^T) {E[X^T X] I(η)}^{-1},

with the last term being equal to I^{-1}(θ, η). We are almost ready to construct suitable score function estimates, but we first consider the following lemma.

Lemma 3.1. Let ε_1, ε_2, . . . be i.i.d. random variables with common distribution η and density g, with I(η) < ∞. Then there exists a sequence of functions q_n : R × R^n → R such that each q_n is bounded and

∫ ( q_n(x; ε_1, . . . , ε_n) − g'(x)/g(x) )² g(x) dx = o_P(1),    (3.6)

as n → ∞.
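
Lemma 3.1 only asserts that such a sequence q_n exists; the construction used in Bickel (1982) involves truncation and careful regularization. Purely as an illustration of the idea (the kernel-based construction, bandwidths and helper names below are our own assumptions, not the q_n of the lemma), one can estimate g and g' by a Gaussian kernel density estimator and use a bounded, truncated ratio, for which the integrated squared error in (3.6) shrinks as the sample grows:

    import numpy as np

    rng = np.random.default_rng(4)

    def q_n(x, eps, h, trunc=5.0):
        # bounded kernel-based estimate of g'(x)/g(x) built from the sample eps
        u = (x[:, None] - eps[None, :]) / h
        K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
        g_hat = K.mean(axis=1) / h
        gp_hat = (-u * K).mean(axis=1) / h ** 2
        return np.clip(gp_hat / np.maximum(g_hat, 1e-12), -trunc, trunc)

    # true error distribution: Laplace(0, 1), for which g'(x)/g(x) = -sign(x)
    x = np.linspace(-8.0, 8.0, 401)
    dx = x[1] - x[0]
    g = 0.5 * np.exp(-np.abs(x))
    true_score = -np.sign(x)

    for n, h in [(200, 0.5), (2_000, 0.3), (20_000, 0.2)]:
        eps = rng.laplace(0.0, 1.0, n)
        ise = np.sum((q_n(x, eps, h) - true_score) ** 2 * g) * dx   # criterion (3.6)
        print(n, ise)    # the integrated squared error typically decreases with n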

In order to construct suitable estimates for the score function ℓ̇, let the estimator θ̄_n denote Le Cam's discretized estimates (c.f. definition 2.27) based on the first n observations and define

ε̂_i = Y_i − X_i θ̄_n(Z_1, . . . , Z_n)^T.

These ε̂_i estimate the residuals, based on the discretized estimates θ̄_n from the first n observations. Furthermore, define

ψ_n(x; Z_1, . . . , Z_n) = ½ ( q_n(x; ε̂_1, . . . , ε̂_n) − q_n(−x; ε̂_1, . . . , ε̂_n) )

and define the functions ℓ̂˙_n as

ℓ̂˙_n(x, y, θ; Z_1, . . . , Z_n) = x ψ_n(y − xθ^T; Z_1, . . . , Z_n).

We directly see that the estimates ℓ̂˙_n belong to H_0, by the anti-symmetry of ψ_n. Therefore, these estimates appear to be good candidates for consistent score function estimates. In order to show that these estimates also satisfy condition S*, consider

∫ | ℓ̂˙_n(x, y, θ_n; Z_1, . . . , Z_n) − ℓ̇(x, y; θ_n, η) |² f(x, y; θ_n, η) µ(dy) ν(dx)
  = ∫ | x ψ_n(y − xθ_n^T; Z_1, . . . , Z_n) − x (g'/g)(y − xθ_n^T) |² f(x, y; θ_n, η) µ(dy) ν(dx)
  = ∫ ( ψ_n(y − xθ_n^T; Z_1, . . . , Z_n) − (g'/g)(y − xθ_n^T) )² g(y − xθ_n^T) x x^T p(x) µ(dy) ν(dx)
  ≤ ( ∫ ( q_n(y; ε̂_1, . . . , ε̂_n) − (g'/g)(y) )² g(y) µ(dy) ) E[X X^T].    (3.7)

We will now show that the last expression in the previous display converges to zero in probability, by using the assumption that E[XX^T] is finite. For this purpose, consider θ_n = θ + t_n such that ‖t_n‖ = O(n^{-1/2}). Furthermore, let x_1, . . . , x_n be vectors and also require that, for some M > 0 independent of n,

Σ_{i=1}^n x_i t_n^T t_n x_i^T < M.

In Bickel (1982) it is stated that, by Hájek and Šidák (1967), the product measure induced by ε_1, . . . , ε_n is mutually contiguous (c.f. definition A.15) with respect to the product measure induced by ε_1 − x_1 t_n^T, . . . , ε_n − x_n t_n^T, whenever I(η) < ∞. Moreover, ‖t_n‖ = O(n^{-1/2}) in combination with the finiteness of E[XX^T] yields

Σ_{i=1}^n X_i t_n^T t_n X_i^T = O_{P_{θ,η}}(1).

Hence, by lemma 3.1,

∫ | q_n(y; ε_1 − X_1 t_n^T, . . . , ε_n − X_n t_n^T) − (g'/g)(y) |² g(y) dy = o_{P_{θ,η}}(1).

By the structure of θ̄_n and its √n-consistency, we may replace ε_i − X_i t_n^T by ε̂_i for every 1 ≤ i ≤ n. Substituting this expression into (3.7) and using the preceding inequalities is enough to establish

∫ | ℓ̂˙_n(x, y, θ_n; Z_1, . . . , Z_n) − ℓ̇(x, y; θ_n, η) |² P_{θ_n,η}(dx, dy) = o_{P_{θ,η}}(1).

In order for condition S* to be fully satisfied, we will commence with the construction of consistent estimates for the Fisher information in the upcoming step.
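
Putting steps (ii) and (iii) together, a crude end-to-end sketch of the construction so far is given below. It is our own illustration under several simplifications: ψ = tanh in (3.5), a kernel-based q_n as in the previous sketch, and no discretization of θ̄_n or sample splitting, so it should not be read as the estimator analysed in Bickel (1982).

    import numpy as np
    from scipy.optimize import root

    rng = np.random.default_rng(5)
    n = 5_000
    theta = np.array([1.0, -2.0])
    X = rng.normal(size=(n, 2))                  # design with E[X^T X] non-singular
    Y = X @ theta + rng.laplace(0.0, 1.0, n)     # model (3.3) with Laplace errors

    # Step (ii): preliminary estimate solving (3.5), here with psi = tanh
    psi = np.tanh                                # bounded, increasing, anti-symmetric
    theta_tilde = root(lambda t: X.T @ psi(Y - X @ t), x0=np.zeros(2)).x

    # Step (iii): symmetrized estimate psi_n of the score g'/g, built from residuals
    res = Y - X @ theta_tilde

    def q_n(x, eps, h=0.3, trunc=5.0):
        # bounded kernel-based estimate of g'/g evaluated at the points in x
        u = (np.atleast_1d(x)[:, None] - eps[None, :]) / h
        K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
        g_hat = K.mean(axis=1) / h
        gp_hat = (-u * K).mean(axis=1) / h ** 2
        return np.clip(gp_hat / np.maximum(g_hat, 1e-12), -trunc, trunc)

    def psi_n(x):
        # anti-symmetrization, so that x * psi_n(y - x theta^T) lies in H_0
        return 0.5 * (q_n(x, res) - q_n(-x, res))

    print(theta_tilde)                      # preliminary estimate, close to (1, -2)
    print(psi_n(np.array([-1.0, 1.0])))     # roughly (1, -1) for Laplace errors

Steps (iv) and (v) would then combine these ingredients with a consistent estimate of the Fisher information into the one-step adaptive update of theorem 2.29.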
