
Tilburg University

Essays in econometric theory

Sadikoglu, Serhan

DOI: 10.26116/center-lis-1906
Publication date: 2019
Document version: Publisher's PDF, also known as Version of Record

Link to publication in Tilburg University Research Portal

Citation for published version (APA):

Sadikoglu, S. (2019). Essays in econometric theory. CentER, Center for Economic Research. https://doi.org/10.26116/center-lis-1906


Essays in Econometric Theory


Essays in Econometric Theory

Dissertation

submitted to obtain the degree of doctor at Tilburg University, under the authority of the rector magnificus, prof. dr. E.H.L. Aarts, to be defended in public before a committee appointed by the doctorate board in the aula of the University on Monday, 20 May 2019, at 10:00, by

SERHAN SADIKOĞLU


DOCTORAL COMMITTEE:

SUPERVISOR: prof. dr. B. Melenberg

CO-SUPERVISOR: dr. P. Čížek

OTHER MEMBERS: dr. A. Juodis


Acknowledgements

This thesis contains my research as a Ph.D. student at the Department of Econometrics and Operations Research at Tilburg University. It would not have been completed without the guidance and support that I received from several people. Here I would like to express my gratitude to some of them explicitly.

First and foremost, I would like to express my sincere appreciation to my supervisor Dr. Pavel Čížek, who introduced the field to me in the Econometric Theory course during my master studies at Tilburg University. I would like to thank him for his invaluable support, continuous guidance, and help throughout my studies at Tilburg University.

I am also grateful to my promotor Prof. Dr. Bertrand Melenberg and my committee members Dr. Artūras Juodis, Prof. Dr. Arthur van Soest, and Prof. Dr. Bas Werker for their valuable time and insightful comments and suggestions, which significantly improved this thesis.

My thanks also go out to all my colleagues at The Hague University of Applied Sciences for their support. I would like to extend my thanks to Rogier Busser for his helpful advice and encouragement to take another step in my career. Special thanks to my dear office mates in ST 2.95: Özcan Demir, Aurea Fernandez, Xiao Peng, Rishma Radjie, Arce-Helen Salazar, and Maaike Stuart-Sengers.

I am thankful to my dear friends for their support and friendship. Special thanks to Baran Düzce, Enis Gümüş, Korhan Nazlıben, Haki Pamuk and Suphi Şen.


Contents

Acknowledgements

1 Introduction

2 Misclassification-robust semiparametric estimation of single-index binary choice models
  2.1 Introduction
  2.2 Parametric indirect inference
  2.3 Indirect inference for the binary-choice model
    2.3.1 Semiparametric indirect inference
    2.3.2 Examples of auxiliary criteria
  2.4 Large sample properties
    2.4.1 Consistency of the feasible II estimator
    2.4.2 Asymptotic normality of the feasible II estimator
    2.4.3 Choice of auxiliary criterion
  2.5 Robustness to misclassification
  2.6 Monte Carlo simulations
    2.6.1 Simulation design
    2.6.2 Simulation results
  2.7 Extensions
    2.7.1 Single-index binary choice model with endogenous explanatory variables
    2.7.2 Single-index binary choice model with dependent data
  2.8 Conclusion
  2.A Proofs of the limit theorems
  2.C Verification of the technical assumptions
    2.C.1 Assumption 6(ii)
    2.C.2 Assumption 6(iii)
  2.D Technical lemmas

3 Plug-in estimation of threshold regression models
  3.1 Introduction
  3.2 Threshold regression model
    3.2.1 Integrated difference kernel estimator
    3.2.2 Estimation of the conditional mean
    3.2.3 Algorithm
  3.3 Asymptotic analysis
    3.3.1 Asymptotics of IDKE
    3.3.2 Asymptotics of the plug-in estimator
  3.4 Extensions
  3.5 Simulations
    3.5.1 Linear model
    3.5.2 Varying-coefficient model
    3.5.3 Single-index model
  3.6 Application
  3.7 Conclusion
  3.A Proofs of the asymptotic theorems

4 Nonseparable panel models with index structure and correlated random effects
  4.1 Identification
  4.2 Estimation approach
    4.2.1 Average and outer product of differences of gradients
    4.2.2 GMM
    4.2.3 Dimension selection
  4.3 Simulation study
    4.3.1 Implementation
    4.3.2 Sample selection model
    4.3.4 Binary-choice model
  4.4 Conclusion
  4.A Proof of Theorem 1
  4.B Proofs of auxiliary lemmas
  4.C Proofs of Theorems 2–4
  4.D Jackknife


Chapter 1

Introduction

The dissertation consists of three essays in the field of econometric theory. The first essay studies the semiparametric estimation of single-index binary choice models. The second essay addresses threshold regression models with dependent data. The third essay explores the semiparametric estimation of nonseparable panel data models with an index structure and correlated random effects. All essays are joint work with Pavel Čížek.

In Chapter 2, we explore the binary choice regression model under the assumption that the conditional probability of the response depends on the explanatory variables through a single index. We introduce a new class of semiparametric estimators for single-index binary choice models. The proposed estimators are based on semiparametric indirect inference, which suggests estimating the parameters of the model via possibly misspecified auxiliary criteria. The large class of considered auxiliary criteria includes the ordinary least squares, nonlinear least squares, and nonlinear least absolute deviations estimators. Besides deriving the consistency and asymptotic normality of the proposed methods, we demonstrate that the proposed indirect inference methodology, at least for selected auxiliary criteria, combines weak distributional assumptions, good estimation precision, and robustness to misclassification of responses, which often occurs in survey data. Finally, we investigate the finite-sample properties of the proposed estimators by Monte Carlo simulation studies and conclude that they perform adequately well in comparison to the existing estimators in the literature.

In Chapter 3, we develop a unified framework for threshold regression models by generalizing the recently proposed integrated difference kernel estimator (IDKE) of Yu and Phillips (2018) to absolutely regular stationary time series. We propose to use the IDKE as a plug-in estimator in a wide class of parametric, semiparametric, and nonparametric regression models to estimate the parameters of interest in multiple regimes determined by the value of the threshold variable. We prove the consistency and asymptotic normality of the proposed method and demonstrate its performance in comparison to existing threshold models in the linear, varying-coefficient, and single-index models via Monte Carlo simulation studies.


Chapter 2

Misclassification-robust semiparametric estimation of single-index binary choice models

2.1 Introduction

Binary-choice models are often considered in the form
\[
Y = I(X^\top\beta^0 + \varepsilon \ge 0), \tag{2.1}
\]

by developing a class of semiparametric estimators insensitive to misclassified responses. One of the principal approaches to semiparametric estimation is based on a conditional quantile restriction on the error term. Manski (1975) proposed the maximum score (MS) estimator that maximizes a discontinuous score function. To improve upon its slow convergence rate $n^{-1/3}$ (Kim and Pollard, 1990), Horowitz (1992) developed the smoothed maximum score (SMS) estimator by smoothing the objective function of the MS estimator. Further, Blevins and Khan (2013) introduced the local nonlinear least squares (LNLLS) estimator, which imposes the conditional median restriction too. Although this estimator can be computed more easily, like its predecessors it cannot attain the convergence rate $n^{-1/2}$, and it additionally exhibits comparatively large finite-sample biases.

Another semiparametric estimation approach to model (2.1) relies on the single-index assumption $P(Y=1|X=x) = P(Y=1|X^\top\beta^0 = x^\top\beta^0)$, where the conditional probability of the response variable Y depends on the covariate values only by means of their linear combination $x^\top\beta^0$. Exploiting the single-index assumption, Ichimura (1993) constructed the semiparametric least-squares (SLS) estimator and Klein and Spady (1993) proposed a semiparametric maximum likelihood estimator (referred to as KS) for $\beta^0$ in (2.1). Both estimators are $\sqrt{n}$-consistent and asymptotically normal under certain regularity conditions, but they are difficult to compute due to the complexity of their nonparametrically estimated objective functions, as discussed, for example, in Horowitz and Härdle (1996), Rothe (2009), and Westerlund and Hjertstrand (2014). To mitigate this computational difficulty while retaining the single-index restriction, average derivative estimation methods can be used (e.g., Powell et al., 1989; Horowitz and Härdle, 1996; Hristache et al., 2001a). Many average derivative estimators however involve high-dimensional nonparametric smoothing and are subject to the “curse of dimensionality,” which makes them unattractive from a practitioner’s point of view.


observations have unlikely values of covariates. A similar result would hold true for the KS estimator, which is based on the same objective function as the MLE. Accounting for misclassification as in Hausman et al. (1998), for instance, however requires assumptions concerning the misclassification probability, and Meyer and Mittag (2017) document that these methods do not perform well in real-data applications relative to the standard MLE, which ignores the presence of misclassification. We therefore attempt to develop estimation methods that do not rely on assumptions about and modelling of the misclassification probability, but are instead designed to be rather insensitive to misclassification of responses.

In this context, we introduce an alternative class of semiparametric estimators that alleviates the above-mentioned problems of the existing binary-choice estimators under the single-index restriction. The proposed estimation methods should thus deliver $\sqrt{n}$-consistent estimates that are not subject to the curse of dimensionality, exhibit good finite-sample performance, and, at the same time, are robust to misclassification of the responses. Considering the single-index model $P(Y=1|X=x) = P(Y=1|X^\top\beta^0 = x^\top\beta^0)$, we therefore

The remainder of the paper is organized as follows. In Section 2.2, the parametric II approach is described, and in Section 2.3, we discuss the implementation of semiparametric II for binary-choice models. In Section 2.4, the asymptotic properties of the proposed II estimators are derived. The robustness properties of the main classes of auxiliary criteria are studied in Section 2.5. Monte Carlo simulation results are presented in Section 2.6 to assess the finite-sample performance of the proposed estimators. The proofs are provided in the Appendix.

2.2 Parametric indirect inference

In this section, the parametric indirect inference is introduced following Gourieroux et al. (1993). Consider a general regression model

\[
Y = m(X, \varepsilon; \beta^0), \tag{2.2}
\]
where Y represents the dependent variable with a known conditional distribution function $F_{Y|X,\beta^0}$, X is the (k+1)-dimensional vector of explanatory variables with a distribution function $F_X$, $\beta^0\in B\subset\mathbb{R}^{k+1}$ is the parameter vector, and ε is the unobserved error term with a known conditional distribution function $F_{\varepsilon|X}$. Indirect inference is typically used when direct consistent estimation of the model parameters is complicated, infeasible, or lacks some important features.

Next, the indirect inference requires an auxiliary criterion that is a function of a random sample $\{y_i, x_i\}_{i=1}^n$ from the joint distribution of (Y, X) and of an auxiliary parameter vector θ ∈ Θ ⊂ $\mathbb{R}^p$, p ≥ k+1. This might be an approximation of model (2.2) or a very simple linear regression model estimated by ordinary least squares, for instance, that partially captures the relationship between Y and X. To estimate the auxiliary parameter θ, the auxiliary criterion is maximized:
\[
\hat\theta_n = \arg\max_{\theta\in\Theta} \frac{1}{n}\sum_{i=1}^n q(y_i, x_i;\theta). \tag{2.3}
\]

The distributions $F_X$ and $F_{Y|X,\beta^0}$ fully characterize the data-generating process, and the auxiliary criterion is assumed to converge to a nonstochastic limit $Q_\infty(\cdot)$ with a unique maximum at $\theta^0$ in the parameter space Θ:
\[
\theta^0 = \arg\max_{\theta\in\Theta} Q_\infty\big(F_{Y|X,\beta^0}, F_X;\theta\big).
\]

To link the parameter vector β of the original model (2.2) to the auxiliary parameter vector θ, Gourieroux et al. (1993) define the binding function by
\[
b\big(F_{Y|X,\beta}, F_X\big) = \arg\max_{\theta\in\Theta} Q_\infty\big(F_{Y|X,\beta}, F_X;\theta\big), \tag{2.4}
\]
where the auxiliary criterion is evaluated at some β ∈ B and $\theta^0 \equiv b(F_{Y|X,\beta^0}, F_X)$.

Under suitable regularity assumptions, $\hat\theta_n$ can be shown to be a consistent estimator of $\theta^0$, but the binding function b(·) is unknown. Gourieroux et al. (1993) therefore proposed a simulation-based procedure to link β to θ and to estimate the parameter vector $\beta^0$ using $\hat\theta_n$.

This procedure is based on the fact that, for any β ∈ B and $F_{Y|X,\beta}$ known up to the parameter vector β, it is possible to simulate realizations of the random variable $Y^s(\beta) = m(X,\varepsilon;\beta)$ conditionally on X. Equivalently, given a sample size n, one can simulate from a known $F_{\varepsilon|X}$ a number S of sets of error terms $\tilde\varepsilon^1,\dots,\tilde\varepsilon^S$, where $\tilde\varepsilon^s = \{\varepsilon_i^s\}_{i=1}^n$, s = 1,...,S, and generate S sets of simulated paths $\tilde y^1(\beta),\dots,\tilde y^S(\beta)$, where $\tilde y^s(\beta) = \{\tilde y_i^s(\beta)\}_{i=1}^n$ and $\tilde y_i^s(\beta) = m(x_i,\varepsilon_i^s;\beta)$, s = 1,...,S, for any given β. For these S simulated samples, one can compute S auxiliary estimates
\[
\tilde\theta_n^s(\beta) = \arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n q\big(y_i^s(\beta), x_i;\theta\big), \tag{2.5}
\]

s = 1,...,S. Under appropriate regularity conditions given in Gourieroux et al. (1993), $\tilde\theta_n^s(\beta)$ tends asymptotically to $b(F_{Y|X,\beta}, F_X)$, and the indirect inference estimator, denoted by $\hat\beta_n$, can thus be defined as
\[
\hat\beta_n = \arg\min_{\beta\in B}\left[\hat\theta_n - \frac{1}{S}\sum_{s=1}^S \tilde\theta_n^s(\beta)\right]^\top \Omega \left[\hat\theta_n - \frac{1}{S}\sum_{s=1}^S \tilde\theta_n^s(\beta)\right], \tag{2.6}
\]

where Ω is a positive definite weighting matrix. The estimate $\hat\beta_n$ is thus chosen to minimize the distance between the auxiliary estimates obtained from the observed and the simulated data.
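To make the moving parts of the procedure concrete, the following minimal Python sketch implements (2.3), (2.5), and (2.6) for a toy probit-type model with an OLS auxiliary criterion; the model, sample size, number of simulations S, and optimizer are illustrative assumptions rather than choices made in the text.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def auxiliary_ols(y, x):
    """Auxiliary estimate (2.3): OLS of y on x with an intercept."""
    z = np.column_stack([np.ones(len(y)), x])
    return np.linalg.lstsq(z, y, rcond=None)[0]

def simulate_path(x, beta, eps):
    """One simulated path y^s(beta) = m(x, eps; beta) for a probit-type model."""
    return (x @ beta + eps >= 0).astype(float)

def ii_objective(beta, theta_hat, x, eps_draws, Omega):
    """Distance (2.6) between observed and averaged simulated auxiliary estimates."""
    theta_bar = np.mean(
        [auxiliary_ols(simulate_path(x, beta, e), x) for e in eps_draws], axis=0)
    d = theta_hat - theta_bar
    return d @ Omega @ d

# toy data: n = 500 observations, true beta = (1, 1)
n, S = 500, 20
x = rng.normal(size=(n, 2))
y = simulate_path(x, np.array([1.0, 1.0]), rng.normal(size=n))

theta_hat = auxiliary_ols(y, x)            # auxiliary estimate on observed data
eps_draws = rng.normal(size=(S, n))        # errors drawn once, reused for every beta
beta_ii = minimize(ii_objective, np.zeros(2),
                   args=(theta_hat, x, eps_draws, np.eye(3)),
                   method="Nelder-Mead").x  # indirect inference estimate (2.6)
```

Note that the error draws are generated once and reused for every candidate β (common random numbers), so that the objective in (2.6) does not change with re-randomization across evaluations of β.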

2.3 Indirect inference for the binary-choice model

Let us now propose the semiparametric indirect inference to estimate the parameters of the binary-choice model under the single-index restriction, using the notation $P(Y=1|X^\top\beta = t) = G(\beta, t)$. The single-index binary-choice model is formulated as
\[
P(Y=1|X=x) = P(Y=1|X^\top\beta^0 = x^\top\beta^0) = G(\beta^0, x^\top\beta^0), \tag{2.7}
\]
where the identification of the parameters requires at least that $\beta^0 = (0, \beta_1^0,\dots,\beta_{k-1}^0,\beta_k^0)^\top$ does not contain an intercept, $\beta_0^0 = 0$, and is scale normalized, for example, by $\beta_k^0 = 1$ as in Klein and Spady (1993). (Note that the intercept could be identified analogously to Chen, 2000, if the symmetry of the latent error distribution is imposed.) Throughout the paper, we use $P(Y=1|X^\top\beta = t)$ and $G(\beta, t)$ interchangeably, whichever makes the exposition clearer. In the rest of this section, we first describe the infeasible indirect inference estimator that assumes knowledge of the function G(·) and is thus a special case of the parametric II estimator described in Section 2.2. Later, we introduce the feasible indirect inference estimator that is based on a nonparametric estimate of G(·) and is thus semiparametric in nature. Both infeasible and feasible indirect inference estimators are initially defined for a general auxiliary criterion in Section 2.3.1. Various examples of auxiliary criteria are then discussed in Section 2.3.2.

2.3.1 Semiparametric indirect inference

The proposed indirect estimation approach is now introduced using a general auxiliary estimator.

Let $\{y_i, x_i\}_{i=1}^n$ be a random sample from model (2.7). The auxiliary estimator for the sample is defined by
\[
\hat\theta_n = \arg\max_{\theta\in\Theta} Q_n(Y, X;\theta) = \arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n q(y_i, x_i;\theta), \tag{2.8}
\]

where Θ ⊆ $\mathbb{R}^p$ with p ≥ k−1 due to the normalizations introduced in the preceding section. By exploiting the single-index structure of the model (2.7), we define the binding function as
\[
b\big(G(\beta, X^\top\beta), F_X\big) = \arg\max_{\theta\in\Theta} Q_\infty\big(G(\beta, X^\top\beta), F_X;\theta\big). \tag{2.9}
\]

Next, the auxiliary estimator for the simulated data has to be defined. Simulating data from model (2.7) requires random draws of Bernoulli random variables $y_i^s(\beta)$ conditional upon $x_i$. Assuming first that G(·) is known, one can generate uniformly distributed random variables $u_i^s\sim U(0,1)$ for all i = 1,...,n and s = 1,...,S and define draws from the distribution of Y conditional on the single index by
\[
y_i^s(\beta) = I\big(G(\beta, x_i^\top\beta) \ge u_i^s\big). \tag{2.10}
\]

The auxiliary estimates for the S simulated data samples $\{y_i^s(\beta), x_i\}_{i=1}^n$ are obtained by
\[
\tilde\theta_n^s(\beta) = \arg\max_{\theta\in\Theta} Q_n\big(Y^s(\beta), X;\theta\big) = \arg\max_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^n q\big(y_i^s(\beta), x_i;\theta\big), \tag{2.11}
\]
and the infeasible indirect inference estimator is then defined as
\[
\hat\beta_n = \arg\min_{\beta\in B}\left[\hat\theta_n - \frac{1}{S}\sum_{s=1}^S \tilde\theta_n^s(\beta)\right]^\top \Omega \left[\hat\theta_n - \frac{1}{S}\sum_{s=1}^S \tilde\theta_n^s(\beta)\right], \tag{2.12}
\]
where Ω is again a positive definite weighting matrix.

To define the feasible indirect inference (FII) estimator, $G(\beta, x^\top\beta) = P(Y=1|X^\top\beta = x^\top\beta)$ has to be replaced by its consistent estimate $\hat G_n(\beta, x^\top\beta)$ as in Westerlund and Hjertstrand (2014). Given a value t of $x^\top\beta$, this estimate can be based, for example, on local polynomial estimators. We choose to estimate it by the Nadaraya-Watson estimator as in Klein and Spady (1993), Ichimura (1993), and Rothe (2009), among many others in the literature. In particular, for any β and t, we estimate G(β, t) by
\[
\hat G_n(\beta, t) = \frac{\sum_{j=1}^n K_{h_n}\big(x_j^\top\beta - t\big)\, y_j}{\sum_{j=1}^n K_{h_n}\big(x_j^\top\beta - t\big)}, \tag{2.13}
\]
where $K_{h_n}(\cdot) = K(\cdot/h_n)$ is a one-dimensional kernel function and the bandwidth $h_n\to 0$ as n → ∞.
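For concreteness, a direct vectorized implementation of (2.13) might look as follows; the Gaussian kernel and the interface are assumptions of this sketch.

```python
import numpy as np

def nadaraya_watson(beta, t, x, y, h):
    """Nadaraya-Watson estimate (2.13) of G(beta, t) at index value(s) t."""
    index = x @ beta                                  # x_j' beta for j = 1, ..., n
    u = (index[None, :] - np.atleast_1d(t)[:, None]) / h
    k = np.exp(-0.5 * u ** 2)                         # Gaussian kernel K((x_j'beta - t)/h)
    return k @ y / k.sum(axis=1)                      # kernel-weighted average of y_j
```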

Furthermore, to facilitate general asymptotic results, we additionally smooth the indicator function in (2.10) using a continuous distribution function $\mathcal K:\mathbb{R}\to[0,1]$ and a bandwidth $\lambda_n$. We thus define simulated responses by
\[
\hat y_i^s(\beta) = \mathcal K\!\left(\frac{\hat G_n(\beta, x_i^\top\beta) - u_i^s}{\lambda_n}\right), \tag{2.14}
\]
where $\mathcal K(\cdot)$ satisfies $\lim_{s\to-\infty}\mathcal K(s) = 0$ and $\lim_{s\to\infty}\mathcal K(s) = 1$ and the bandwidth satisfies $n^{1/2}\lambda_n\to 0$ as n → ∞ (note that the choice of the bandwidth $\lambda_n$ is practically irrelevant once $n^{1/2}\lambda_n\to 0$, as it does not influence the asymptotic behavior of the estimator, see Section 2.4, and unreported simulations indicate a practically negligible difference between estimation based on the smoothed responses (2.14) and the unsmoothed responses $\hat y_i^s(\beta) = I(\hat G_n(\beta, x_i^\top\beta)\ge u_i^s)$).

Subsequently, the feasible indirect inference estimator $\hat\beta_n^{FII}$ can be defined as
\[
\hat\beta_n^{FII} = \arg\min_{\beta\in B}\left[\hat\theta_n - \frac{1}{S}\sum_{s=1}^S \hat\theta_n^s(\beta)\right]^\top \hat\Omega_n \left[\hat\theta_n - \frac{1}{S}\sum_{s=1}^S \hat\theta_n^s(\beta)\right], \tag{2.15}
\]
where the auxiliary estimator for the s-th simulated path, s = 1,...,S, equals
\[
\hat\theta_n^s(\beta) = \arg\max_{\theta\in\Theta} Q_n\big(\hat Y^s(\beta), X;\theta\big) = \arg\max_{\theta\in\Theta} n^{-1}\sum_{i=1}^n q\big(\hat y_i^s(\beta), x_i;\theta\big)
\]
and $\hat\Omega_n$ is a consistent estimate of the positive definite weighting matrix Ω. For an observed sample $\{y_i, x_i\}_{i=1}^n$, the proposed feasible indirect inference estimator thus consists of the following steps: (i) obtaining the auxiliary estimate $\hat\theta_n$ for the observed sample; (ii) drawing S simulated samples $\{u_i^s\}_{i=1}^n$ of errors $u_i^s\sim U(0,1)$; (iii) for any given β ∈ B, computing the estimates $\hat G_n(\beta, x_i^\top\beta)$ of $G(\beta, x_i^\top\beta)$ for all i = 1,...,n using the observed sample $\{y_i, x_i\}_{i=1}^n$ and obtaining S simulated samples $\{\hat y_i^s(\beta), x_i\}_{i=1}^n$ and the corresponding auxiliary estimates $\hat\theta_n^s(\beta)$; and (iv) estimating β by minimizing the distance (2.15) between the auxiliary estimates obtained for the observed and simulated samples.

2.3.2 Examples of auxiliary criteria

The infeasible and feasible II estimators have been introduced for a general auxiliary criterion. In this section, several examples of auxiliary criteria are presented.

First, consider auxiliary criteria based on the least-squares loss function
\[
\frac{1}{n}\sum_{i=1}^n \big\{y_i - g(x_i^\top\theta)\big\}^2, \tag{2.16}
\]
where g(·) is chosen, for example, as the standard normal or logistic distribution function. These criteria exhibit good finite-sample performance in models with homoscedastic errors (see also Section 2.6).

In the presence of heteroskedasticity or misclassification, the performance of II estimators with these standard auxiliary criteria could however deteriorate. Therefore, we introduce some non-standard auxiliary criteria that have the potential to fit data deviating from the standard homoscedastic model better than ordinary LS. In particular, we propose two auxiliary criteria that mimic the behavior of the LNLLS and SMS estimators in finite samples, since the LNLLS and SMS estimators are, as smoothed versions of the maximum score estimator of Manski (1975), robust to any form of heteroskedasticity and possibly to misclassification (see the discussion in Lee, 1992, for ordered-choice models), since they impose only the conditional median restriction on the distribution of the error terms. As in the case of the LNLLS and SMS estimators, we impose the scale normalization $|\theta_k| = 1$, but the intercept does not have to be normalized and can be estimated along with the other parameters. An analog of LNLLS leads to the nonlinear least-squares (NLS) auxiliary criterion defined by
\[
\frac{1}{n}\sum_{i=1}^n \left\{y_i - g\left(\frac{x_i^\top\theta}{c}\right)\right\}^2, \tag{2.17}
\]

where c > 0 is now a fixed constant and $|\theta_k| = 1$. The main difference between (2.17) and LNLLS is that the former is a standard parametric estimator with the convergence rate $n^{-1/2}$ due to a fixed c, whereas the latter is a slowly converging semiparametric estimator because LNLLS involves a denominator $c_n$ that converges to 0 with increasing sample size, $c_n\to 0$ as n → ∞.

Considering now the auxiliary criteria based on the absolute-value loss function, an analog of SMS leads to the nonlinear least-absolute-deviations (NLAD) auxiliary criterion defined by
\[
\frac{1}{n}\sum_{i=1}^n \left|y_i - g\left(\frac{x_i^\top\theta}{c}\right)\right|, \tag{2.18}
\]
or equivalently,
\[
\frac{1}{n}\sum_{i=1}^n (1-2y_i)\, g\left(\frac{x_i^\top\theta}{c}\right), \tag{2.19}
\]
where again $|\theta_k| = 1$ and c > 0 is a fixed constant whose value can be chosen so that it minimizes the asymptotic variance of the II estimator; see Section 2.4 for details.

Last but not least, since the simulated responses (2.14) are obtained from a kernel estimate for which uniform convergence is established over compact sets, it might be necessary to introduce a trimming scheme into the auxiliary criteria. Furthermore, if the local constant estimator is employed, trimming can help to overcome boundary bias problems. If $\tilde{\mathcal X}\subset\mathcal X$ is a compact subset of the support $\mathcal X$ of the regressors and $\tau(x_i) = I(x_i\in\tilde{\mathcal X})$, the trimming can be introduced in (2.17) and (2.19) as
\[
\frac{1}{n}\sum_{i=1}^n \tau(x_i)\left\{y_i - g\left(\frac{x_i^\top\theta}{c}\right)\right\}^2, \tag{2.20}
\]
\[
\frac{1}{n}\sum_{i=1}^n \tau(x_i)\,(1-2y_i)\,g\left(\frac{x_i^\top\theta}{c}\right). \tag{2.21}
\]
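As an illustration, the trimmed NLS and NLAD criteria (2.20) and (2.21) can be coded directly and passed to the feasible II routine above as auxiliary criteria; the choice g = Φ and the simple quantile-based trimming set are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

def nls_loss(theta, y, x, c, trim):
    """Trimmed NLS auxiliary criterion (2.20) with g = Phi."""
    g = norm.cdf(x @ theta / c)
    return np.mean(trim * (y - g) ** 2)

def nlad_loss(theta, y, x, c, trim):
    """Trimmed NLAD auxiliary criterion (2.21), using the (1 - 2y)g form."""
    g = norm.cdf(x @ theta / c)
    return np.mean(trim * (1.0 - 2.0 * y) * g)

def fit_auxiliary(loss, y, x, c=0.5, quantile=0.95):
    """Minimize a criterion over (intercept, free slopes), last slope fixed to 1."""
    trim = (np.abs(x).max(axis=1) <=                       # simple trimming tau(x_i):
            np.quantile(np.abs(x).max(axis=1), quantile))  # keep central observations
    k = x.shape[1]

    def wrapped(par):                 # par = (theta_0, theta_1, ..., theta_{k-1})
        theta = np.append(par, 1.0)   # normalized coefficient |theta_k| = 1
        xa = np.column_stack([np.ones(len(y)), x])         # add intercept column
        return loss(theta, y, xa, c, trim)

    return minimize(wrapped, np.zeros(k), method="Nelder-Mead").x
```

For example, `lambda y, x: fit_auxiliary(nlad_loss, y, x)` can serve as the `auxiliary_fit` argument of the feasible II sketch in Section 2.3.1.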

Remark 1. Although the SMS and LNLLS estimators allow for general forms of heteroskedasticity of the error term, they cannot attain the $n^{-1/2}$ convergence rate, exhibit relatively large finite-sample biases, and their finite-sample performance is rather sensitive to the employed bandwidth (Chen and Zhang, 2015). Employing a fixed constant c in (2.20) and (2.21) alleviates these issues at the cost of restricting the form of the heteroskedasticity of the error term, which is assumed to be a function of the single index. Essentially, a fixed choice of the constant c that sufficiently approximates the regression function induces a biased estimate of β, but the employed indirect inference methodology then corrects this bias.

2.4 Large sample properties

In this section, we study the asymptotic properties of the feasible indirect inference estimator. First, we prove that the feasible II estimator is consistent and asymptotically normal in Sections 2.4.1 and 2.4.2, respectively. Then, the choice of the auxiliary criterion and of the corresponding parameter c is discussed in Section 2.4.3.

We assume for simplicity that the first derivative $w(y_i, x_i;\theta)$ of the auxiliary objective function $q(y_i, x_i;\theta)$ has a particular form, at least when deriving the asymptotic distribution in Section 2.4.2: $w(y_i, x_i;\theta) = \{w_1(x_i^\top\theta) + y_i\,w_2(x_i^\top\theta)\}\,x_i$ for some functions $w_1(\cdot)$ and $w_2(\cdot)$ satisfying some conventional properties listed later. This choice however includes all auxiliary criteria in Section 2.3.2 and covers many other possible auxiliary criteria, such as the maximum likelihood and nonlinear least squares objective functions of the binary-choice probit and logit models.
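For instance, for the NLS criterion (2.17), where the auxiliary objective is $q(y, x;\theta) = -\{y - g(x^\top\theta/c)\}^2$, direct differentiation with respect to θ confirms this structure (a worked special case, not an additional assumption):
\[
w(y, x;\theta) = \frac{2}{c}\,\big\{y - g(x^\top\theta/c)\big\}\,g'(x^\top\theta/c)\,x
= \Big\{\underbrace{-\tfrac{2}{c}\,g(x^\top\theta/c)\,g'(x^\top\theta/c)}_{w_1(x^\top\theta)} + y\cdot\underbrace{\tfrac{2}{c}\,g'(x^\top\theta/c)}_{w_2(x^\top\theta)}\Big\}\,x.
\]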

2.4.1 Consistency of the feasible II estimator

Assumption 1. The parameter space $B = \tilde B\times\{1\}$ is a compact subset of $\mathbb{R}^{k-1}\times\{1\}$, and the vector of the first k−1 elements of the true parameter value $\beta^0$ is an element of the interior of $\tilde B$. Similarly, Θ is a compact subset of $\mathbb{R}^p$, p ≥ k−1, and, for any β ∈ B, $b(G(\beta, X^\top\beta), F_X)$ is an element of the interior of Θ. Additionally, $b(G(\beta, X^\top\beta), F_X)$ is a continuous and one-to-one mapping as a function of β.

Assumption 2. $\{(x_i, y_i)\}_{i=1}^n$ is a random sample of (X, Y) in model (2.1). Moreover, the support $\mathcal X$ of X is a compact subset of $\mathbb{R}^k$, and there exists at least one continuous regressor $X_k$ with its coefficient normalized to $\beta_k = 1$.

Assumption 3. The true parameter $\beta^0\in B$ uniquely satisfies $P(Y=1|X) = P(Y=1|X^\top\beta^0)$.

In the presence of discrete explanatory variables, $P(Y=1|X^\top\beta^0 = t)$ should not be periodic, and some additional constraints on the support of the regressors have to be introduced (cf. Ichimura, 1993, Assumption 4).

Assumption 4. Denoting the density of $X^\top\beta$ by $f_\beta$, let $f_\beta$ and G(β, t) satisfy the following conditions:

(i) $\inf_{\beta\in B,\, t\in T(\mathcal X)} f_\beta(t) > 0$, $\inf_{\beta\in B,\, t\in T(\mathcal X)} G(\beta, t) > 0$, and $\sup_{\beta\in B,\, t\in T(\mathcal X)} G(\beta, t) < 1$, where $T(\mathcal X) = \{t\in\mathbb{R}: \exists\, x\in\mathcal X,\ \beta\in B \text{ s.t. } t = x^\top\beta\}$ is a finite union of convex and compact subsets of $\mathbb{R}$.

(ii) For all β ∈ B, $f_\beta(t)$ and G(β, t) are twice continuously differentiable in t with uniformly bounded derivatives in both parameters. Furthermore, $f_\beta(t)$ and G(β, t) are twice continuously differentiable in β with bounded derivatives on $T(\mathcal X)$.

Similar to Assumption 3 of Rothe (2009), Assumption 4 imposes smoothness conditions on $f_\beta(t)$ and G(β, t), which are required to establish the uniform convergence of $\hat G_n(\beta, t)$ and its derivatives.

Assumption 5.

(i) The kernel function $\mathcal K:\mathbb{R}\to[0,1]$ is a twice differentiable integrated kernel function such that $\lim_{s\to-\infty}\mathcal K(s) = 0$, $\lim_{s\to\infty}\mathcal K(s) = 1$, $\lim_{s\to-\infty} s\,\mathcal K(s) = 0$, and $\lim_{s\to\infty} s\,[1-\mathcal K(s)] = 0$. The first derivative $\mathcal K'(s) = \partial\mathcal K(s)/\partial s$ is a bounded density function and $\int_{-\infty}^{+\infty}|s|^2\,\mathcal K'(s)\,ds$ is finite. Furthermore, for some Λ < ∞ and L < ∞: $|\partial^{(j)}\mathcal K(s)/\partial s^j|\le\Lambda$ if $|s|\le L$ and $|\partial^{(j)}\mathcal K(s)/\partial s^j|\le\Lambda\,|s|^{-v}$ for some v > 1 if $|s| > L$, for j = 1, 2. Additionally, the bandwidth $\lambda_n > 0$ satisfies $\lim_{n\to+\infty} n^{1/2}\lambda_n = 0$.

(ii) The kernel function $K:\mathbb{R}\to\mathbb{R}$ is a non-negative second-order kernel satisfying the Lipschitz condition on its compact support. Additionally, the bandwidth $h_n > 0$ satisfies $\lim_{n\to+\infty} n^{1/4}h_n = 0$.

Assumption 5(i) imposes conditions on the integrated kernel function $\mathcal K(\cdot)$ along with integrability conditions on its derivatives similar to Hansen (2008) and Escanciano et al. (2014). This condition is satisfied by the standard normal distribution function, and it can be adapted to functions based on compactly supported kernel functions $\mathcal K'$ satisfying the Lipschitz condition. Furthermore, the smoothing parameter $\lambda_n$ is assumed to be negligible in the sense that $n^{1/2}\lambda_n\to 0$, so that the smoothing of the simulated responses does not affect the asymptotic behavior of the feasible II estimator. Moreover, Assumption 5(ii) demands some conventional restrictions on the kernel function, while it also imposes undersmoothing. Although undersmoothing is not needed to establish the consistency of the feasible II estimator, it is required for its asymptotic normality, as we need to obtain a uniform Bahadur representation of the nonparametric kernel estimate $\hat G_n(\beta^0, t)$ with a negligible remainder term of order $o_P(n^{-1/2})$ for n → ∞ uniformly in $t\in T(\tilde{\mathcal X})$. Alternatively, one can use bias-reducing higher-order kernels or employ local polynomial estimators.

For proving the consistency of the proposed estimator, the identification assumption for the auxiliary estimator and two high-level assumptions are needed. Let us use the shorthand notation b(β) for $b(G(\beta, X^\top\beta), F_X)$ for the moment and define $Q(Y^s(\beta), X; b(\beta)) \equiv E[q(Y^s(\beta), X; b(\beta))]$.

Assumption 6.

(i) $Q(Y^s(\beta), X; b(\beta)) - \sup_{\|\theta - b(\beta)\| > \delta} Q(Y^s(\beta), X;\theta) > 0$ for all β ∈ B and for any δ > 0.

(ii) $\sup_{(\beta,\theta)\in B\times\Theta} |Q_n(Y^s(\beta), X;\theta) - Q(Y^s(\beta), X;\theta)| = o_p(1)$.

(iii) $\sup_{(\beta,\theta)\in B\times\Theta} |Q_n(\hat Y^s(\beta), X;\theta) - Q_n(Y^s(\beta), X;\theta)| = o_p(1)$.

Assumption 6(i) imposes the global identification of b(β) by the auxiliary estimator. If the auxiliary estimator is misspecified, one can empirically check it by looking for the singularity of the second derivative of $Q(Y^s(\beta), X;\theta)$, which would indicate the failure of Assumption 6(i) (White, 1981; White, 1982). Further, Assumption 6(ii) imposes uniform convergence of the infeasible auxiliary objective function, and Assumption 6(iii) guarantees that the difference between the infeasible and feasible auxiliary objective functions is negligible uniformly in the parameters. These two high-level assumptions on the auxiliary objective functions characterize some minimal requirements on otherwise completely general auxiliary criteria and are satisfied for large classes of functions, including the examples given in Section 2.3.2, as we verify in Appendix 2.C.1. Finally, note that despite the fact that Assumptions 6(i) and 6(ii) are, strictly speaking, formulated for the auxiliary estimation based on the simulated data, they also guarantee the consistency of $\hat\theta_n$ based on the random sample $\{y_i, x_i\}_{i=1}^n$, since $y_i$ can be viewed as an infeasible simulated response: $y_i = I(G(\beta^0, x_i^\top\beta^0)\ge u_i)$, where $u_i\sim U(0,1)$.


Theorem 1. Let Ω be a positive definite p × p matrix, $\hat\Omega_n\to\Omega$ in probability as n → ∞, and $S\in\mathbb{N}^+$ be a fixed number of simulated samples. Under Assumptions 1–6, $\hat\beta_n^{FII}$ is a consistent estimator of $\beta^0$: $\hat\beta_n^{FII}\to\beta^0$ in probability as n → ∞.

Proof: See Appendix 2.A.

2.4.2 Asymptotic normality of the feasible II estimator

Assumptions stated in Section 2.4.1 are sufficient for the consistency of the proposed feasible II estimator, as shown in Theorem 1. To derive the asymptotic distribution of the estimator, we however need additional assumptions characterizing primarily the derivatives of the auxiliary objective functions.

Assumption 7. The auxiliary objective function $q(Y^s(\beta), X;\theta)$ is twice continuously differentiable with respect to θ. Let $w(Y^s(\beta), X;\theta)$ denote the derivative of $q(Y^s(\beta), X;\theta)$ with respect to θ, and let it admit the form $w(Y^s(\beta), X;\theta) = \{w_1(X^\top\theta) + Y^s(\beta)\,w_2(X^\top\theta)\}\,X$ for some bounded functions $w_1$ and $w_2$ on $\mathcal X$ and Θ. Additionally, let $v_1$ and $v_2$ denote the derivatives of $w_1$ and $w_2$, respectively, where $v_1$ and $v_2$ are bounded and Lipschitz on their support.

The main condition in Assumption 7 is the form imposed on the auxiliary criterion, which enables us to explicitly derive the asymptotic variance of the proposed feasible II estimator. In particular, consider
\[
\begin{aligned}
E\big[\tau(X)\{w_1(X^\top\theta) + Y^s(\beta)\,w_2(X^\top\theta)\}X\big]
&= E\big[\tau(X)\{w_1(X^\top\theta) + I(G(\beta, X^\top\beta)\ge U)\,w_2(X^\top\theta)\}X\big]\\
&= E\big[\tau(X)\{w_1(X^\top\theta) + G(\beta, X^\top\beta)\,w_2(X^\top\theta)\}X\big],
\end{aligned} \tag{2.22}
\]
which follows from the fact that U is an independently distributed uniform random variable. Since the expectation (2.22) is now linear in $G(\beta, X^\top\beta)$, which is replaced by a nonparametric kernel estimate to construct the feasible II estimator, the effect of the nonparametric estimation can be quantified as in Chen et al. (2003) and Ichimura and Lee (2010). This assumption could be relaxed so that $w(Y^s(\beta), X;\theta)$ satisfies the Lipschitz condition, and we could then establish the asymptotic normality of the proposed estimator with some minor modifications in the proofs.

estimator along with the derivative of the binding function (e.g., see Gourieroux et al., 1993). Let $v(Y^s(\beta), X;\theta)$ denote the second derivative of $q(Y^s(\beta), X;\theta)$ with respect to θ and $\partial_\beta G(\beta^0, X^\top\beta^0)$ denote the first derivative of $G(\beta^0, X^\top\beta^0)$ with respect to β. We define the following matrices:
\[
\begin{aligned}
\Sigma_s &= E\big[\tau(X)\,w(Y^s(\beta^0), X;\theta^0)\,\{\tau(X)\,w(Y^s(\beta^0), X;\theta^0)\}^\top\big],\\
J_s &= -E\big[\tau(X)\,v(Y^s(\beta^0), X;\theta^0)\big],\\
\Gamma &= E\big[\tau(X)\,w_2(X^\top\theta^0)\,X\,\{\partial_\beta G(\beta^0, X^\top\beta^0)\}^\top\big],\\
K_s &= E\big[\tau(X)\,w(Y^r(\beta^0), X;\theta^0)\,\{\tau(X)\,w(Y^s(\beta^0), X;\theta^0)\}^\top\big],\\
\Sigma_{\bar q} &= E\big[\bar q(Y, X;\beta^0,\theta^0)\,\bar q(Y, X;\beta^0,\theta^0)^\top\big],\\
\Sigma_w &= E\big[2\,G(\beta^0, X^\top\beta^0)\{G(\beta^0, X^\top\beta^0) - 1\}\,E\{\tau(X)\,w_2(X^\top\theta^0)\,X \mid X^\top\beta^0\}\,\{\tau(X)\,w_2(X^\top\theta^0)\,X\}^\top\big],
\end{aligned}
\]
where $\bar q(Y, X;\beta^0,\theta^0) = \{Y - G(\beta^0, X^\top\beta^0)\}\,E\{\tau(X)\,w_2(X^\top\theta^0)\,X \mid X^\top\beta^0\}$ and r, s = 1,...,S with r ≠ s, noting that the defined matrices do not depend on the indices s and r. The matrices $\Sigma_s$ and $J_s$ are the variance and the derivative of the population first-order conditions, respectively. Furthermore, $K_s$ represents the covariance between the first-order conditions of the simulated paths, while $\Sigma_{\bar q}$ and $\Sigma_w$ reflect the covariance between the first-order conditions due to the use of the nonparametric estimator $\hat G_n(\beta^0, X^\top\beta^0)$.

Assumption 8. Let $f_{X|X^\top\beta^0}$ denote the density function of X conditional on $X^\top\beta^0$. As a function of the index $X^\top\beta^0$, $f_{X|X^\top\beta^0}$ is twice continuously differentiable with uniformly bounded derivatives on $T(\mathcal X)$ for all $X\in\mathcal X$.

Assumption 9. The matrix $J_s$ is non-singular and the derivative of the binding function $D_s = J_s^{-1}\Gamma$ has full column rank.

Assumption 8 is imposed to ensure the existence of $\Sigma_{\bar q}$ and Γ, as they involve expectations conditional on the index $X^\top\beta^0$, which commonly appear in the asymptotic results for semiparametric estimators of single-index models (e.g., Theorems 4.1 and 4.2 in Xia, 2002). In particular, it is due to the fact that the derivative of $\hat G_n(\beta^0, X^\top\beta^0)$ with respect to β converges in probability to $\partial_{X^\top\beta^0} G(\beta^0, X^\top\beta^0)\,\{X - E(X|X^\top\beta^0)\}$ as n → ∞ (e.g., Lemma 5.6 in Ichimura, 1993). The matrices in Assumption 9 are the standard matrices in indirect estimation (see Gourieroux et al., 1993), except for $\Sigma_{\bar q}$ and $\Sigma_w$, which reflect the effect of the nonparametric estimation of $G(\beta, X^\top\beta)$.

Given these assumptions and the matrices defined above, the asymptotic distribution of the feasible II estimator can be derived.

Theorem 2. Let Ω be a positive definite p × p matrix, $\hat\Omega_n\to\Omega$ in probability as n → ∞, and $S\in\mathbb{N}^+$ be a fixed number of simulated samples. Under Assumptions 1–9, $\hat\beta_n^{FII}$ is asymptotically normal:
\[
\sqrt{n}\,\big(\hat\beta_n^{FII} - \beta^0\big) \to N\big(0, V(S,\Omega)\big) \tag{2.23}
\]
in distribution as n → +∞, where
\[
V(S,\Omega) = \big(D_s^\top\Omega D_s\big)^{-1} D_s^\top\Omega J_s^{-1}\Big[\Big(1+\frac{1}{S}\Big)(\Sigma_s - K_s) + \Sigma_{\bar q} + \Sigma_w\Big] J_s^{-1}\Omega D_s \big(D_s^\top\Omega D_s\big)^{-1}.
\]

Proof: See Appendix 2.A.

The asymptotic variance derived in Theorem 2 depends on a number of matrices. First, the matrices $\Sigma_s$, $K_s$, and $J_s$ can typically be estimated by taking the corresponding sample averages. Turning our attention to $\Sigma_{\bar q}$, we can estimate it by the sample average of the outer products of
\[
\tau(x_i)\big\{y_i - \hat G_n\big(\hat\beta_n^{FII}, x_i^\top\hat\beta_n^{FII}\big)\big\}\,\hat E_n\big[w_2(x_i^\top\hat\theta_n)\,x_i \,\big|\, x_i^\top\hat\beta_n^{FII}\big], \tag{2.24}
\]
where
\[
\hat E_n\big[w_2(x_i^\top\hat\theta_n)\,x_i \,\big|\, x_i^\top\hat\beta_n^{FII}\big]
= \frac{\sum_{j=1}^n K_{\tilde h_n}\big(x_j^\top\hat\beta_n^{FII} - x_i^\top\hat\beta_n^{FII}\big)\,w_2(x_j^\top\hat\theta_n)\,x_j}{\sum_{j=1}^n K_{\tilde h_n}\big(x_j^\top\hat\beta_n^{FII} - x_i^\top\hat\beta_n^{FII}\big)} \tag{2.25}
\]
with some bandwidth sequence $\tilde h_n\to 0$ as n → ∞. Lastly, $\Gamma = E[\tau(X)\,w_2(X^\top\theta^0)\,X\,\{\partial_\beta G(\beta^0, X^\top\beta^0)\}^\top]$ can be estimated by again taking sample averages, provided that $\partial_\beta G(\beta^0, X^\top\beta^0)$ is replaced by a consistent estimator. Note that $\partial_\beta G(\beta^0, X^\top\beta^0) = \partial_{X^\top\beta^0} G(\beta^0, X^\top\beta^0)\,\{X - E(X|X^\top\beta^0)\}$; hence, $\partial_{X^\top\beta^0} G(\beta^0, X^\top\beta^0)$ can be estimated by standard one-dimensional local linear estimation, and $E(X|X^\top\beta^0)$ can be estimated similarly to (2.25).
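The conditional-mean smoother (2.25) is again a one-dimensional Nadaraya-Watson average and can reuse the kernel machinery of Section 2.3; a minimal version (Gaussian kernel assumed) is:

```python
import numpy as np

def cond_mean(v, t, h):
    """Kernel estimate in the style of (2.25) of E[v_i | t_i] at each sample point.

    v: (n, d) array of responses, e.g. w_2(x_j' theta_hat) * x_j;
    t: (n,) array of index values x_j' beta_hat.
    """
    u = (t[None, :] - t[:, None]) / h
    k = np.exp(-0.5 * u ** 2)                 # Gaussian kernel weights
    return k @ v / k.sum(axis=1, keepdims=True)
```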

Further, the weighting matrix Ω and its estimate $\hat\Omega_n$ can be chosen by the researcher. Given the form of V(S, Ω) and assuming the invertibility of $(1+S^{-1})(\Sigma_s - K_s) + \Sigma_{\bar q} + \Sigma_w$, the asymptotic variance is however minimized for
\[
\Omega^{opt} = J_s\big[(1+S^{-1})(\Sigma_s - K_s) + \Sigma_{\bar q} + \Sigma_w\big]^{-1} J_s
\]
(Gourieroux et al., 1993, Proposition 4), which leads to
\[
V(S,\Omega^{opt}) = \Big(\Gamma^\top\big[(1+S^{-1})(\Sigma_s - K_s) + \Sigma_{\bar q} + \Sigma_w\big]^{-1}\Gamma\Big)^{-1}.
\]
Since the optimal weighting matrix $\Omega^{opt}$ is not known a priori in the feasible case, a two-step procedure has to be used in order to minimize the asymptotic variance. First, an initial estimator $\hat\beta_n^0$ of $\beta^0$ has to be obtained using a known weighting matrix, for example, Ω = I. Next, the optimal weighting matrix $\Omega^{opt}$ has to be estimated using $\hat\beta_n^0$, and the feasible estimation can then be performed once again using the estimate $\hat\Omega_n^{opt}$ of $\Omega^{opt}$.

Finally, let us note that the asymptotic variance is obviously decreasing with the number S of simulated samples.

Remark 2. The ability to estimate the elements of the asymptotic variance and the use of the smoothed simulated responses (2.14) enable us to compute estimates using derivative-based optimization procedures, as discussed in Bruins et al. (2018). More specifically, we can employ a Gauss-Newton type of algorithm and, for a given iterate $\beta_k$, obtain the new iterate $\beta_{k+1}$ by
\[
\beta_{k+1} = \beta_k - \big(\hat H^\top(\beta_k)\,\hat\Omega_n\,\hat H(\beta_k)\big)^{-1}\hat H^\top(\beta_k)\,\hat\Omega_n\Big(\hat\theta_n - \frac{1}{S}\sum_{s=1}^S \hat\theta_n^s(\beta_k)\Big),
\]
\[
\hat H(\beta_k) = \frac{1}{S}\sum_{s=1}^S \frac{\partial\hat\theta_n^s(\beta_k)^\top}{\partial\beta}
= \Big(\frac{1}{S}\sum_{s=1}^S \hat J_s(\beta_k)\Big)^{-1}\Big(\frac{1}{S}\sum_{s=1}^S \frac{\partial Q_n\big(\hat Y^s(\beta_k), \hat\theta_n^s(\beta_k)\big)^\top}{\partial\beta}\Big),
\]
where $\hat J_s(\beta_k) = n^{-1}\sum_{i=1}^n \tau(x_i)\,v\big(\hat y_i^s(\beta_k), x_i;\hat\theta_n^s(\beta_k)\big)$. It is easy to obtain $\hat J_s$ by taking sample averages. Although we can obtain the analytical derivative $\partial Q_n\big(\hat Y^s(\beta_k), \hat\theta_n^s(\beta_k)\big)^\top/\partial\beta$, it is computationally easier to approximate it by the consistent estimate of its asymptotic limit
\[
\frac{1}{nS}\sum_{s=1}^S\sum_{i=1}^n \tau(x_i)\,w_2\big(x_i^\top\hat\theta_n^s(\beta_k)\big)\,\partial_{x_i^\top\beta_k}\hat G_n\big(\beta_k, x_i^\top\beta_k\big)\,\big\{x_i - \hat E_n\big(x_i \mid x_i^\top\beta_k\big)\big\}\,x_i^\top,
\]
where $\partial_{x_i^\top\beta_k}\hat G_n(\beta_k, x_i^\top\beta_k)$ is a by-product once we estimate $G(\beta_k, x_i^\top\beta_k)$ by local linear smoothing, and $\hat E_n(x_i \mid x_i^\top\beta_k)$ can again be obtained by one-dimensional kernel smoothing.
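A single iteration of this scheme is straightforward to code once $\hat H(\beta_k)$ and the auxiliary estimates are available; the sketch below only spells out the update formula from the remark, with all inputs assumed to be precomputed.

```python
import numpy as np

def gauss_newton_step(beta_k, theta_hat, theta_bar, H, Omega):
    """One Gauss-Newton update from Remark 2.

    theta_bar: (1/S) * sum of simulated auxiliary estimates at beta_k;
    H: estimate of the Jacobian H(beta_k) of the simulated auxiliary estimates.
    """
    r = theta_hat - theta_bar                       # distance vector from (2.15)
    A = H.T @ Omega @ H
    return beta_k - np.linalg.solve(A, H.T @ Omega @ r)
```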

2.4.3 Choice of auxiliary criterion

The function g in the auxiliary criteria should ideally be equal to the distribution function of the latent error term in model (2.1) for the nonlinear least squares (2.20) if the latent error term is homoskedastic. Although the error distribution is not known, it seems reasonable to choose g equal to a distribution function which might be representative for the data. If there is a lack of any prior knowledge, we suggest using g equal to the normal or logistic distribution function as in Horowitz (1992) or Blevins and Khan (2013).

For a given choice of g, a user needs to determine the constant c in (2.20) or (2.21). The optimal choice of c should minimize the asymptotic variance of the estimates. We thus suggest estimating c in the following way:
\[
\hat c_n = \arg\min_{c>0} \hat V_n(g, c; S, \Omega), \tag{2.26}
\]
where $\hat V_n(g, c; S,\Omega)$ is an estimate of V(S, Ω) with an explicitly stated dependence on the function g and the constant c. Since criterion (2.26) minimizes a consistent estimate $\hat V_n(g, c; S,\Omega)$ of the asymptotic variance V(S, Ω), $\hat c_n$ should converge to some fixed $c_\infty = \arg\min_{c>0} V(g, c; S,\Omega)$ under some regularity conditions, and, in moderate and large samples, the choice of c can be practically independent of the sample size (e.g., contrary to the bandwidth in nonparametric estimation). We will demonstrate in Section 2.6 that the resulting estimators are not overly sensitive to the choice of c and that some fixed value such as c = 0.5, that is, half of the standard deviation of the regressor with the normalized coefficient, works sufficiently well across various experiments.
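In practice, the minimization in (2.26) can be approximated by a coarse grid search, re-estimating the asymptotic variance at each candidate value; the grid below and the scalar variance summary are placeholder assumptions.

```python
import numpy as np

def choose_c(v_hat, grid=np.linspace(0.25, 2.0, 8)):
    """Approximate (2.26): pick c minimizing an estimate v_hat(c) of V(S, Omega).

    v_hat: callable returning a scalar summary (e.g. the trace) of the estimated
    asymptotic variance of the feasible II estimator for a given c.
    """
    values = [v_hat(c) for c in grid]
    return grid[int(np.argmin(values))]
```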

Nevertheless, let us demonstrate how the asymptotic standard deviation of the II estimator with g ≡ Φ depends on c, for example, in the case of the auxiliary estimators (2.20) and (2.21) based on NLS and NLAD. We consider two data-generating processes: homoscedastic and heteroskedastic probit models $y_i = I(x_{1i} + x_{2i} + \varepsilon_i \ge 0)$ with two variables $x_{1i}, x_{2i}\sim N(0,1)$, parameter vector $\beta^0 = (0, 1, 1)^\top$, and the error term $\varepsilon_i\sim N(0,\sigma_i^2)$ with conditional variance $\sigma_i^2 = 1$ or $\sigma_i^2 = \{2/3 + x_i^\top\beta^0/3 + (x_i^\top\beta^0)^2/6\}^2$, respectively. The estimates of the asymptotic standard deviations¹ of the feasible II estimator are depicted in Figure 2.1. In the homoscedastic case, the estimate of the optimal value of c in (2.20) is approximately 1.8, but the increase in standard deviation due to a different choice of c ∈ [0.65, 2] is less than 10%. In the case of the auxiliary estimator (2.21), the optimal value of c is around 0.8 (with the corresponding 10%-increase range being approximately [0.5, 1.25]).

¹We estimate the unknown auxiliary parameter vector θ. Furthermore, $\Sigma_s - K_s$ is estimated following Gourieroux et al. (1993). The nonparametric estimates required to obtain the estimates of Γ, $\Sigma_{\bar q}$, and $\Sigma_w$

Figure 2.1: The asymptotic standard deviation estimates of the II estimator as a function of the constant c in the homoscedastic probit (left) and the heteroskedastic probit (right). The II estimator is based on the NLS (top) or NLAD (bottom) criterion; n = 100000, S = 50.

In the heteroskedastic case, the dependence of the standard deviation on c is stronger, but the range of c values that do not increase the standard deviation by more than 10% relative to the optimal choice of c is approximately [0.4, 2] and [0.3, 1.25] for the auxiliary criteria (2.20) and (2.21), respectively, despite a very pronounced heteroskedasticity. For the two models and two auxiliary estimators, any value c ∈ [0.65, 1.0] thus guarantees a variance close to the optimal one, irrespective of the auxiliary criterion or the model.

2.5 Robustness to misclassification

misclassified observations (Neuhaus, 1999), especially if misclassified responses combine with large values of explanatory variables (Croux et al., 2002). While the bias due to misclassification is the best measure of the impact of misclassification on the estimates, it is difficult to derive analytically for the complex methods discussed here. To characterize the robustness to misclassification of the semiparametric methods considered in this paper, we therefore use a global measure of an estimator's robustness to atypical data, be it misclassified observations or mismeasured covariates: the breakdown point, which is defined as the smallest fraction of contamination that can make the estimator “completely unreliable” (e.g., in the sense of the estimator's bias being maximal). Following Croux et al. (2002), we utilize the notion of the additive breakdown point for model (2.1), since the usual replacement breakdown point is not appropriate for studying the robustness of estimators in a binary-choice model. Thus, the breakdown point of an estimator $\hat\beta(Z_n)$ of the slope coefficients at sample $Z_n = (x_i, y_i)_{i=1}^n$ is defined as $\varepsilon_n = \bar m/(n+\bar m)$ with
\[
\bar m = \min\Big\{m\in\mathbb{N}_0 : \inf_{Z'_{n+m}}\|\hat\beta(Z'_{n+m})\| = 0 \ \text{ or } \ \sup_{Z'_{n+m}}\|\hat\beta(Z'_{n+m})\| = \infty\Big\}, \tag{2.27}
\]
where $Z'_{n+m}$ represents samples obtained from $Z_n$ by adding m arbitrary observations (e.g., Čížek, 2008). The asymptotic breakdown point is then $\varepsilon = \lim_{n\to\infty}\varepsilon_n$ if the limit exists. Croux et al. (2002) showed that the MLE of the binary-choice logit model breaks down to zero regardless of the sample data if only 2k outliers are added. Thus, the breakdown point of MLE is bounded by 2k/(n + 2k) and its asymptotic breakdown point is zero. Let us note that the maximum likelihood estimates do not explode to infinity because of contamination if the model is identified (Croux et al., 2002). Since this is also the case for the proposed (auxiliary) estimators due to the form of their objective functions, we concentrate here on breakdown to zero.
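The additive breakdown notion in (2.27) can also be probed numerically: starting from a clean logit sample, one appends m contaminating observations and tracks the norm of the refitted slope estimates. The contamination below (high-leverage covariates with flipped responses) is an illustrative choice in the spirit of Croux et al. (2002), not the worst-case construction used in the proofs.

```python
import numpy as np
from scipy.optimize import minimize

def logit_mle_slopes(x, y):
    """Logit MLE via direct minimization of the negative log-likelihood."""
    z = np.column_stack([np.ones(len(y)), x])
    nll = lambda b: np.sum(np.logaddexp(0.0, z @ b) - y * (z @ b))
    return minimize(nll, np.zeros(z.shape[1]), method="BFGS").x[1:]  # drop intercept

rng = np.random.default_rng(1)
n, k = 500, 2
x = rng.normal(size=(n, k))
y = (x.sum(axis=1) + rng.logistic(size=n) >= 0).astype(float)

for m in [0, 2, 4, 8, 16]:
    x_bad = np.vstack([x, 50.0 * np.ones((m, k))])     # m high-leverage points ...
    y_bad = np.concatenate([y, np.zeros(m)])           # ... with misclassified responses
    print(m, np.linalg.norm(logit_mle_slopes(x_bad, y_bad)))  # slope norm driven to 0
```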

When the auxiliary estimator breaks down, that is, $\inf_{Z'_{n+m}}\|\hat\beta(Z'_{n+m})\| = 0$ in (2.27), the link function is not one-to-one anymore and the II cannot identify the parameters β. Hence, the breakdown of the II estimator and the breakdown of the auxiliary estimator will be considered equivalent, despite the fact that the II estimator does not satisfy the condition in (2.27) due to the lack of identification once the auxiliary estimator breaks down (i.e., instead of the bias, the variance of the II estimator can become infinite when the auxiliary estimator breaks down). To this end, we will analyze the robustness of the auxiliary estimators in the worst-case scenario and show, for example, that the II estimator based on the auxiliary LS criterion (Westerlund and Hjertstrand, 2014) has a breakdown point equal to k/(n + k), which converges asymptotically to 0 as in the case of MLE. On the other hand, the auxiliary criteria (2.17) and (2.19) can attain a positive breakdown point even asymptotically.

Before stating the theorem characterizing the breakdown point of the auxiliary estimators, note that the semiparametric model (2.7) requires a scale normalization, which is assumed to take the form $\beta_k^0 = 1$ here. The auxiliary estimators (2.17) and (2.19) also involve the scale normalization of one coefficient, but this does not necessarily have to be the coefficient $\theta_k^0$: the following theorem allows that some other coefficient is normalized instead. Furthermore, the derived results depend on the size of the support of the normalized variable, which can be chosen by the researcher. We therefore derive the limit results also for the support shrinking to a singleton. Last but not least, as the presence of trimming will certainly improve the robustness properties, we consider the auxiliary objective functions (2.17) and (2.18) without the compactness of the support of X, except for the regressor with the normalized coefficient.

Theorem 3. Consider a fixed c > 0 and a given univariate distribution function $g:\mathbb{R}\to[0,1]$ such that g(0) = 1/2. Given a sample $\{y_i, x_i\}_{i=1}^n$, let us assume that the variable $x_{ij}$ has support $\operatorname{supp} x_{ij} \subseteq [C_j^L, C_j^U] \subseteq\mathbb{R}$, j = 1,...,k. Under Assumptions 1 and 2, it holds for the auxiliary estimators of model (2.1) that

1. the linear least-squares estimator breaks down at $\{y_i, x_i\}_{i=1}^n$ to 0 if m ≥ k, and thus $\varepsilon_n \le k/(n+k)$, which converges to 0 as n → +∞;

2. the nonlinear least-squares estimator minimizing (2.17) with the slope coefficient of $x_{ij}$ normalized to 1 breaks down at $\{y_i, x_i\}_{i=1}^n$ to 0 if the number m of contaminated data points satisfies
\[
\min_{\theta_0\in\mathbb{R}}\left[\sum_{i=1}^n\left\{y_i - g\left(\frac{\theta_0 + x_{ij}}{c}\right)\right\}^2 + m\min_{d_1\in\{0,1\},\, d_2\in\{L,U\}}\left\{d_1 - g\left(\frac{\theta_0 + C_j^{d_2}}{c}\right)\right\}^2\right] - S_{1n}(\theta) \le m
\]
at any θ ∈ Θ, where $S_{1n}(\theta) = \sum_{i=1}^n\{y_i - g(x_i^\top\theta/c)\}^2$; if the support of the variable $x_{ij}$ approaches a singleton so that $|C_j^L - C_j^U|\to 0$, the nonlinear least-squares estimator minimizing (2.17) breaks down at $\{y_i, x_i\}_{i=1}^n$ to 0 if the number m of contaminated data points satisfies
\[
-\frac{1}{2}\{n(1-\bar y_n) + S_{1n}(\theta)\} + \sqrt{\frac{1}{4}\{n(1-\bar y_n) + S_{1n}(\theta)\}^2 + n^2\bar y_n(1-\bar y_n) - nS_{1n}(\theta)} \le m
\]
at any θ ∈ Θ, and the corresponding asymptotic breakdown point $\varepsilon_1$ satisfies (with $e_i(\theta, c) = y_i - g(x_i^\top\theta/c)$ and $\theta_c^0 = \arg\min_{\theta\in\Theta} E\{y_i - g(x_i^\top\theta/c)\}^2$)
\[
-\frac{1}{2}\{1 - Ey_i + Ee_i^2(\theta_c^0, c)\} + \sqrt{\frac{1}{4}\{1 - Ey_i + Ee_i^2(\theta_c^0, c)\}^2 + Ey_i(1 - Ey_i) - Ee_i^2(\theta_c^0, c)} \le \varepsilon_1/(1-\varepsilon_1);
\]

3. the nonlinear least-absolute-deviations estimator minimizing (2.18) with the slope coefficient of $x_{ij}$ normalized to 1 breaks down at $\{y_i, x_i\}_{i=1}^n$ to 0 if the number m of contaminated data points satisfies
\[
\min_{\theta_0\in\mathbb{R}}\left[\sum_{i=1}^n\left|y_i - g\left(\frac{\theta_0 + x_{ij}}{c}\right)\right| + m\min_{d_1\in\{0,1\},\, d_2\in\{L,U\}}\left|d_1 - g\left(\frac{\theta_0 + C_j^{d_2}}{c}\right)\right|\right] - S_{2n}(\theta) \le m
\]
at any θ ∈ Θ, where $S_{2n}(\theta) = \sum_{i=1}^n |y_i - g(x_i^\top\theta/c)|$; if the support of the variable $x_{ij}$ approaches a singleton so that $|C_j^L - C_j^U|\to 0$, the nonlinear least-absolute-deviations estimator minimizing (2.18) breaks down at $\{y_i, x_i\}_{i=1}^n$ to 0 if the number m of contaminated data points satisfies
\[
\sum_{i=1}^n |y_i| - S_{2n}(\theta) \le m
\]
at any θ ∈ Θ, and the corresponding asymptotic breakdown point $\varepsilon_2$ satisfies $Ey_i - E|y_i - g(x_i^\top\theta_c^0/c)| = Ey_i - E|e_i(\theta_c^0, c)| \le \varepsilon_2/(1-\varepsilon_2)$ (again, $e_i(\theta, c) = y_i - g(x_i^\top\theta/c)$ and $\theta_c^0 = \arg\min_{\theta\in\Theta} E|y_i - g(x_i^\top\theta/c)|$).

Figure 2.2: Breakdown point of the II estimator as a function of the constant c in the homoscedastic probit for the NLS auxiliary criterion (left) and for the NLAD auxiliary criterion (right); n = 100000. Legend: x₂ ∼ TN(0, 0.1), TN(0, 0.25), TN(0, 0.5), TN(0, 1).

of the normalized variable shrinks to a singleton. Hence, they provide another guide for choosing this tuning constant. To exemplify the expressions in Theorem 3, we evaluated the asymptotic breakdown point in a simple probit model $y_i = I(x_{1i} + x_{2i} + \varepsilon_i \ge 0)$, where the sample size is set to n = 100000, all random variables $x_{1i}$, $x_{2i}$, and $\varepsilon_i$ follow the standard normal distribution, and g ≡ Φ is used. There are two sets of results depicted for c ∈ (0, 2) in Figure 2.2: the straight dashed line depicts the asymptotic breakdown points derived for the limit case of the support consisting of a single point, whereas the other lines are obtained under the normalization of the coefficient of $x_{2i}$, which follows a normal distribution $TN(0,\sigma_2^2)$ truncated at the 0.00005 and 0.99995 quantiles of N(0, 1), with $\sigma_2\in\{0.1, 0.25, 0.5, 1\}$.

In the case of the NLS auxiliary criterion (2.17), the asymptotic breakdown point when the support shrinks to a singleton is around 10%. As the support of $x_{2i}$ increases, the breakdown point slowly decreases and reaches values of 6–9% for c ∈ (0.5, 1). In the case of the NLAD criterion (2.18), the asymptotic breakdown point when the support shrinks to a singleton is much higher and reaches 20%. As the support of $x_{2i}$ increases, the breakdown point decreases and reaches values around 15% at $\sigma_2 = 0.25$ and approximately 10% for $\sigma_2 = 1$. Let us recall here that (the support of) the normalized explanatory variable can be chosen by the researcher.

2.6 Monte Carlo simulations

After deriving the asymptotic properties of the proposed II estimator, it is important to check whether the asymptotic results correspond to the finite-sample performance of the proposed estimators. Moreover, the robustness properties of the auxiliary estimators were established for the extreme case of the breakdown of an estimator, and the practically more relevant measure, the bias due to misclassification, can only be evaluated empirically. To this end, the II methods are now compared with some existing estimators in Monte Carlo experiments. We first describe the simulation setting in Section 2.6.1 and later discuss the simulation results in Section 2.6.2.

2.6.1 Simulation design

The data-generating process is similar to the ones considered in previous studies (e.g., Klein and Spady, 1993, and Chen, 2000). We consider first the binary-choice model without misclassification:
\[
y_i = I(\beta_0 + x_{1i} + x_{2i} + \varepsilon_i \ge 0) \tag{2.28}
\]
with the intercept $\beta_0$, two independent regressors $x_{1i}, x_{2i}\sim N(0,1)$, and normally distributed errors $\varepsilon_i\sim N(0,\sigma_i^2)$; as the results are qualitatively rather similar for different intercepts, the case of $\beta_0 = 0$, and thus the parameter vector $\beta^0 = (0, 1, 1)^\top$, is reported. Results for other error distributions, such as the logistic one, are similar to the presented ones as well.

Later, the probit model with misclassified responses is considered. In that case, the observations with correct responses follow model (2.28) with $x_{1i}, x_{2i}\sim N(0,1)$ and $\varepsilon_i\sim N(0,1)$, whereas the 5% or 10% of misclassified observations follow the model
\[
\tilde y_i = I(\beta_0 + \tilde x_{1i} + \tilde x_{2i} + \varepsilon_i < 0)
\]
with $\tilde x_{1i}, \tilde x_{2i}\sim N(\pm L, 0.2)$ or $\tilde x_{1i}\sim N(\pm L, 0.2)$, $\tilde x_{2i} = 0$, and L ∈ {1.5, 5}. The first set of covariates is chosen in such a way that the misclassified observations mimic to a large extent an index-based heteroskedasticity (i.e., the intensity of misclassification is approximately proportional to the index $x_{1i} + x_{2i}$), whereas the latter set of covariates introduces misclassification with an intensity that is not proportional to the value of the index $x_{1i} + x_{2i}$.
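A sketch of this design may help to fix the bookkeeping; where the text does not pin down a detail (the random assignment of the signs ±L, and reading 0.2 as the variance of the contaminating normal), the choices below are explicitly assumptions.

```python
import numpy as np

def simulate_design(n, misclass=0.10, L=1.5, index_based=True, beta0=0.0, rng=None):
    """One sample from the probit DGP (2.28) with misclassified observations."""
    rng = rng or np.random.default_rng()
    x = rng.normal(size=(n, 2))
    eps = rng.normal(size=n)
    y = (beta0 + x[:, 0] + x[:, 1] + eps >= 0).astype(float)

    m = rng.uniform(size=n) < misclass               # contaminated subsample
    sign = rng.choice([-1.0, 1.0], size=n)           # assumed: +L or -L equally likely
    sd = np.sqrt(0.2)                                # assumed: 0.2 is the variance
    x[m, 0] = rng.normal(sign[m] * L, sd)
    if index_based:
        x[m, 1] = rng.normal(sign[m] * L, sd)        # tilde-x_2i ~ N(+-L, 0.2)
    else:
        x[m, 1] = 0.0                                # tilde-x_2i = 0
    y[m] = (beta0 + x[m, 0] + x[m, 1] + eps[m] < 0).astype(float)  # flipped rule
    return y, x
```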

The estimators compared in the simulations are: (i) the standard probit maximum likelihood estimator (MLE) constructed for normal homoscedastic errors; (ii) the Klein and Spady (KS; 1993) estimator; and (iii) the II estimator based on three auxiliary estimators: the least squares (II-LS), the nonlinear least squares (2.17) with g = Φ and c ∈ {0.5, 1.8, 1.0} (II-NLS), and the nonlinear least absolute deviations (2.18) with g = Φ and c ∈ {0.5, 0.8, 1.0} (II-NLAD). The choices of the constants c = 1.8 and c = 0.8 correspond to the variance-minimizing choice under the homoscedastic probit model, whereas c = 0.5 and c = 1.0 are reported to show how (in)sensitive the method is to the choice of this parameter. The smoothing bandwidth is set to $\lambda_n = c_n n^{-0.55}$, where $c_n$ is the sample standard deviation of $\hat G_n(\beta^0, X^\top\beta^0)$. Unreported simulations indicate that the results are not very sensitive to different values of $\lambda_n$, or even to applying no smoothing at all (i.e., $\mathcal K(t) = I(t\ge 0)$). Finally, note that the II estimator with the auxiliary LS is not equivalent to Westerlund and Hjertstrand (2014), since we use smooth simulated responses and all auxiliary parameters instead of just k−1 slope coefficients. All II estimators use S = 50 simulated samples and the identity weighting matrix Ω = I. All nonparametric estimators of $G(\beta, x_i^\top\beta)$ used within the II and KS procedures (the latter computed by the code available on Rothe (2009)'s website) are based on the Nadaraya-Watson estimator, always use the same bandwidth, and employ the Gaussian kernel. We use the Nelder-Mead and Brent methods for the auxiliary and indirect estimation, respectively.

The simulation results are obtained using 1000 replications for sample sizes n = 200, 400, and 600. We report the bias and root mean squared error (RMSE) of the slope estimate of the variable $x_{1i}$, since the coefficient of $x_{2i}$ is normalized to 1 for identification.

2.6.2 Simulation results

The first set of results is obtained for three sample sizes n ∈ {200, 400, 600} in the homoscedastic probit, that is, $\varepsilon_i\sim N(0,\sigma_i^2)$ with $\sigma_i = 1$ (see Table 2.1). The MLE serves here as a parametric benchmark and, given the homoscedasticity, provides the most precise estimates. Considering the semiparametric estimators, all II methods perform similarly to the semiparametrically efficient KS estimator; the best performance among the semiparametric estimators can be attributed to II-LS, followed by II-NLS. Furthermore, it is worth noticing the insensitivity of the RMSEs with respect to the constant c.

The remaining results are presented only for n = 400. First, the heteroskedastic probit is considered with $\varepsilon_i\sim N(0,\sigma_i^2)$ and $\sigma_i^2 = (1+(x_i^\top\beta)^2)^2/4$ or $\sigma_i^2 = \{2/3 + x_i^\top\beta/3 + (x_i^\top\beta)^2/6\}^2$.

Table 2.1: Biases and root mean squared errors in the homoscedastic probit model for sample sizes n = 200, 400, and 600.

Sample              n = 200          n = 400          n = 600
Estimate    c*      Bias    RMSE     Bias    RMSE     Bias    RMSE
Probit              0.006   0.172    0.009   0.118    0.001   0.093
KS                  0.013   0.215    0.012   0.140    0.004   0.108
II-LS               0.016   0.185    0.012   0.122    0.002   0.095
II-NLS      0.5     0.009   0.206    0.009   0.144    0.005   0.115
            Opt     0.041   0.212    0.032   0.135    0.020   0.103
            1.0     0.030   0.202    0.023   0.133    0.014   0.103
II-NLAD     0.5    -0.009   0.252    0.002   0.166   -0.004   0.129
            Opt    -0.013   0.259    0.002   0.156   -0.004   0.125
            1.0     0.000   0.321    0.002   0.163   -0.005   0.130

* Values 'Opt' of the constant c: c = 1.8 for NLS and c = 0.8 for NLAD.

Table 2.2: Biases and root mean squared errors in the homoscedastic and heteroskedastic probit models for sample size n = 400.

Variance σ_i²       1                (1+(x_i'β⁰)²)²/4   {2/3+x_i'β⁰/3+(x_i'β⁰)²/6}²
Estimate    c*      Bias    RMSE     Bias    RMSE       Bias    RMSE
Probit              0.009   0.118    0.033   0.241      0.005   0.138
KS                  0.012   0.140    0.009   0.140      0.000   0.125
II-LS               0.012   0.122    0.031   0.214      0.010   0.129
II-NLS      0.5     0.009   0.144    0.036   0.124      0.017   0.120
            Opt     0.032   0.135    0.031   0.121      0.024   0.119
            1.0     0.023   0.133    0.048   0.150      0.028   0.121
II-NLAD     0.5     0.002   0.166   -0.002   0.123     -0.005   0.129
            Opt     0.002   0.156   -0.002   0.123     -0.003   0.139
            1.0     0.002   0.163    0.000   0.192     -0.005   0.154

Table 2.3: Biases and root mean squared errors in the homoscedastic probit model (n = 400) under 10% index-based misclassification at $\tilde x_{1i}, \tilde x_{2i}\sim N(\pm L, 0.2)$.

Misclass.           None             At L = ±1.5      At L = ±5.0
Estimate    c*      Bias    RMSE     Bias    RMSE     Bias    RMSE
Probit              0.009   0.118    0.080   0.475   -1.791  26.340
KS                  0.012   0.140    0.018   0.162    0.003   0.146
II-LS               0.012   0.122    0.023   0.195    0.014   0.140
II-NLS      0.5     0.009   0.144    0.005   0.148    0.004   0.164
            Opt     0.032   0.135    0.041   0.199    0.024   0.164
            1.0     0.023   0.133    0.017   0.136    0.014   0.153
II-NLAD     0.5     0.002   0.166    0.005   0.177   -0.005   0.176
            Opt     0.002   0.156    0.001   0.165   -0.005   0.169
            1.0     0.002   0.163    0.002   0.193   -0.008   0.172

* Values 'Opt' of the constant c: c = 1.8 for NLS and c = 0.8 for NLAD.

For these heteroskedastic models, we do not use the optimal values of the tuning constant c obtained under homoscedasticity, but compute and use its estimated optimal value in each scenario. While the heteroskedasticity is a function of the index in both cases, the first choice is symmetric in $x_i^\top\beta$ around zero, whereas the latter one is asymmetric. In these cases, MLE is no longer optimal and exhibits some bias. In the asymmetric case, the ranking of the methods is similar to the homoscedastic probit; that is, the KS and the II estimators do not substantially differ in terms of RMSE. In the symmetric case, the RMSEs of II-LS and of II-NLS and II-NLAD with the largest choice c = 1.0 become very large due to the lack of variation of the auxiliary estimates. Neglecting these estimators, the remaining ones are not affected by this and perform rather similarly: the best performance is observed for II-NLS, followed by II-NLAD, while KS exhibits a slightly larger RMSE.

The remaining simulation results consider the probit model with 5% and 10% misclassified observations that are either close to the rest of the data (L = ±1.5) or far from it (L = ±5). If the misclassification intensity is approximately a function of the true index $x_{1i} + x_{2i}$, which is the case for $\tilde x_{1i}, \tilde x_{2i} \sim N(\pm L, 0.2)$ in Table 2.3, MLE estimation is substantially biased, but the semiparametric single-index methods are, as expected, not substantially affected. (Therefore, only results for the 10% misclassification rate are reported.) On the other hand, the misclassification generally affects $P(y_i = 1 \mid x_{1i} + x_{2i})$ if the probability of misclassification is no longer a function of the index $x_{1i} + x_{2i}$, which is the case of $\tilde x_{1i} \sim N(\pm L, 0.2)$, $\tilde x_{2i} = 0$ in Tables 2.4 and 2.5. In this case, the misclassification can affect the semiparametric estimators as well, as the results in Tables 2.4 and 2.5 document.
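A minimal sketch of one plausible reading of this contamination mechanism follows; the exact scheme is defined earlier in the chapter, and details such as whether 0.2 denotes the variance or the standard deviation are assumptions here. The hypothetical helper `contaminate` flips the responses of a given fraction of observations and relocates their covariates around ±L:

import numpy as np

def contaminate(x, y, rate=0.10, L=1.5, index_based=True, rng=None):
    # Flip the responses of a `rate` fraction of observations and move
    # their covariates close to +L or -L; with index_based=False, only
    # the first covariate is relocated and the second is set to zero,
    # mimicking the "general" misclassification design.
    rng = np.random.default_rng() if rng is None else rng
    x, y = x.copy(), y.copy()
    idx = rng.choice(len(y), size=int(rate * len(y)), replace=False)
    sign = rng.choice([-1.0, 1.0], size=len(idx))
    x[idx, 0] = rng.normal(sign * L, 0.2)       # 0.2 taken as std. dev. here
    x[idx, 1] = rng.normal(sign * L, 0.2) if index_based else 0.0
    y[idx] = 1.0 - y[idx]                       # misclassified responses
    return x, y

With `index_based=False`, the misclassification probability is no longer a function of $x_{1i} + x_{2i}$ alone, matching the distinction drawn above.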


Table 2.4: Biases and root mean squared errors in the homoscedastic probit model (n = 400) under 5% general misclassification at $\tilde x_{1i} \sim N(\pm L, 0.2)$, $\tilde x_{2i} = 0$.

Misclassification:     None            At L = ±1.5       At L = ±5.0
Estimate   c*       Bias    RMSE      Bias    RMSE      Bias    RMSE
Probit              0.009   0.118    -0.271   0.290    -0.835   0.835
KS                  0.012   0.140    -0.193   0.279     0.146   0.219
II-LS               0.012   0.122    -0.229   0.257    -0.304   0.403
II-NLS     0.5      0.009   0.144    -0.014   0.178     0.006   0.146
           Opt      0.032   0.135    -0.169   0.216     0.028   0.187
           1.0      0.023   0.133    -0.139   0.205     0.022   0.150
II-NLAD    0.5      0.002   0.166    -0.010   0.180    -0.004   0.170
           Opt      0.002   0.156    -0.031   0.187    -0.004   0.164
           1.0      0.002   0.163    -0.042   0.206    -0.004   0.181
* Values 'Opt' of constants: c = 1.8 for NLS and c = 0.8 for NLAD.

Table 2.5: Biases and root mean squared errors in the homoscedastic probit model (n = 400) under 10% general misclassification at $\tilde x_{1i} \sim N(\pm L, 0.2)$, $\tilde x_{2i} = 0$.

Misclassification:     None            At L = ±1.5       At L = ±5.0
Estimate   c*       Bias    RMSE      Bias    RMSE      Bias    RMSE
Probit              0.009   0.118    -0.486   0.493    -1.002   1.007
KS                  0.012   0.140    -0.338   0.416     0.156   0.228
II-LS               0.012   0.122    -0.402   0.414    -0.570   0.698
II-NLS     0.5      0.009   0.144    -0.285   0.452    -0.640   0.880
           Opt      0.032   0.135    -0.361   0.381    -0.376   0.541
           1.0      0.023   0.133    -0.371   0.393    -0.706   0.837
II-NLAD    0.5      0.002   0.166    -0.036   0.193     0.012   0.253
           Opt      0.002   0.156    -0.085   0.221     0.011   0.191
           1.0      0.002   0.163    -0.107   0.233     0.013   0.189
* Values 'Opt' of constants: c = 1.8 for NLS and c = 0.8 for NLAD.


2.7 Extensions

In this section, we informally discuss two extensions of the II approach established in this paper for the binary choice model with i.i.d. data. In particular, we discuss how to generate simulated responses and possible auxiliary criteria for the binary choice model with endogenous explanatory variables and for the binary choice model with dependent data.

2.7.1 Single-index binary choice model with endogenous explanatory variables

Similar to Rothe (2009), we consider the binary choice model $Y = I\{X^T\beta_0 + \varepsilon \geq 0\}$, where a $k_e$-dimensional subvector $X_e$ of $X$ is endogenous, and assume that there exists an unobserved control variable $V \in \mathbb{R}^{k_v}$ such that $\varepsilon \perp X \mid V$ and $V = v_0(X_e, Z)$ for some function $v_0$ and a vector of observed exogenous instruments $Z \in \mathbb{R}^{k_z}$. Under such an assumption, Rothe (2009) showed that $E[Y \mid X, V] = G(\beta_0, X^T\beta_0, V)$, where $G$ denotes the conditional distribution of $\varepsilon$ given $V$. As discussed in Rothe (2009), such a control variable exists when $X_e = m(Z) + V$ with $E[V \mid Z] = 0$ for some function $m(\cdot)$. In such a case, $V$ can be consistently estimated as follows: first, we construct an estimator $\hat m(Z)$ of $m(Z)$ and then define $\hat V = X_e - \hat m(Z)$. Therefore, provided that the unobserved $V$ is consistently estimated by some estimator $\hat V$, we can modify (2.14) and (2.13) to generate the simulated responses by

$$\hat y_i^s(\beta) = K\left(\frac{\hat G_n(\beta, x_i^T\beta, \hat V_i) - u_i^s}{\lambda_n}\right), \qquad \hat G_n(\beta, t, v) = \frac{\sum_{j=1}^{n} K_{h_n}(x_j^T\beta - t)\, \tilde K_{\tilde h_n}(\hat V_j - v)\, y_j}{\sum_{j=1}^{n} K_{h_n}(x_j^T\beta - t)\, \tilde K_{\tilde h_n}(\hat V_j - v)},$$
where $\tilde K_{\tilde h_n}(\cdot) = \tilde K(\cdot/\tilde h_n)$ is a $k_v$-dimensional product kernel and $\tilde h_n \to 0$ as $n \to \infty$.

As an auxiliary criterion, we can use, for instance, OLS with the control function estimates $\hat V$ added as regressors. In particular, let $\hat w_i = (x_i^T, \hat v_i^T)^T$. Then, we can define
$$\hat\theta_n = \Big(\sum_{i=1}^{n} \hat w_i \hat w_i^T\Big)^{-1} \sum_{i=1}^{n} \hat w_i y_i$$
for the sample and
$$\hat\theta_n^s(\beta) = \Big(\sum_{i=1}^{n} \hat w_i \hat w_i^T\Big)^{-1} \sum_{i=1}^{n} \hat w_i \hat y_i^s(\beta)$$
for the simulated samples.


For this setup, the theory presented in the Appendices can accommodate the proof of consistency of the indirect inference estimator (2.29) with minor modifications. Concerning the asymptotic normality, the effect of the nonparametric estimation of $G(\beta_0, X^T\beta_0, V)$ requires an additional proof, as it now includes a possibly nonparametric estimate $\hat V$ of $V$ as an argument.
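A minimal Python sketch of this construction follows; the Gaussian kernel, the logistic cdf playing the role of the smoothing function K, and the helper name `simulate_responses_cf` are illustrative assumptions rather than the chapter's definitive implementation:

import numpy as np

def simulate_responses_cf(x, y, v_hat, beta, u, h, h_v, lam):
    # Simulated responses for the endogenous design via the control
    # function V_hat = X_e - m_hat(Z): a sketch under assumed kernel
    # and smoothing choices. v_hat has shape (n, k_v).
    def gauss(t):
        return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

    idx = x @ beta                                  # x_j'beta for all j
    # Product-kernel weights K_h(x_j'beta - x_i'beta) * K_hv(V_j - V_i).
    w = gauss((idx[None, :] - idx[:, None]) / h)
    w = w * np.prod(gauss((v_hat[None, :, :] - v_hat[:, None, :]) / h_v), axis=2)
    g_hat = (w @ y) / w.sum(axis=1)                 # Nadaraya-Watson estimate

    # Smoothed indicator of {G_hat >= u_i^s} with smoothing parameter lam.
    return 1.0 / (1.0 + np.exp(-(g_hat - u) / lam))

Here `v_hat` would come from a first stage, e.g., residuals of a regression of the endogenous regressors on the instruments: `v_hat = x_e - z @ np.linalg.lstsq(z, x_e, rcond=None)[0]` when $m$ is linear.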

2.7.2 Single-index binary choice model with dependent data

De Jong and Woutersen (2011) studied the dynamic binary-choice model and established the asymptotic properties of the MLE and the SMS estimators in a time-series context under the assumption that $(x_i, \varepsilon_i)$ is a sequence of strictly stationary strong mixing random variables. For the sake of simplicity of exposition, we discuss the adaptation of the II approach to the dynamic binary choice model of order 1:
$$y_i = I\{\rho_0 y_{i-1} + x_i^T\gamma_0 + \varepsilon_i \geq 0\}, \qquad i = 1, \ldots, n.$$

The simulated responses are now defined recursively. For a given initial value $y_0 \in \{0, 1\}$, we generate $\hat y_1^s(\beta)$ for $s = 1, \ldots, S$ as follows. Let $(\rho, \gamma^T)^T = \beta$, $\tilde x_1 = (y_0, x_1^T)^T$, and define $\tilde x_1^T\beta = \rho y_0 + x_1^T\gamma$. Then, we obtain
$$\hat y_1^s(\beta) = K\left(\frac{\hat G_n(\beta, \tilde x_1^T\beta) - u_1^s}{\lambda_n}\right), \qquad \hat G_n(\beta, \tilde x_1^T\beta) = \frac{\sum_{j=2}^{n} K_{h_n}(\rho y_{j-1} + x_j^T\gamma - \tilde x_1^T\beta)\, y_j}{\sum_{j=2}^{n} K_{h_n}(\rho y_{j-1} + x_j^T\gamma - \tilde x_1^T\beta)}.$$
Next, we define $\tilde x_2^T\beta = \rho \hat y_1^s(\beta) + x_2^T\gamma$ to simulate $\hat y_2^s(\beta)$ by
$$\hat y_2^s(\beta) = K\left(\frac{\hat G_n(\beta, \tilde x_2^T\beta) - u_2^s}{\lambda_n}\right).$$
Hence, exploiting this recursive relationship, $\hat y_i^s(\beta)$ can be generated for $i = 3, \ldots, n$ by
$$\hat y_i^s(\beta) = K\left(\frac{\hat G_n(\beta, \tilde x_i^T\beta) - u_i^s}{\lambda_n}\right), \qquad \text{where } \tilde x_i^T\beta = \rho \hat y_{i-1}^s(\beta) + x_i^T\gamma.$$
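This recursion is easy to implement; a minimal sketch, again under illustrative kernel and smoothing choices (Gaussian kernel, logistic cdf as K), is:

import numpy as np

def simulate_responses_dynamic(x, y, beta, u, h, lam, y0=0.0):
    # Recursive simulated responses for the dynamic model above.
    rho, gamma = beta[0], beta[1:]
    n = len(y)

    def gauss(t):
        return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

    # Observed index rho*y_{j-1} + x_j'gamma for j = 2, ..., n.
    obs_index = rho * y[:-1] + x[1:] @ gamma

    def g_hat(t):                           # Nadaraya-Watson estimate at t
        w = gauss((obs_index - t) / h)
        return (w @ y[1:]) / w.sum()

    y_sim = np.empty(n)
    lag = y0                                # initial value y_0
    for i in range(n):
        t = rho * lag + x[i] @ gamma        # simulated index x~_i'beta
        y_sim[i] = 1.0 / (1.0 + np.exp(-(g_hat(t) - u[i]) / lam))
        lag = y_sim[i]                      # feed the simulated lag forward
    return y_sim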


The proofs established in this paper can be adapted to establish the consistency of the indirect inference estimator in this setting as well. However, we conjecture that the proof of asymptotic normality is not trivial, as the simulated responses are not only nonparametrically estimated but also recursively defined. We leave this for future research.

2.8 Conclusion

2.A Proofs of the limit theorems

We first establish the asymptotic properties of the auxiliary estimator; afterwards, we prove the consistency and asymptotic normality of the feasible II estimator. To proceed with the proofs, we need to introduce some notation. For notational convenience, we suppress X in the auxiliary objective function and its derivatives. We denote the sample auxiliary objective function by $Q_n(Y, \theta) = n^{-1} \sum_{i=1}^{n} q(y_i, x_i; \theta)$.

When the responses are simulated at $\beta \in B$ and the function is evaluated at $\theta \in \Theta$, we use $Q_n(Y^s(\beta), \theta)$ and $Q_n(\hat Y^s(\beta), \theta)$, while the expectations of those quantities are denoted by $Q(Y^s(\beta), \theta)$ and $Q(\hat Y^s(\beta), \theta)$, respectively. Specifically, recall that $U$ is independent of $X$ and we write
$$Q(Y^s(\beta), \theta) = \int \tau(x)\, q\big(I\{G(\beta, x^T\beta) \geq u\}, x; \theta\big)\, dF_X(x)\, dF_U(u), \qquad (2.30)$$
where $F_X$ and $F_U$ denote the distribution functions of $X$ and $U$, respectively. Since $\hat G_n(\beta, x^T\beta) = \hat G_n\big(\beta, x^T\beta; \{(x_i, y_i)\}_{i=1}^n\big)$ is a random function based on the observations for a given $x$, we define $Q(\hat Y^s(\beta), \theta)$ following van der Vaart (2000, page 279) as the expected value with respect to $X$ and $U$, that is, the expectation is not taken with respect to $\hat G_n(\beta, x^T\beta)$:
$$Q(\hat Y^s(\beta), \theta) = E\left[\tau(X)\, q\left(K\left(\frac{\hat G_n(\beta, X^T\beta) - u}{\lambda_n}\right), X; \theta\right)\right] \qquad (2.31)$$
$$= \int \tau(x)\, q\left(K\left(\frac{\hat G_n(\beta, x^T\beta) - u}{\lambda_n}\right), x; \theta\right) dF_X(x)\, dF_U(u). \qquad (2.32)$$

For the sake of simplicity, we assume that $\hat G_n(\beta, X^T\beta)$ does not depend on $x_i$, for example, as in the leave-one-out Nadaraya-Watson estimator. Similarly, the first and second derivatives of $Q_n(\cdot, \theta)$ with respect to $\theta$ are denoted by $W_n(\cdot, \theta)$ and $V_n(\cdot, \theta)$, with their expectations $W(\cdot, \theta)$ and $V(\cdot, \theta)$. To further simplify notation, the binding function $b\big(G(\beta, X^T\beta), F_X\big)$ is denoted by $b(\beta)$, and we write $b(\beta_0) = \theta_0$ at $\beta = \beta_0$. Finally, we use $C, C_0, C_1, C_2 < \infty$ as generic constants that might assume a different value at each appearance, and limiting expressions are meant for $n \to \infty$ unless stated otherwise.

Lemma 1. Suppose that Assumptions 1–6 hold. Then, $\sup_{\beta \in B} \big\|\hat\theta_n^s(\beta) - b(\beta)\big\| = o_p(1)$ as $n \to \infty$.


Proof. It follows that for any given $\delta > 0$,
$$P\left(\sup_{\beta \in B} \big\|\hat\theta_n^s(\beta) - b(\beta)\big\| > \delta\right) \leq P\left(\sup_{\beta \in B} \Big(Q\big(Y^s(\beta), b(\beta)\big) - Q\big(Y^s(\beta), \hat\theta_n^s(\beta)\big)\Big) \geq \epsilon\right) \qquad (2.33)$$
for some $\epsilon > 0$. Hence, it suffices to show that the right-hand side of (2.33) is $o_p(1)$ as $n \to \infty$:
$$\begin{aligned}
\sup_{\beta \in B} \Big(Q\big(Y^s(\beta), b(\beta)\big) &- Q\big(Y^s(\beta), \hat\theta_n^s(\beta)\big)\Big) \\
&\leq \sup_{\beta \in B} \Big(Q\big(Y^s(\beta), b(\beta)\big) - Q_n\big(\hat Y^s(\beta), \hat\theta_n^s(\beta)\big)\Big) \\
&\quad + \sup_{\beta \in B} \Big(Q_n\big(\hat Y^s(\beta), \hat\theta_n^s(\beta)\big) - Q_n\big(Y^s(\beta), \hat\theta_n^s(\beta)\big)\Big) \qquad (2.34) \\
&\quad + \sup_{\beta \in B} \Big(Q_n\big(Y^s(\beta), \hat\theta_n^s(\beta)\big) - Q\big(Y^s(\beta), \hat\theta_n^s(\beta)\big)\Big) \qquad (2.35) \\
&\leq \sup_{\beta \in B} \Big(Q\big(Y^s(\beta), b(\beta)\big) - Q_n\big(\hat Y^s(\beta), \hat\theta_n^s(\beta)\big)\Big) + o_p(1) \\
&\leq \sup_{\beta \in B} \Big(Q\big(Y^s(\beta), b(\beta)\big) - Q_n\big(\hat Y^s(\beta), b(\beta)\big)\Big) + o_p(1) \\
&\leq \sup_{\beta \in B} \Big(Q\big(Y^s(\beta), b(\beta)\big) - Q_n\big(Y^s(\beta), b(\beta)\big)\Big) \qquad (2.36) \\
&\quad + \sup_{\beta \in B} \Big(Q_n\big(Y^s(\beta), b(\beta)\big) - Q_n\big(\hat Y^s(\beta), b(\beta)\big)\Big) \qquad (2.37) \\
&\quad + o_p(1).
\end{aligned}$$
The first inequality follows from the triangle inequality. Furthermore, the second inequality is obtained from Assumptions 6(iii) and 6(ii) applied to (2.34) and (2.35), respectively. The third inequality is a consequence of the definition of $\hat\theta_n^s(\beta)$ as the maximizer of $Q_n\big(\hat Y^s(\beta), \theta\big)$. Finally, the desired claim follows because (2.36) and (2.37) are also $o_p(1)$ by Assumption 6(ii) and Assumption 6(iii), respectively. $\square$

Lemma 2. Suppose that Assumptions 1–8 hold. Then it holds for n → ∞ that
