
BINARY CHOICE MODELS WITH ENDOGENOUS EXPLANATORY VARIABLES:
A MONTE CARLO INVESTIGATION

Helen van der Poll

10513671

Abstract

This thesis presents a theoretical study that compares the LPM, IV/TSLS, probit, IV probit and the Rivers and Vuong (1988) approach as methods of estimation for a binary choice model with endogenous explanatory variables. The simulated models vary in instrument strength and level of error correlation, and the Monte Carlo outcomes are compared based on estimate bias. The results suggest that when instrument strength and correlation are low, methods that ignore endogeneity perform better than methods that try to compensate for it. As endogeneity increases, IV probit and the RV approach return the least biased estimates.

Under the supervision of Dr. E. Aristodemou
Faculty of Economics and Business
University of Amsterdam
June 2018


Declaration of authenticity

This document is written by Helen Joke van der Poll, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

Abstract
1 Introduction
2 Theoretical framework
  2.1 Endogeneity
  2.2 Binary Choice Models
    2.2.1 The Linear Probability Model and IV/TSLS
    2.2.2 Index Models and Control Function Methods
  2.3 Synthetic Data and Monte Carlo Simulations
3 The Data Generating Process
  3.1 The Model
    3.1.1 Design
  3.2 Estimation Methods
    3.2.1 Linear: OLS/TSLS
    3.2.2 Nonlinear: Maximum Likelihood Estimation
4 Results
  4.1 Summary Design Parameters
  4.2 Comparison of Estimates
5 Conclusion
References
A


1 Introduction

Data is the new gold. Econometricians use statistical techniques such as regression analysis to analyse data and determine probability relationships. They make predictions about the future and test these predictions against reality. The outcomes of those analyses can be used for a wide range of purposes, like strengthening the conclusions of scientific research, detecting trends or, possibly, election interference. But what if these established relationships are troubled by model misspecification?

For models in which the dependent variable is not continuous but discrete, truncated or censored, the linearity of the conditional expectation is atypical, and the classical linear regression model is therefore usually not applicable. Different types of discontinuities emerge in different models. A discrete dependent variable commonly represents a decision; when this is a choice between one and zero, coding for yes and no respectively, the binary response model arises. An example of a binary choice model would be D = I(X′β + ε ≥ 0), in which D is an observed dummy variable that equals zero or one, X is a matrix of observed regressors, β is a vector of coefficients to be estimated, ε a vector of unobserved errors, and I(·) an indicator function that equals one if its argument is true and zero otherwise (Dong & Lewbel, 2015). In regression analysis the initial goal is to estimate β, but with binary dependent variables the interest lies mainly in computing the marginal effects of X on the choice probabilities, in other words, the probability of making a yes or no decision based upon the (nonlinear) relationship between the dependent variable and one explanatory variable (Wooldridge, 2010).

In discrete response models like the binary choice model, researchers are faced with econometric challenges due to the bounded range of the dependent variable. They often have to impose strong assumptions on the distribution of the unobservable disturbances and on the functional form of the model in order to estimate the coefficients of interest. Misspecification of these assumptions can lead to problems with identification of the parameters, estimation and finally inference, essentially the validity of the results. Deegan (1976) identifies four major causes of misspecification: inadequate theory, sampling constraints, over-reliance on "automatic" model building procedures, and multicollinearity. The remedy is to impose fewer assumptions on the models, a trade-off that yields weaker conclusions but likely more robust results.

Endogeneity can be the result of model misspecification, and a major problem with endogenous regressors is that they cause biased and inconsistent estimates. Blundell and Powell (2003) state that the main contribution of econometrics to statistical science is the analysis of data with endogenous regressors. Endogeneity arises when there is correlation between the regressors and the error term. The explanatory variables then have two effects on the dependent variable: a direct effect, and an indirect effect through the disturbance affecting the regressor, which in turn affects y. In regression analysis only the first effect is to be estimated (Cameron & Trivedi, 2005). The causes of endogeneity vary between omitted variables, measurement errors, simultaneity, selection mechanisms and misspecified dynamics (Heij et al., 2004).

The concern of this thesis is misspecification in limited dependent variable models, specifically endogeneity in binary choice models. Methods have been developed that try to eliminate this problem. In their paper, Lewbel, Dong and Yang (2012) discuss the relative advantages and disadvantages of four ways of estimating binary response models when regressors may be endogenous. The estimators they compare are the instrumental variable (or two-stage least squares) estimator, obtained by estimating a linear probability model with instrumental variables, and maximum likelihood estimators, resulting from a probit or logit regression. They also discuss control functions and an extension of the former, special regressor methods.

The aim of this paper is to examine the extent of the effects of endogeneity on the estimation of models containing a binary dependent variable under different scenarios. To do this, a theoretical study is carried out with the use of synthetic data, designed in Matlab. Varying the instrument strength and the level of correlation, various Monte Carlo simulations are run and the bias of the obtained estimates is determined and compared.

In the next section a theoretical framework for this investigation is presented, section 3 provides the design and content of the research, section 4 the results, and the final section contains the conclusions and recommendations based on the results.


2 Theoretical framework

This section gives a general outline of the econometric theory that is of importance to this study. A description of endogeneity and binary choice models is given, along with a more extensive explanation of the linear probability model and its flaws when it comes to models with binary dependent variables. Next, a nonlinear model is described. Also, methods of estimation in case of endogenous regressors are outlined, namely instrumental variables/two-stage least squares, IV probit and the Rivers-Vuong approach, all of which can be seen as control function methods. Finally, the concept of Monte Carlo simulation is explained.

2.1 Endogeneity

In econometrics, a regressor x_j is said to be endogenous if it is correlated with the disturbance ε and exogenous if it is uncorrelated with ε; endogeneity thus means that cov(x_j, ε) ≠ 0. If this is the case, it is problematic to isolate the effect of x_j on y, because deviations in x_j are related to deviations in y both through X′β and through the disturbance. This entails that the ordinary least squares estimates are no longer consistent and that the conventional results (t-test, F-test, etc.) are no longer valid. If an estimate is inconsistent, it is purely and simply uninterpretable (Antonakis et al., 2014).

In their paper, Antonakis et al. (2014) state common causes of endogeneity, such as omitted variable bias, simultaneity and measurement errors. Omitted variable bias arises if a variable is correlated with an independent variable in the model as well as with the disturbance term. This can occur due to self-selection or, for example, when a regressor z_i from the true model (1) is dropped, perhaps because no data on it is available, so that it is absorbed by the disturbance in the estimated model (2):

y_i = α + βx_i + γz_i + u_i                               (1)

y_i = α + βx_i + ε_i,    ε_i = γz_i + u_i                 (2)

Simultaneity takes place when two variables simultaneously affect each other, or when an explanatory variable is potentially caused by the dependent variable (reverse causality). Finally, measurement errors in the independent variables can also cause endogeneity, for example if only x_i = x*_i + v_i can be observed instead of x*_i, but the model is written in terms of the observed x_i anyway:

y_i = α + β(x_i − v_i) + ε_i
y_i = α + βx_i + (ε_i − βv_i)
y_i = α + βx_i + u_i,    u_i = ε_i − βv_i                 (3)

From (3) it is obvious that x_i and u_i both depend on v_i and are therefore correlated, causing biased and inconsistent estimates. It should be noted that if a measurement error occurs in the dependent variable, no endogeneity problems arise.
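To see this bias concretely, the following minimal Matlab sketch (not part of the thesis code; all names and values are illustrative) simulates classical measurement error in a single regressor and shows the OLS slope shrinking towards zero:

% Illustrative only: attenuation bias under classical measurement error
rng(1);
N = 1e5; beta = 1;
x_star = randn(N,1);               % true regressor
v = randn(N,1);                    % measurement error
x = x_star + v;                    % observed, mismeasured regressor
y = beta*x_star + randn(N,1);      % the true model uses x_star
Xm = [ones(N,1) x];
b = (Xm'*Xm)\(Xm'*y);              % OLS of y on the observed x
disp(b(2))                         % approx. 0.5 here, not the true beta = 1

With var(x*) = var(v) = 1, the probability limit of the slope is beta·var(x*)/(var(x*) + var(v)) = 0.5 rather than the true beta = 1.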

2.2 Binary Choice Models

In limited response models the dependent variable can only take on a finite number of outcomes. The most limited case is the binary choice model, in which the dependent variable can only take on the values zero and one, indicating whether or not a certain event has happened. Let y_i denote the binary dependent variable. It is traditional to refer to y_i = 1 as a success and y_i = 0 as a failure (Wooldridge, 2010).

Binary choice models are commonly used to formulate models that contain a latent variable (Heij et al., 2004). Let y*_i be a continuous variable that is not observed, and assume that y*_i is determined by the model. A binary choice model is then drafted as follows:

y*_i = x′_i β + ε_i
y_i  = 1 if y*_i > 0, and 0 otherwise                     (4)

In the next sections, models that can contain binary dependent variables, and methods of estimation for these models when they are troubled by endogeneity, are explained.

2.2.1 The Linear Probability Model and IV/TSLS

In the linear probability model for a binary response variable y_i, the parameters are estimated by linearly regressing y on X by ordinary least squares. The LPM therefore assumes that y_i is linear in parameters, y_i = x′_i β + ε_i, and that the regressors are exogenous. Estimation is also possible by linear two-stage least squares with instrument matrix Z if some regressors are endogenous. In that case, all the LPM requires is the existence of Z and that E(X′Z) has full rank (Lewbel, Dong & Yang, 2012).

There are some problems with the LPM; the most commonly recognized is that of unbounded predicted probabilities. The linear probability model ignores a principal concept of probability theory, namely that the probability of any outcome must lie within the interval [0,1]. However, some of the fitted probabilities can take on values below zero or above one. This is not a problem with continuous dependent variables, but in the case of a binary dependent variable it can be seen from (5) that the conditional expectation represents a probability. The linear probability model for a binary dependent variable y_i is specified as:

E(y|x; β) = P(y = 1|X) = X′β                              (5)

Another problem with this definition of the LPM for a binary dependent variable is that the residual is heteroskedastic by definition. For any value X takes on, ε must equal 1 − X′β or −X′β. Hence, var(ε) = X′β[1 − X′β]. The classical linear regression model requires homoskedastic error terms for the estimates to be efficient and BLUE; a violation of this assumption means the Gauss-Markov theorem no longer applies. This is easily solved by obtaining estimates of the standard errors, like White-Huber standard errors, that are robust to heteroskedasticity.
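As an illustration, such robust standard errors can be computed in a few lines. The sketch below uses hypothetical names, assumes a regressor matrix X (including a constant) and a binary y are already in memory, and relies on implicit expansion (Matlab R2016b or later):

% Sketch: White-Huber heteroskedasticity-robust standard errors for the LPM
b = (X'*X)\(X'*y);                 % LPM / OLS estimates
e = y - X*b;                       % residuals
Xe = X .* e;                       % each row of X scaled by its residual
V = (X'*X) \ (Xe'*Xe) / (X'*X);    % sandwich variance estimator
se_robust = sqrt(diag(V));         % robust standard errors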

Further, the residual ε cannot be normally distributed. Because y_i can only take on two values, ε is a random variable with a discrete distribution. The assumption of normally distributed errors is critical for calculating the relevant test statistics for hypothesis tests after estimation.

IV/TSLS  If in a linear probability model the explanatory variables in X are correlated with the error terms of the regression model, OLS gives biased and inconsistent estimates. In this case one might resort to instrumental variables for consistent estimation, provided there exists an instrument z that does not itself belong in the explanatory equation, but is associated with changes in x and does not lead to (direct) changes in y (Cameron & Trivedi, 2005). In other words, the instrument matrix Z must meet the following requirements:

• Exogeneity: Cov(Z, ε) = 0. Z has to be exogenous. If this requirement, also referred to as the exclusion restriction, is not met, Z is an invalid instrument.

• Relevance: Cov(Z, X) ≠ 0. A strong correlation between the instruments and the endogenous regressor entails a strong first stage and better estimates; weak instruments, on the other hand, can cause biased IV estimators (Stock & Yogo, 2005).

• Order condition: the number of instruments must be at least equal to the number of endogenous components in X; if not, the model is not identified.

The selection of valid instruments is done by derivation from the data generating process. The exogeneity requirement, however, cannot be derived from the data because ε is unobserved; the choice of instrument should therefore be substantiated through expert knowledge. The rule of thumb for instrument strength and relevance is derived from Stock and Yogo's paper (2005) and entails that the F statistic for the joint significance of the instruments should be at least 10.
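As an illustration of this rule of thumb, the overall F statistic of the first-stage regression can be computed as below (a sketch with hypothetical names; x is the endogenous regressor and Z the instrument matrix including a constant):

% Sketch: first-stage overall F statistic vs. the Stock-Yogo threshold
[N, k] = size(Z);
g = (Z'*Z)\(Z'*x);                               % first-stage OLS
R2 = 1 - sum((x - Z*g).^2)/sum((x - mean(x)).^2);
F = (R2/(k-1)) / ((1 - R2)/(N - k));             % overall F of the regression
weak = F < 10;                                   % rule-of-thumb weak-instrument flag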

If the number of instruments in Z is equal to the number of endogenous regressors in X, the model is just-identified. In that case, the IV estimator (6) is consistent:

β̂_IV = (Z′X)⁻¹ Z′y                                       (6)

For over-identified models the IV estimator can still be used, provided that a few instruments from Z are dropped so the model becomes just-identified, but this can result in an efficiency loss (Cameron & Trivedi, 2005). Instead, the two-stage least squares (TSLS) estimator is widely used in econometrics. This is an IV estimator, but it relies on an auxiliary regression to combine multiple instruments in the case of over-identification, producing the best instrument needed to implement IV.

Stage 1: Regress X on Z (X = Zδ + η) and save the predicted values:

δ̂ = (Z′Z)⁻¹ Z′X
X̂ = Zδ̂ = Z(Z′Z)⁻¹ Z′X = P_Z X                            (7)

Stage 2: Regress Y on the predicted values from the first stage:

Y = X̂β + ζ  ⟹  β̂_TSLS = (X′P_Z X)⁻¹ X′P_Z Y             (8)
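In code, the two stages reduce to a few lines. The sketch below uses hypothetical names, assumes y, X and Z are in memory (with Z containing all exogenous columns of X plus the instruments), and avoids forming the N×N projection matrix P_Z explicitly:

% Sketch of TSLS as in (7)-(8)
X_hat = Z*((Z'*Z)\(Z'*X));          % stage 1: fitted regressors, X_hat = P_Z*X
b_tsls = (X_hat'*X)\(X_hat'*y);     % stage 2: numerically equal to (8)

Because P_Z is symmetric and idempotent, X̂′X = X′P_Z X and X̂′y = X′P_Z y, so the last line coincides with the closed form in (8).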

2.2.2 Index Models and Control Function Methods

Logit and probit regression models with binary dependent variables are categorized as index models and are used for nonlinear estimation. They require the functional form of G in (9) to be, respectively, a standard logistic or a standard normal cumulative distribution function, see Appendix A1 and A2. Because G is a CDF, the probability expression in (9) is contained within the interval [0,1]; index models thereby eliminate the main disadvantage of the LPM. However, the assumption that the distribution of the dependence on the explanatory variables is known is seldom true. If the functional form is misspecified, the obtained estimates, and the conclusions drawn from them, are unfounded (Horowitz & Savin, 2001). In that case, a semiparametric (as opposed to parametric) estimation method as discussed by Blundell and Powell (2004) can be used; these are beyond the scope of this study.

E(y|x; β) = P(y = 1|X) = G(X′β)                           (9)

The logit and probit models are almost identical, because the CDFs of the two models differ only slightly in the tails of their distributions, see Appendix A3. This is due to the assumption on the distribution of the errors: the underlying variances of the logistic and normal distributions are different. Thus the logit and probit coefficients are scaled differently and consequently cannot be compared directly. Generally, logit coefficients are about π/√3 times larger than probit coefficients, because of the normalization of the error variance in the logit model (Train, 2009). In the end the choice between a logit and a probit model comes down to personal preference. For the continuation of this study, the focus lies on probit models.

Maximum likelihood estimation is commonly used to estimate the parameters in nonlinear models. Assuming that all observations are independent, this method selects the set of values for the parameters that maximizes the likelihood function and is asymptotically efficient (Horowitz & Savin, 2001). It also suffices to maximize the natural logarithm of the likelihood function, the log likelihood function, due to the monotonically increasing nature of the natural logarithm function. If y is a binary dependent variable, the log likelihood function is:

ℓ(y|x; β) = Σ_{i=1}^{N} { y_i ln G(x′_i β) + (1 − y_i) ln [1 − G(x′_i β)] }     (10)

The principal question in MLE is: what value of β would make this sample most likely? To answer this question and to obtain the maximum likelihood estimator β̂_ML, the likelihood function is maximized and it is checked whether the first and second order conditions for a maximum are satisfied. The first order condition is a set of equations of the form:

Σ_{i=1}^{N} [ y_i g(x′_i β)/G(x′_i β) − (1 − y_i) g(x′_i β)/(1 − G(x′_i β)) ] x_i = 0     (11)

Oftentimes there is no analytic way to find the solution to these FOCs, so one has to rely on numerical optimization algorithms. Due to the globally concave shape of the log likelihood function, these usually converge well to the unique global maximum. The maximum likelihood estimator β̂_ML is consistent and asymptotically normally distributed. Yet, for ML estimation of the probit model the latent error term is assumed to be normally distributed and homoskedastic; in case of heteroskedasticity or nonnormality β̂_ML is inconsistent.

Control Function Methods  Issues that arise in linear models with continuous endogenous regressors apply to nonlinear models as well: the usual probit ML estimator becomes inconsistent, but just as with the LPM, two-step estimation methods are available (Adkins, 2008). Heckman (1979) first introduced the methodology that is now known as the control function method, although he focused primarily on probit selection models, which are not discussed in this study.

The control function approach relies on the same identification conditions as standard IV methods and leads to the two-stage least squares estimator if the endogenous variables in the model are linear in parameters. In (7) the OLS estimates from stage 1 are CF estimates; the inclusion of the fitted values in the stage 2 estimation "controls" for the endogeneity of x_j in the original equation. The endogenous explanatory variables are rendered appropriately exogenous by adding the right control functions from the first stage (Wooldridge, 2015).

In contrast to IV estimators, which only require exogeneity and relevance of the instruments, CF estimators, just like MLE, also require exactly the right set of instruments and a correctly specified, though not necessarily parametric, structural form for the first-stage model (Lewbel, Dong & Yang, 2012). Another requirement of control functions is that the endogenous covariates have to be continuous; if not, the disturbance in the first stage usually cannot be normally distributed and is heteroskedastic by definition. In two-stage least squares estimation they can be discrete, continuous or some mixture of both.

Still, the CF approach is widely used for nonlinear models where the first stage is in linear reduced form. The estimates are consistent and asymptotically normal under the correct assumptions. Adkins (2008) compares various instrumental variable probit estimators, including the IV probit estimator, an estimator developed by Rivers and Vuong, and Newey's minimum chi-squared estimator (AGLS).

IV probit  The IV probit estimator uses the predicted values from a first-stage least squares regression of the endogenous explanatory variable on the exogenous explanatory variables. In the second stage these predicted values replace the endogenous variable in the final MLE estimation of a regular probit model.

RV approach  In their paper, Rivers and Vuong (1988) suggest adding the least squares residuals from the first stage of IV probit estimation, in addition to the predicted value of the endogenous regressor, to the final MLE estimation. They hereby assume homoskedastic normal errors for the reduced form equation. An extension of the RV approach is Amemiya's generalized least squares estimator as introduced by Newey (1987). This method has become a standard in cases where maximum likelihood estimation is difficult to obtain, but it is beyond the scope of this study.
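A minimal sketch of the RV two-step idea is given below: the first-stage residuals enter the probit as an extra regressor whose coefficient is estimated along with β. This is the textbook formulation; the implementation in Appendix B2-B3 instead adds the scaled residuals λη̂ directly to the index. Names are illustrative and x, Z, X and y are assumed in memory:

% Sketch: Rivers-Vuong with the residuals as an additional regressor
g = (Z'*Z)\(Z'*x);                  % stage 1: reduced form OLS
eta_hat = x - Z*g;                  % first-stage residuals
Xrv = [X, eta_hat];                 % augmented regressor matrix
nll = @(b) log_lik_p(y, Xrv, b);    % reuse the probit log likelihood (B1)
b_rv = fminunc(nll, zeros(size(Xrv,2),1), optimset('Display','off'));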


2.3 Synthetic Data and Monte Carlo Simulations

Computer-based simulation methods can create synthetic data for researchers to test their theories on. The benefit of a theoretical investigation over an empirical one is that for a synthetic data set all parameters are known and set to the specific needs of the investigator. Therefore, any alteration in the outcome of a regression across different estimation methods can be directly attributed to the change made by the investigator. Due to the rapid growth in computing power and the development of quantum computers, immensely large data sets can be created that are designed solely for one project.

Monte Carlo simulations  Metropolis and Ulam (1949) developed the simulation technique now known as the Monte Carlo method. Most computer simulations are deterministic; a Monte Carlo simulation instead evaluates a deterministic model repeatedly, using sets of random numbers as inputs and thereby mimicking samples from an actual population. It can "provide a thorough understanding of the repeated sample and sampling distribution concepts, which are crucial to an understanding of econometrics" (Kennedy, 2003).

Monte Carlo simulations can be used in situations in which the result of a single simulation is insufficiently representative of the realistically expected outcome, given the variation in the input values, or when the variation in the input values can be reliably estimated.

A Monte Carlo simulation is conducted in three phases (Adkins & Gade, 2012). First, the data generating process is modelled: the probability distributions of the inputs are determined and, for every individual simulation run, a new set of input values is drawn. Secondly, the simulations are executed. Finally, all the outcomes are collected and summarised. A minimal sketch of this structure is given below.
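The skeleton below mirrors the three phases on a deliberately trivial DGP (illustrative names only; the full script for this study is in Appendix B3):

% Sketch: the three phases of a Monte Carlo simulation
rng(1); sim = 1000; N = 500;
est = zeros(1,sim);
for s = 1:sim
    x = randn(N,1);                  % phase 1: draw inputs from the DGP
    y = 2*x + randn(N,1);
    est(s) = (x'*x)\(x'*y);          % phase 2: run the estimator
end
bias = mean(est) - 2                 % phase 3: summarise the outcomes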


3 The Data Generating Process

In this section the model, the data generating process and the estimation methods used to examine the severity of endogeneity in binary choice models are explained. The variation in parameters finally allows a comparison between the estimators of models that are troubled by endogeneity (Guilkey & Lance, 2014). In this study two sample sizes are considered, N = 500 and N = 5000, and the number of simulations is set at 1000. All code is written in Matlab and can be found in Appendix B1-B3.

3.1 The Model

The model that is considered stems from the model used by Adkins (2008). The second equation in (12) is the reduced form equation for x_i in matrix form and displays the relationship between the endogenous regressors and the instruments. The model contains a latent dependent variable y*_i and is concerned with the following structure:

y*_i = X′β + ε
x_i  = Z′γ + η                                            (12)

In this model i = 1, ..., N, X = [X1 X2 X3] is an N×(k1 + p) matrix, with X2 an N×p matrix of p endogenous variables and [X1 X3] an N×k1 matrix of k1 exogenous variables. The instrument matrix Z = [Z1 Z2 Z3] is then an N×(k1 + k2) matrix, with Z3 an N×k2 matrix containing the k2 additional instruments. β′ = [β1 β2 β3] and γ′ = [γ1 γ2 γ3] are vectors containing the regression parameters, and ε and η are vectors of disturbances. The observed binary dependent variable y_i = 1 if y*_i ≥ 0 and 0 otherwise. In case p = k2 the model is just-identified; the over-identified case is not considered in this study.


Dimension   Chosen value
p           1
k1          2
k2          1

Table 1: Chosen matrix dimensions for the data generating process

3.1.1 Design

y*_i = β1 + β2 x_i + β3 z_i + ε_i                         (13)

x_i  = γ1 + γ2 y*_i + γ3 w_i + η_i                        (14)

Applying the dimensions in table 1, the model examined contains one continuous endogenous regressor x_i (p = 1), a constant and one exogenous regressor. Also k2 = 1, so Z3 contains one additional instrument. Given these dimensions, X and the instrument matrix Z essentially are:

    | 1  x_1  z_1 |        | 1  z_1  w_1 |
X = | 1  x_2  z_2 |    Z = | 1  z_2  w_2 |                (15)
    | .   .    .  |        | .   .    .  |
    | 1  x_N  z_N |        | 1  z_N  w_N |

The exogenous variables X3 = z_i and Z3 = w_i are drawn from the multivariate normal distribution with zero means, variances equal to 1 and covariance 0.5. To generate correlation between the endogenous regressor x_i and the regression error, the disturbances are created by (16), where η_i and ξ_i are drawn from the standard normal distribution and λ is varied on the interval [-1, 1]:

ε_i = λη_i + ξ_i                                          (16)

A new parameter θ is introduced to vary the strength of the instruments in (14), by scaling γ to θγ; θ is varied on the interval [0.05, 1]. Next, the parameter values in γ and the true β are chosen, see table 2. These values of β are used to compute the mean squared errors with which the estimates of the different methods are finally compared.

Parameter   Chosen values
β           [0, -0.8, 1]
γ           [1, -0.5, -1]

Table 2: Chosen (true) parameter values for the data generating process

Then the endogenous regressor x_i is calculated by substituting (13) into (14) and solving for x_i:

x_i = [γ1 + γ2(β1 + β3 z_i + ε_i) + γ3 w_i + η_i] / (1 − γ2 β2)     (17)

From (17) it is obvious that x_i is correlated with the disturbance ε and is therefore an endogenous regressor.

3.2 Estimation Methods

To use the estimation methods below, first a binary dependent variable is created from the latent variable in the model: if the value of y*_i is greater than or equal to zero, the binary dependent variable equals 1, and if it is less than zero, 0. These values are saved in a vector called y_true_binary, as can be seen in Appendix B3. Just like the values in the X and Z matrices, y_true_binary is treated as observed and is therefore used for all further estimation; it will be referred to as y below. The following subsections exemplify the expressions used to produce the results in section 4.

3.2.1 Linear: OLS/TSLS

In the case of endogeneity, OLS estimation provides inconsistent estimates; these will nevertheless be used as a reference:

β̂_OLS = (X′X)⁻¹ X′y                                      (18)

The model is just-identified, so the IV estimator (6) is sufficient.

3.2.2 Nonlinear: Maximum Likelihood Estimation

As stated before, analytically solving the equations (11) that make up the FOCs of a maximum is usually not possible. Therefore a function has been developed to numerically find the maximum likelihood and the accompanying estimates, see Appendix B1. This function is used to compute the probit ML estimates in a regular probit setting, but also for the IV probit method. For the Rivers-Vuong approach the scaled first-stage residuals are added to the index, see Appendix B2. Next, the first and second stages of the CF methods used are described.

IV Probit

Stage 1: Estimate (14) with OLS and save the predicted values of the endogenous regressor x_i:

γ̂_OLS = (Z′Z)⁻¹ Z′x
x̂ = Z γ̂_OLS                                              (19)

Stage 2: Replace the endogenous regressor in X by the fitted values from (19), so X̃ = [X1 X̂2 X3], and use maximum likelihood estimation to find β̂:

P(y = 1|X) = Φ(X̃′β)                                      (20)

RV approach

Stage 1: Equivalent to stage 1 of IV probit, but now also compute the least squares residuals ˆη = xi− Z0γ as well.ˆ

Stage 2: Use maximum likelihood estimation to find ˆβ:


4 Results

In the following section the results obtained from the Monte Carlo simulations are discussed. Tables A4-A7 in Appendix A show the mean squared errors per estimator, categorized by design. To increase readability the results are printed twice, first with the varying instrument strength θ in the leftmost column, and then with the level of correlation, determined by λ in (16), leftmost. As stated in section 3, two sample sizes are considered, N = 500 and N = 5000, and both have been subjected to 1000 simulation runs. The average number of 1's in a single simulation is 189 and 1989 respectively, about 37.8% for both sample sizes.

4.1 Summary Design Parameters

To better grasp the meaning of varying the instrument strength in the reduced form equation (14), the coefficient of determination R² and the overall F statistic of the regression are computed and shown in table 3 per value of θ.

θ            0.05      0.1       0.25      0.5       1
N = 500
  R²         0.00253   0.01070   0.07496   0.25885   0.59227
  Overall F  0.6       2.7       20        87        361
N = 5000
  R²         0.00715   0.02208   0.10771   0.31400   0.64057
  Overall F  0.6       2.7       20        87        361

Table 3: Goodness of fit per value of θ for the reduced form of the endogenous explanatory variable x_i

Stock and Yogo (2005) state that instruments should be considered 'weak' if the overall F statistic is below 10. For the remainder of this study this rule of thumb is maintained to distinguish between weak and strong instruments. Accordingly, the instruments used in the Monte Carlo simulations with θ = 0.05 and θ = 0.1 are labelled weak.


The second design parameter is λ. It can be seen from table 4 that when λ = 0 there is no correlation between the disturbance η in the reduced form equation (14) and the disturbance ε in the regression equation.

λ            1       0.5     0     -0.5     -1
corr(ε, η)   0.705   0.442   0     -0.453   -0.710

Table 4: Level of correlation between ε and η

4.2 Comparison of Estimates

All comments below refer to values that can be found in tables A4 - A7 in Appendix A.

In case there is no endogeneity, it is to be expected that regular estimation methods like probit maximum likelihood and LPM ordinary least squares perform well. Indeed, when λ = 0 the probit estimates for both sample sizes are almost unbiased and therefore by far the most accurate. The LPM estimates, however, are considerably more biased: overall their bias lies within the interval 0.39-0.55, in both sample sizes, for all values of λ and θ. It is peculiar that the bias in the LPM and probit estimates increases as the instrument strength increases. Furthermore, it should be noted that when the error correlation and instrument strength are low, so when endogeneity is less evident, the methods that try to correct for endogeneity seemingly do not perform as well as methods that ignore it.

IV estimation performs rather disappointingly. As the instrument strength increases the bias decreases, but it remains the estimation method with the highest overall bias for both sample sizes. The high bias of LPM and IV/TSLS is most likely due to the binary dependent variable and the problems, described in section 2, that come with estimating linear models for limited dependent variables.

The nonlinear estimation methods perform better. For both sample sizes the ML estimates of a regular probit model are the least biased when instrument strength is low (θ = 0.05 and θ = 0.1). The turning point is θ = 0.25; from then on both CF methods return the least biased estimates in comparison to the other methods. Both IV probit and the Rivers-Vuong approach appear to produce better estimates when N = 5000 than when N = 500. This might be due to the weak instruments in the case N = 500 with θ = 0.05 or θ = 0.1.


5 Conclusion

The aim of this study was to examine the severity of endogeneity in models with binary dependent variables under different scenarios. Based on the outcome of a Monte Carlo experiment with 1000 simulations, a comparison was made in terms of the mean squared errors of the obtained estimates.

Although the linear probability model has multiple known disadvantages, like fitted probabilities outside the unit interval and nonnormal, heteroskedastic errors, this was the first model that was looked at. Ordinary least squares and two-stage least squares estimates have been derived for this model.

A more logical way to represent binary choice models is through index models, like probit and logit. Because the functional form of these models depends on a cumulative distribution function, the fitted probabilities cannot fall outside the unit interval by definition, which removes the main disadvantage of the linear probability model. However, the assumption of fully knowing the distribution of the disturbances is a rather strong one.

Besides the linear IV model, the two other methods of estimation that have been looked at are known as control function methods, first introduced by Heckman (1979). IV probit and the method developed by Rivers and Vuong (1988) are both two-step estimation methods that rely on fitted values of the endogenous regressor and first-stage residuals to control for the endogeneity in the regression. The model used for the data generating process was derived from the model used by Adkins (2008) and was varied through two design parameters, namely θ, which controlled the instrument strength, and λ, which determined the level of correlation in the model.

The results of the simulations were not surprising in the sense that when there was no correlation, the probit ML estimator displayed the least bias. The linear estimation methods, LPM and IV, performed rather disappointingly, which might be due to the problems described for linear estimation of models with limited dependent variables. Overall, when correlation and instrument strength are low, methods that ignore endogeneity perform better than methods that try to compensate for it. As endogeneity increases, IV probit and the RV approach return the least biased estimates.

For further study it is advised to look beyond fully parametric methods at semi- and nonparametric methods of estimation. These methods usually impose fewer assumptions on the structural model and are therefore perhaps more reliable.


References

Adkins, L. C., & Gade, M. N. (2012). Monte Carlo experiments using Stata: A primer with examples. In 30th Anniversary Edition (pp. 429-477). Emerald Group Publishing Limited.

Adkins, L. C. (2008). Small sample performance of instrumental variables probit estimators: A Monte Carlo investigation. In Joint Statistical Meetings Proceedings, Business and Economics Statistics Section. American Statistical Association.

Antonakis, J., Bendahan, S., Jacquart, P., & Lalive, R. (2014). Causality and endogeneity: Problems and solutions. In The Oxford Handbook of Leadership and Organizations (pp. 93-117). Oxford University Press.

Blundell, R., & Powell, J. L. (2003). Endogeneity in nonparametric and semiparametric regression models. Econometric Society Monographs, 36, 312-357.

Blundell, R. W., & Powell, J. L. (2004). Endogeneity in semiparametric binary response models. The Review of Economic Studies, 71(3), 655-679.

Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and Applications. Cambridge University Press.

Deegan Jr, J. (1976). The consequences of model misspecification in regression analysis. Multivariate Behavioral Research, 11(2), 237-248.

Dong, Y., & Lewbel, A. (2015). A simple estimator for binary choice models with endogenous regressors. Econometric Reviews, 34(1-2), 82-105.

Guilkey, D. K., & Lance, P. M. (2014). Program impact estimation with binary outcome variables: Monte Carlo results for alternative estimators and empirical examples. In Festschrift in Honor of Peter Schmidt (pp. 5-46). Springer.

Heckman, J. J. (1979). Sample selection bias as a specification error. Econometrica, 47(1), 153-161.

Heij, C., De Boer, P., Franses, P. H., Kloek, T., & Van Dijk, H. K. (2004). Econometric Methods with Applications in Business and Economics. Oxford University Press.

Horowitz, J. L., & Savin, N. (2001). Binary response models: Logits, probits and semiparametrics. Journal of Economic Perspectives, 15(4), 43-56.

Kennedy, P. (2003). A Guide to Econometrics. MIT Press.

Lewbel, A., Dong, Y., & Yang, T. T. (2012). Comparing features of convenient estimators for binary choice models with endogenous regressors. Canadian Journal of Economics/Revue canadienne d'économique, 45(3), 809-829.

Metropolis, N., & Ulam, S. (1949). The Monte Carlo method. Journal of the American Statistical Association, 44(247), 335-341.

Newey, W. K. (1987). Efficient estimation of limited dependent variable models with endogenous explanatory variables. Journal of Econometrics, 36(3), 231-250.

Rivers, D., & Vuong, Q. H. (1988). Limited information estimators and exogeneity tests for simultaneous probit models. Journal of Econometrics, 39(3), 347-366.

Stock, J., & Yogo, M. (2005). Testing for weak instruments in linear IV regression. In D. W. Andrews (Ed.), Identification and Inference for Econometric Models (pp. 80-108). Cambridge University Press.

Train, K. E. (2009). Discrete Choice Methods with Simulation. Cambridge University Press.

Wooldridge, J. M. (2010). Econometric Analysis of Cross Section and Panel Data. MIT Press.

Wooldridge, J. M. (2015). Control function methods in applied econometrics. Journal of Human Resources, 50(2), 420-445.


A

Φ(X′β) = (1/√(2π)) ∫_{−∞}^{X′β} e^{−t²/2} dt

A1: CDF of the standard normal distribution

Λ(X′β) = e^{X′β} / (1 + e^{X′β})

A2: CDF of the standard logistic distribution

A3: Plot of the cumulative distribution functions of, respectively, a logistic and a standard normal distribution


Design        Estimators
θ      λ      LPM      IV/TSLS   probit   IV probit   RV
0.05   1      0.4897   0.7931    0.3183   18.4790     0.0607
0.05   0.5    0.4257   0.7907    0.0746   0.8607      0.4437
0.05   0      0.3979   0.7716    0.0004   0.3890      0.3861
0.05   -0.5   0.3952   0.7464    0.0969   0.1018      0.0404
0.05   -1     0.4028   0.7256    0.3638   0.4096      0.2370
0.1    1      0.4790   0.7803    0.3004   0.0375      0.1512
0.1    0.5    0.4228   0.7778    0.0667   0.0023      0.0009
0.1    0      0.4024   0.7601    0.0011   0.0582      0.0582
0.1    -0.5   0.4053   0.7372    0.1050   1.1146      1.0858
0.1    -1     0.4165   0.7183    0.3667   0.1745      0.0684
0.25   1      0.4507   0.7428    0.2382   0.0065      0.0076
0.25   0.5    0.4203   0.7411    0.0441   0.0090      0.0062
0.25   0      0.4202   0.7285    0.0053   0.0478      0.0478
0.25   -0.5   0.4365   0.7118    0.1239   0.1143      0.0654
0.25   -1     0.4558   0.6975    0.3308   0.1789      0.0497
0.5    1      0.4350   0.6859    0.1238   0.0245      0.0010
0.5    0.5    0.4384   0.6864    0.0161   0.0276      0.0240
0.5    0      0.4591   0.6806    0.0186   0.0740      0.0740
0.5    -0.5   0.4854   0.6713    0.1510   0.1405      0.0749
0.5    -1     0.5101   0.6639    0.3108   0.2033      0.0362
1      2      0.5578   0.6052    0.2444   0.2055      0.1418
1      1      0.5067   0.6042    0.0503   0.0907      0.0483
1      0.5    0.5126   0.6049    0.2512   0.0948      0.0874
1      0      0.5316   0.6054    0.7416   0.1495      0.1495
1      -0.5   0.5542   0.6058    1.0201   0.2163      0.1000

A4: Mean squared errors per estimator, N = 500, ordered by instrument strength θ


Design        Estimators
θ      λ      LPM      IV/TSLS   probit   IV probit   RV
0.05   1      0.4892   0.7919    0.3158   0.0073      0.0700
0.05   0.5    0.4256   0.7898    0.0745   0.0016      0.0011
0.05   0      0.3981   0.7702    0.0002   0.0296      0.0296
0.05   -0.5   0.3957   0.7464    0.0920   0.1047      0.0654
0.05   -1     0.4032   0.7260    0.3235   0.1862      0.0839
0.1    1      0.4782   0.7791    0.2974   0.0014      0.0253
0.1    0.5    0.4227   0.7769    0.0667   0.0031      0.0013
0.1    0      0.4026   0.7591    0.0009   0.0339      0.0339
0.1    -0.5   0.4056   0.7374    0.0996   0.0964      0.0551
0.1    -1     0.4168   0.7187    0.3052   0.1617      0.0511
0.25   1      0.4502   0.7420    0.2348   0.0074      0.0061
0.25   0.5    0.4203   0.7405    0.0444   0.0104      0.0076
0.25   0      0.4206   0.7275    0.0048   0.0482      0.0482
0.25   -0.5   0.4366   0.7115    0.1155   0.1119      0.0623
0.25   -1     0.4560   0.6974    0.2131   0.1746      0.0442
0.5    1      0.4343   0.6869    0.1245   0.0253      0.0012
0.5    0.5    0.4386   0.6862    0.0166   0.0298      0.0264
0.5    0      0.4595   0.6796    0.0207   0.0769      0.0769
0.5    -0.5   0.4859   0.6713    0.1972   0.1435      0.0767
0.5    -1     0.5100   0.6638    0.4419   0.2045      0.0365
1      2      0.5574   0.6048    0.2469   0.2076      0.1462
1      1      0.5064   0.6045    0.0838   0.0926      0.0531
1      0.5    0.5130   0.6047    0.8209   0.0987      0.0919
1      0      0.5323   0.6051    1.0634   0.1540      0.1540
1      -0.5   0.5544   0.6056    1.1277   0.2194      0.1020

A5: Mean squared errors per estimator, N = 5000, ordered by instrument strength θ


Design        Estimators
λ      θ      LPM      IV/TSLS   probit   IV probit   RV
2      1      0.5578   0.6052    0.2444   0.2055      0.1418
1      0.05   0.4897   0.7931    0.3183   18.4790     0.0607
1      0.1    0.4790   0.7803    0.3004   0.0375      0.1512
1      0.25   0.4507   0.7428    0.2382   0.0065      0.0076
1      0.5    0.4350   0.6859    0.1238   0.0245      0.0010
1      1      0.5067   0.6042    0.0503   0.0907      0.0483
0.5    0.05   0.4257   0.7907    0.0746   0.8607      0.4437
0.5    0.1    0.4228   0.7778    0.0667   0.0023      0.0009
0.5    0.25   0.4203   0.7411    0.0441   0.0090      0.0062
0.5    0.5    0.4384   0.6864    0.0161   0.0276      0.0240
0.5    1      0.5126   0.6049    0.2512   0.0948      0.0874
0      0.05   0.3979   0.7716    0.0004   0.3890      0.3861
0      0.1    0.4024   0.7601    0.0011   0.0582      0.0582
0      0.25   0.4202   0.7285    0.0053   0.0478      0.0478
0      0.5    0.4591   0.6806    0.0186   0.0740      0.0740
0      1      0.5316   0.6054    0.7416   0.1495      0.1495
-0.5   0.05   0.3952   0.7464    0.0969   0.1018      0.0404
-0.5   0.1    0.4050   0.7372    0.1050   1.1146      1.0858
-0.5   0.25   0.4365   0.7118    0.1239   0.1143      0.0654
-0.5   0.5    0.4854   0.6713    0.1510   0.1405      0.0749
-0.5   1      0.5542   0.6058    1.0201   0.2163      0.1000
-1     0.05   0.4028   0.7256    0.3638   0.4096      0.2370
-1     0.1    0.4165   0.7183    0.3667   0.1745      0.0684
-1     0.25   0.4558   0.6975    0.3308   0.1789      0.0497
-1     0.5    0.5101   0.6639    0.3108   0.2033      0.0362

A6: Mean squared errors per estimator, N = 500, ordered by level of correlation λ


Design        Estimators
λ      θ      LPM      IV/TSLS   probit   IV probit   RV
2      1      0.5574   0.6048    0.2469   0.2076      0.1462
1      0.05   0.4892   0.7919    0.3158   0.0073      0.0700
1      0.1    0.4782   0.7791    0.2974   0.0014      0.0253
1      0.25   0.4502   0.7420    0.2348   0.0074      0.0061
1      0.5    0.4343   0.6869    0.1245   0.0253      0.0012
1      1      0.5064   0.6045    0.0838   0.0926      0.0531
0.5    0.05   0.4256   0.7898    0.0745   0.0016      0.0011
0.5    0.1    0.4227   0.7769    0.0667   0.0031      0.0013
0.5    0.25   0.4203   0.7405    0.0444   0.0104      0.0076
0.5    0.5    0.4386   0.6862    0.0166   0.0298      0.0264
0.5    1      0.5130   0.6047    0.8209   0.0987      0.0919
0      0.05   0.3981   0.7702    0.0002   0.0296      0.0296
0      0.1    0.4026   0.7591    0.0009   0.0339      0.0339
0      0.25   0.4206   0.7275    0.0048   0.0482      0.0482
0      0.5    0.4595   0.6796    0.0207   0.0769      0.0769
0      1      0.5323   0.6051    1.0634   0.1540      0.1540
-0.5   0.05   0.3957   0.7464    0.0920   0.1047      0.0654
-0.5   0.1    0.4056   0.7374    0.0996   0.0964      0.0551
-0.5   0.25   0.4366   0.7115    0.1155   0.1119      0.0623
-0.5   0.5    0.4859   0.6713    0.1972   0.1435      0.0767
-0.5   1      0.5544   0.6056    1.1277   0.2194      0.1020
-1     0.05   0.4032   0.7260    0.3235   0.1862      0.0839
-1     0.1    0.4168   0.7187    0.3052   0.1617      0.0511
-1     0.25   0.4560   0.6974    0.2131   0.1746      0.0442
-1     0.5    0.5100   0.6638    0.4419   0.2045      0.0365

A7: Mean squared errors per estimator, N = 5000, ordered by level of correlation λ


B Matlab Code

function [ val ] = log_lik_p( y_true_binary, X, beta_hat )
val = -sum(y_true_binary.*log(normcdf(X*beta_hat)) ...
    + (1-y_true_binary).*log(1-normcdf(X*beta_hat)));
end

B1: Log-likelihood function for probit ML estimation

function [ val ] = log_lik_pRV( y_true_binary, X, beta_hat, S )
val = -sum(y_true_binary.*log(normcdf(X*beta_hat+S)) ...
    + (1-y_true_binary).*log(1-normcdf(X*beta_hat+S)));
end

B2: Log-likelihood function for the Rivers-Vuong approach, with the scaled first-stage residuals S added to the index


%%%%%% Script for the Monte Carlo simulations %%%%%%
rng(1);

sim = 1000;                             % number of simulations
N = 500;                                % number of observations
lambda = -0.5;                          % severity of error correlation
theta = 0.1;                            % instrument strength theta

mu = [0 0]; sigma = [1 0.5; 0.5 1];     % sigma: var = 1, cov = 0.5
gamma = theta*[1; -0.5; -1];            % parameters for the instruments
beta = [0; -0.8; 1];                    % true parameter values

Z_1 = zeros(N,sim);    W = zeros(N,sim);
eta = zeros(N,sim);    xi = zeros(N,sim);
eps = zeros(N,sim);    X_1 = zeros(N,sim);
y_true = zeros(N,sim); X = zeros(N,3,sim);
Z = zeros(N,3,sim);    y_binary = zeros(N,sim);
EX = zeros(N,2,sim);

for j = 1:sim                           % creating errors and (in)dependent variables
    EX(:,:,j) = mvnrnd(mu,sigma,N);
    Z_1(:,j) = EX(:,1,j); W(:,j) = EX(:,2,j);
    eta(:,j) = randn(N,1); xi(:,j) = randn(N,1);
    eps(:,j) = lambda*eta(:,j) + xi(:,j);               % equation (16)
    for i = 1:N
        X_1(i,j) = (gamma(1) + gamma(2)*(beta(1) + beta(3)*Z_1(i,j) + eps(i,j)) ...
            + gamma(3)*W(i,j) + eta(i,j)) / (1 - gamma(2)*beta(2));   % equation (17)
        y_true(i,j) = beta(1) + beta(2)*X_1(i,j) + beta(3)*Z_1(i,j) + eps(i,j);
        if y_true(i,j) > 0
            y_binary(i,j) = 1;
        else
            y_binary(i,j) = 0;
        end
    end
    X(:,:,j) = [ones(N,1), X_1(:,j), Z_1(:,j)];
    Z(:,:,j) = [ones(N,1), Z_1(:,j), W(:,j)];
end

%%% METHODS: LPM/IV-TSLS %%%
beta_LPMsum = zeros(length(beta),sim);
beta_IVsum = zeros(length(beta),sim);
for i = 1:sim
    beta_LPMsum(:,i) = inv(X(:,:,i)'*X(:,:,i))*X(:,:,i)'*y_binary(:,i);  % OLS
    beta_IVsum(:,i) = inv(Z(:,:,i)'*Z(:,:,i))*Z(:,:,i)'*y_binary(:,i);   % OLS of y on Z
end
beta_LPM = sum(beta_LPMsum')'/sim;      % average estimates over all simulations
beta_IV = sum(beta_IVsum')'/sim;

%%% METHODS: PROBIT MLE %%%
beta_MLprobitsum = zeros(length(beta),sim);
options = optimset('Display','off','Maxiter',10000,'TolX',10^-30,'Tolfun',10^-30);
beta_hat0 = [1;0;-1];                   % starting values
for i = 1:sim
    log_lik_p_fh = @(beta_hat) log_lik_p(y_binary(:,i), X(:,:,i), beta_hat);
    beta_MLprobitsum(:,i) = fminunc(log_lik_p_fh, beta_hat0, options);
end
beta_MLprobit = sum(beta_MLprobitsum')'/sim;

%%% METHODS: IVPROBIT MLE %%%
gamma_OLS = zeros(length(gamma),sim);
X_1hat = zeros(N,sim); X_tilde = zeros(N,3,sim);
beta_IVprobitsum = zeros(length(beta),sim);
for i = 1:sim
    gamma_OLS(:,i) = inv(Z(:,:,i)'*Z(:,:,i))*Z(:,:,i)'*X_1(:,i);  % stage 1, eq (19)
    X_1hat(:,i) = Z(:,:,i)*gamma_OLS(:,i);
    X_tilde(:,:,i) = X(:,:,i);
    X_tilde(:,2,i) = X_1hat(:,i);       % replace endogenous regressor by fitted values
    log_lik_p_fh = @(beta_hat) log_lik_p(y_binary(:,i), X_tilde(:,:,i), beta_hat);
    beta_IVprobitsum(:,i) = fminunc(log_lik_p_fh, beta_hat0, options);
end
beta_IVprobit = sum(beta_IVprobitsum')'/sim;

%%% METHODS: RV APPROACH %%%
eta_hat = zeros(N,sim); eta_hatlambda = zeros(N,sim);
beta_RVsum = zeros(length(beta),sim);
beta_hat0RV = [0; 0; 0];
for i = 1:sim
    eta_hat(:,i) = X_1(:,i) - Z(:,:,i)*gamma_OLS(:,i);  % first-stage residuals
    eta_hatlambda(:,i) = lambda*eta_hat(:,i);           % scaled residuals, eq (21)
    log_lik_pRV_fh = @(beta_hat) log_lik_pRV(y_binary(:,i), X(:,:,i), ...
        beta_hat, eta_hatlambda(:,i));
    beta_RVsum(:,i) = fminunc(log_lik_pRV_fh, beta_hat0RV, options);
end
beta_RV = sum(beta_RVsum')'/sim;

%%% RESULTS %%%
mse_LPM = mse(beta,beta_LPM);
mse_IV = mse(beta,beta_IV);
mse_probit = mse(beta,beta_MLprobit);
mse_IVprobit = mse(beta,beta_IVprobit);
mse_RV = mse(beta,beta_RV);
mse = [mse_LPM, mse_IV, mse_probit, mse_IVprobit, mse_RV]

B3: Script for the Monte Carlo simulations
