
Faculty of Economics and Business

Amsterdam School of Economics

Requirements thesis MSc in Econometrics.

1. The thesis should have the nature of a scientific paper. Consequently the thesis is divided up into a number of sections and contains references. An outline can be something like (this is an example for an empirical thesis; for a theoretical thesis have a look at a relevant paper from the literature):

(a) Front page (requirements see below)

(b) Statement of originality (compulsory, separate page)

(c) Introduction

(d) Theoretical background

(e) Model

(f) Data

(g) Empirical Analysis

(h) Conclusions

(i) References (compulsory)

If preferred you can change the number and order of the sections (but the order you use should be logical) and the headings of the sections. You have a free choice how to list your references, but be consistent. References in the text should contain the names of the authors and the year of publication, e.g. Heckman and McFadden (2013). In the case of three or more authors: list all names and the year of publication for the first reference, and use the first name with et al. and the year of publication for the other references. Provide page numbers.

2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that actually matters is that your supervisor agrees with your thesis.

3. The front page should contain:

(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty as in the heading of this document. This combination is provided on Blackboard (in MSc Econometrics Theses & Presentations).

(b) The title of the thesis

(c) Your name and student number

(d) Date of submission of the final version

(e) MSc in Econometrics

(f) Your track of the MSc in Econometrics

Master’s Thesis

Selection Bias:

A Machine Learning Approach

Tom Meurs

Student number: 10358951

Date of final version: November 30, 2018

Master's programme: Econometrics

Specialisation: Big Data Business Analytics

Supervisor: dhr. dr. N. P. A. van Giersbergen

Second reader: dr. J. C. M. van Ophem


Abstract

Selection bias is an important problem for inference in the social sciences (Puhani, 2000). Heckman (1977) introduced a two-step method to correct for selection bias. Although its assumptions were often challenged in empirical research, generalisations of Heckman's approach allow consistent estimates to be obtained under weaker restrictions (Cameron and Trivedi, 2005). Nonetheless, the emergence of the field of machine learning might improve current methods (Varian, 2014). In this thesis two new methods are explored. The first method is based on the semiparametric estimation method of Cosslett (1991), where the nonlinear bias function is approximated with dummies (COSS). These dummies are based on probit estimation of the decision equation. In the new method (COSSNN), Cosslett's dummies are based on estimation of the decision equation with a neural network. The second new method (2NN) is based on the outcome equation: the bias is estimated directly with a neural network. Both methods are expected to give unbiased results when regressors are nonlinearly misspecified. First, a simulation was performed and confirmed that a neural network has a higher AUC than probit when the data generating process is nonlinearly misspecified and the signal-to-noise ratio (SNR) is larger than 0.5. In the second step, COSS and COSSNN give similar estimates, while 2NN gives biased estimates under all analysed conditions. Afterwards, the methods were applied to an empirical dataset: positive health expenditure was estimated from the Medical Expenditure Panel Survey (MEPS, 2018). The first step is estimated more accurately with a neural network than with probit. For the second step, both COSSNN and 2NN give smaller AICs than OLS, Heckman's two-step method and COSS. Although the simulation results for 2NN are contradictory, COSSNN might be an improvement on current methods for solving selection bias. In addition, these results might inspire econometricians and computer scientists to combine current models and practices from both fields in new ways.


Statement of Originality

This document is written by Tom Meurs who declares to take full responsibility for the content of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the content.


Contents

1 Introduction
2 Theoretical Background
  2.1 Econometrics
    2.1.1 Heckman's method
    2.1.2 Semiparametric Techniques
  2.2 Machine Learning
  2.3 Final Remarks
3 The Model
  3.1 New Methods
    3.1.1 Cosslett's Neural Network
    3.1.2 Two-Stage Neural Network
  3.2 Monte Carlo Analysis
    3.2.1 Set-up
    3.2.2 First-Step Estimation
    3.2.3 Second-Step Estimation
  3.3 Conclusion
4 Empirical Data
  4.1 MEPS
  4.2 Set-up Empirical Analysis
5 Results
  5.1 First-Step Estimation
  5.2 Second-Step Estimation
6 Conclusion
Bibliography
A Third Method
B Regression Output
  B.1 TOTEXP
  B.2 TOTSLF


Chapter 1

Introduction

One of the big contributions of econometric methods to the social sciences is the notion of selection bias and its correction during analysis (Toomet and Henningsen, 2008). Since subjects can often decide for themselves whether, for example, they take a drug, attend a school, or want to work, a random sample of the population might be hard to obtain, sometimes even impossible. Therefore, it might be harder for researchers in the social sciences to report valid causal effects for the whole population, since the selective sample might bias the results of standard methods used for inference. To counter that bias, econometricians developed methods which can correct for this selection bias (Puhani, 2000).

Sample selection problems are defined as a Type II Tobit model (Puhani, 2000; Cameron and Trivedi, 2005), for which most solutions are based on Heckman (1977). At first, maximum likelihood (ML) was used, because the model is fully parametric and ML is therefore most efficient. During that period, the two-step estimator of Heckman was used for exploratory analysis, since computation of the ML estimator was intractable and expensive (Toomet et al., 2008). Later on, Heckman's method became increasingly popular, because certain extensions with semi- and nonparametric estimation make the model more flexible compared to ML (Hall, 2002) and no explicit bivariate distribution of the error terms needs to be specified (Puhani, 2000). A variety of these extensions of Heckman's two-step approach can be found in the literature (Cosslett, 1991; Cameron and Trivedi, 2005; Newey, 1999, 2009; Robinson, 1988).

These methods seem to work reasonably well, even when the assumption of bivariate normally distributed error terms is relaxed. Nevertheless, the emergence of the field of machine learning might improve current methods. Compared to econometrics, machine learning techniques are often more accurate in prediction (Varian, 2014). This might explain why machine learning is becoming so popular in both academia and business. Nonetheless, machine learning seems to be less suited for inference than econometrics (Varian, 2014; Mullainathan and Spiess, 2017). According to Mullainathan and Spiess (2017), this is because machine learning methods do not produce standard errors and their flexible way of fitting the data leads to identification problems: two functions with very different coefficients can produce similar prediction accuracy. So the way an algorithm chooses between variables, and therefore between models, can be just the result of a flip of a coin.

Nevertheless, it might be that machine learning techniques can improve a part of an econometric model and therefore improve the prediction of the overall model, without losing the ability to do inference. In this thesis, I want to investigate new methods for solving selection bias problems, which combine the predictive accuracy of machine learning methods with the inferential ability of econometric methods. This leads to the following question: can Heckman's two-step method be improved in terms of efficiency and/or accuracy by applying machine learning techniques?

First a simulation study will be performed, to optimize the new methods and to compare them on accuracy and inference with Heckman's two-step estimator and other semiparametric estimation methods. Subsequently, the methods will be tested empirically on the Medical Expenditure Panel Survey dataset (MEPS, 2018).

The remainder of the thesis is organized as follows. In Chapter 2 the selection bias problem is defined and a review is given of proposed solutions in econometric and machine learning literature. In Chapter 3 two models based on both econometric and machine learning tech-niques are introduced and optimized with a simulation study. In Chapter 4 the MEPS dataset is described, to which the models are then applied in Chapter 5. In Chapter 6 implications for further research in sample selection models will be given.


Chapter 2

Theoretical Background

In this chapter a theoretical background of sample selection problems is described. In the first section an econometric perspective is stated. In the second section the machine learning perspective is explained and both fields are compared. In the final section a brief summary of the most important points is presented.

2.1 Econometrics

2.1.1 Heckman’s method

The problem of sample selection bias has received a great deal of attention in econometrics. This is due to the use of surveys by many economic researchers. Often, people responded to a survey through a self-selection mechanism, so they did not represent a random sample of the population (Zadrozny, 2004). Therefore, inference was often a problem and solutions were postulated to correct for this selection bias (Heckman, 1977).

The structural process to describe this problem is the following (Puhani, 2000):

\[ y_{1i}^{*} = x_{1i}'\beta_1 + \epsilon_{1i} \tag{2.1} \]

\[ y_{2i}^{*} = x_{2i}'\beta_2 + \epsilon_{2i} \tag{2.2} \]

\[ y_{2i} = y_{2i}^{*} \quad \text{if } y_{1i}^{*} > 0 \tag{2.3} \]

\[ y_{2i} = 0 \quad \text{if } y_{1i}^{*} \leq 0, \tag{2.4} \]

where y*1i is the latent decision equation (2.1) and y*2i is the latent outcome equation (2.2). For y*1i ≤ 0 the decision is made not to participate, and a 0 is observed for y2i. So, based on x1i a decision is made to participate, and x2i then contains the explanatory variables for the outcome. x1i and x2i may or may not be equal. When taking the conditional expectation of (2.2), the observed dependence between y*2i and y*1i can be written as:

\[ E(y_{2i} \mid x_{2i}, y_{1i}^{*} > 0) = x_{2i}'\beta_2 + E(\epsilon_2 \mid \epsilon_1 > -x_{1i}'\beta_1). \tag{2.5} \]


Now applying OLS to (2.5) will give biased results, as E(ε2 | ε1 > −x'1iβ1) ≠ 0. Assuming the error terms follow a bivariate normal distribution:

\[ \begin{pmatrix} \epsilon_1 \\ \epsilon_2 \end{pmatrix} \sim N\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix} \right), \tag{2.6} \]

Heckman (1977) showed that the conditional mean can be rewritten as:

\[ E(y_{2i} \mid x_{2i}, y_{1i}^{*} > 0) = x_{2i}'\beta_2 + \rho\sigma_2\,\frac{\phi(-(x_{1i}'\beta_1/\sigma_1))}{1 - \Phi(-(x_{1i}'\beta_1/\sigma_1))}, \tag{2.7} \]

and proposed to estimate the so-called inverse Mill's ratio λ:

\[ \lambda(x_{1i}'\beta_1/\sigma_1) = \frac{\phi(-(x_{1i}'\beta_1/\sigma_1))}{1 - \Phi(-(x_{1i}'\beta_1/\sigma_1))}, \tag{2.8} \]

by using a probit on the decision equation (2.1) and then adding the inverse Mill's ratio based on the probit estimates to the outcome equation (2.2):

\[ y_{2i} = x_{2i}'\beta_2 + \rho\sigma_2\hat{\lambda} + \epsilon_{2i}^{H}, \tag{2.9} \]

where ε^H_2i ≡ ε2i − ρσ2λ̂ and λ̂ = λ(x'1iβ̂1/σ1). Now ρσ2 can be consistently estimated by OLS. So, Heckman (1977) showed that this selection bias can be seen as an omitted variable problem. However, as noted in the introduction, the model is fully parametric, and thus maximum likelihood (ML) would also be possible (Amemiya, 1985):

\[ L = \prod_{y_2 = 0} \left[ 1 - \Phi\!\left( \frac{x_1'\beta_1}{\sigma_1} \right) \right] \prod_{y_2 > 0} \Phi\!\left( \frac{x_1'\beta_1 + \frac{\sigma_{12}}{\sigma_2^2}(y_2 - x_2'\beta_2)}{\sqrt{\sigma_1^2 - \frac{\sigma_{12}^2}{\sigma_2^2}}} \right) \frac{1}{\sigma_2}\,\phi\!\left( \frac{y_2 - x_2'\beta_2}{\sigma_2} \right), \tag{2.10} \]

where the first part is the likelihood of not participating based on the probit model (2.1), and the second part is the likelihood of a Tobit model (2.2). Note that here only the ratio β1/σ1 can be estimated (Amemiya, 1985). The ML approach is still less popular than Heckman's approach, because the bivariate distribution of the error terms must be fully specified and the likelihood is often not globally concave (Hall, 2002).
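To make the two-step procedure concrete, the sketch below works through (2.8) and (2.9) by hand in R; the data frame dat and the variable names d, y, x1, x2 and z1 are illustrative and not taken from the thesis.

# Minimal sketch of Heckman's two-step estimator (illustrative variable names).
# Step 1: probit for the participation decision d on the selection regressors.
probit <- glm(d ~ x1 + z1, family = binomial(link = "probit"), data = dat)

# Inverse Mills ratio evaluated at the estimated probit index, as in (2.8).
xb1_hat <- predict(probit, type = "link")          # x1'beta1_hat
lambda  <- dnorm(xb1_hat) / pnorm(xb1_hat)         # phi(.) / Phi(.)

# Step 2: OLS on the selected observations with lambda added, as in (2.9).
step2 <- lm(y ~ x1 + x2 + lambda, data = dat, subset = d == 1)
summary(step2)

# For comparison, the sampleSelection package used later in the thesis provides heckit():
# library(sampleSelection)
# heckit(d ~ x1 + z1, y ~ x1 + x2, data = dat)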

2.1.2 Semiparametric Techniques

This model of Heckman was introduced in the 1970s and 1980s. Nevertheless, two assumptions were often challenged in practice: the need for valid instruments and distributional assumptions (Puhani, 2000). During recent decades, various semiparametric estimation techniques have been increasingly used in addition to the Heckman model (Cosslett, 1991; Toomet et al., 2008; Cameron and Trivedi, 2005; Newey, 1999, 2009; Robinson, 1988; Powell, 1986). Among the most popular (Cameron and Trivedi, 2005) are the methods of Newey (1999), Cosslett (1991), Gallant and Nychka (1987) and Robinson (1988).

Newey (1999) modified Heckman's second step by approximating the bias correction not with the inverse Mill's ratio, but with known, smooth functions ρk(·):

\[ y_{2i} = x_{2i}'\beta_2 + \sum_{k=1}^{K} \eta_k \rho_k(x_{1i}'\hat{\beta}_1) + \epsilon_{2i}^{N}, \tag{2.11} \]

with ηk parameters and ε^N_2i ≡ ε2i − Σk ηk ρk(x'1iβ̂1). Smooth functions used by Hussinger (2008) are ρk(x'1iβ̂1) = τ(x'1iβ̂1)^(k−1) with τ(x'1iβ̂1) = 2Φ(x'1iβ̂1) − 1. K is the degree of the polynomial used to estimate the correction term. Newey (1999) shows that this method gives consistent estimators, as long as the degree of the polynomial increases with the number of observations and β̂1 is consistently estimated.
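As an illustration, Newey's correction with Hussinger's (2008) choice of smooth functions can be constructed in a few lines of R; it is assumed here that xb1_hat holds the estimated probit index x'1iβ̂1 and that y, x1 and x2 refer to the selected observations.

# Sketch of Newey's (1999) series correction with tau = 2*Phi(x1'b1_hat) - 1.
K   <- 3                                           # order of the series, chosen by the user
tau <- 2 * pnorm(xb1_hat) - 1
P   <- sapply(1:K, function(k) tau^k)              # tau, tau^2, ..., tau^K
colnames(P) <- paste0("rho", 1:K)                  # (the constant tau^0 is absorbed by the intercept)

newey_fit <- lm(y ~ x1 + x2 + P)                   # (2.11): OLS with the series terms added
summary(newey_fit)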

Cosslett (1991), like Newey, adjusts the second step in Heckman's method. Instead of polynomials, dummies are added. On the basis of the probit estimates¹ β̂1, the outcomes are sorted and grouped, and M dummies Dm are formed and added to (2.2):

\[ y_{2i} = x_{2i}'\beta_2 + \sum_{m=1}^{M} b_m D_{im}(x_{1i}'\hat{\beta}_1) + \epsilon_{2i}^{C}, \tag{2.12} \]

where ε^C_2i ≡ ε2i − Σm bm Dim(x'1iβ̂1). Cosslett (1991) shows that this method gives consistent estimators, as long as the number of dummies increases with the number of observations and β̂1 is consistently estimated.
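A sketch of Cosslett's step under this simplification: the selected observations are grouped on the estimated probit index into M roughly equal-sized intervals, and the group dummies are added to the OLS. The quantile-based grouping rule is an assumption of this sketch.

# Sketch of Cosslett's (1991) correction with dummies built from the probit index.
M      <- 25
breaks <- quantile(xb1_hat, probs = seq(0, 1, length.out = M + 1))
groups <- cut(xb1_hat, breaks = unique(breaks), include.lowest = TRUE)

cosslett_fit <- lm(y ~ x1 + x2 + groups)           # (2.12): the dummies absorb the bias term
summary(cosslett_fit)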

Gallant and Nychka (1987) show that certain conditional bivariate distributions can be approximated by a Hermite series. This can be used to correct for the estimation bias through the maximum likelihood (2.10), instead of Heckman's two-step estimator. Rewriting (2.10) and adding a binary variable zi, which is 0 if y2i is observed and 1 otherwise, the following integrals can be obtained (Van Der Klaauw and Koning, 2003):

\[ L = \prod_{i=1}^{n} \left( \int_{-x_{1i}'\beta_1}^{\infty} f(y_i - x_{1i}'\beta_1, \epsilon_1)\, d\epsilon_1 \right)^{z_i} \left( \int_{-\infty}^{-x_{2i}'\beta_2} \int_{-\infty}^{\infty} f(\epsilon_1, \epsilon_2)\, d\epsilon_1\, d\epsilon_2 \right)^{1-z_i}. \tag{2.13} \]

Now the conditional bivariate distribution of the error terms can be approximated by a Hermite series, to estimate the parameters in the ML (2.13):

\[ h(\epsilon) = P_K^2(\epsilon)\, \phi_2(\epsilon \mid \Sigma). \tag{2.14} \]

Here PK(ε) is a polynomial of degree K and φ2(ε | Σ) is the bivariate normal density function with mean 0 and covariance matrix Σ. The resulting h(·) belongs to the class HK, whose properties are described in Gallant and Nychka (1987); in general all important distributions, like the chi-squared and t distribution, are included. Although this method uses the fewest assumptions of the methods discussed, it is not often applied in practice, since it is rather complicated.

Meurs (2016) analysed the methods based on Cosslett (1991), Newey (1999) and Gallant and Nychka (1987) in his bachelor's thesis and concluded that Cosslett's method often gave more concentrated estimators than the other methods for various distributions.

¹ Although Cosslett does not use probit estimates in his own paper, we use Hussinger's (2008) simplification.

Another popular method is the use of Robinson's (1988) difference estimator (Cameron and Trivedi, 2005). Taking the conditional expectation in (2.2) yields:

\[ E(y_{2i} \mid y_{1i}^{*} > 0) = E(x_{2i}'\beta_2 \mid y_{1i}^{*} > 0) + E(\epsilon_2 \mid \epsilon_1 > -x_{1i}'\beta_1), \tag{2.15} \]

where Ê(y2i | y*1i > 0) and Ê(x2i | y*1i > 0) are calculated in a first step, and subtracting (2.15) from (2.3) yields:

\[ y_{2i} - \hat{E}(y_{2i} \mid y_{1i}^{*} > 0) = \left( x_{2i} - \hat{E}(x_{2i} \mid y_{1i}^{*} > 0) \right)'\beta_2 + \epsilon_2 - \hat{E}(\epsilon_2 \mid \epsilon_1 > -x_{1i}'\beta_1), \tag{2.16} \]

for which it is assumed that ε2 − Ê(ε2 | ε1 > −x'1iβ1) is independently and identically normally distributed, so that OLS in (2.16) gives consistent estimators for β2.
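The difference estimator can be sketched as follows for a single outcome regressor, using loess on the estimated probit index xb1_hat to form the conditional means; both the smoother and the use of the index as conditioning variable are choices of this sketch, not prescriptions from Robinson (1988).

# Sketch of Robinson's (1988) difference estimator on the selected sample.
m_y  <- fitted(loess(y  ~ xb1_hat))                # E_hat(y2 | selected)
m_x2 <- fitted(loess(x2 ~ xb1_hat))                # E_hat(x2 | selected)

# (2.16): OLS on the demeaned variables, without an intercept.
robinson_fit <- lm(I(y - m_y) ~ 0 + I(x2 - m_x2))
summary(robinson_fit)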

Although these methods seem to work reasonably well, they still face problems. Cosslett (1991) requires that at least one of the regressors in x1 is not included in x2 (Powell, 1994), while Robinson (1988) requires that x1 and x2 have no regressors in common (Leung and Yu, 1996). So there is still room for improvement, and it might be interesting to look at methods from the field of machine learning and at how that field has dealt with sample selection.

2.2 Machine Learning

Machine learning aims to find patterns in datasets (Bishop, 2006): it is the automatic discovery of regularities in data through the use of computer algorithms. Since the last century computing power has increased and the field has developed exponentially. Nowadays, machine learning seems applicable to a wide range of problems, from recognizing digits to analysing texts. Popular techniques are classification and regression trees (CART), random forests, neural networks and support vector machines (Varian, 2014).

Although econometrics generally focuses more on inference, machine learning is more concerned with prediction. Since prediction is highly influenced by the datasets used as input, overfitting is a typical problem (Bishop, 2006). As a result, common practice in machine learning is to split the dataset into a training and a test set. Subsequently, the training set is used to optimize the model and the test set is used to evaluate out-of-sample performance. For sample selection in machine learning, the problem is often defined as follows (Zadrozny, 2004): let examples (x, y, s) be independently drawn from a distribution D with domain X × Y × S, where X is the feature space, Y is a (discrete) label space and S is a binary space. The variable s controls the selection of examples, where 1 means the example is selected and 0 means it is not. The following four cases can be considered from this problem definition (Zadrozny, 2004):

1. If s is independent of x and y, the selected sample is not biased and is randomly selected from D. This is the same as ρ = 0 in sample selection problems (2.6).

2. If s is independent of y given x, the selected sample is biased, but the biasedness depends only on the feature vector x.

3. If s is independent of x given y, the selected sample is biased, but the biasedness depends only on the label y. This corresponds to a change in the prior probabilities of the labels.

4. If no independence assumption holds between x, y and s, the selected sample is biased and we cannot hope to learn a mapping from features to labels using the selected sample, unless we have access to an additional feature vector xs that controls the selection.

In machine learning, cases 2 and 3 are most often considered (Zadrozny, 2004; Elkan, 2001; Bishop, 1995). This area of study is called transfer learning (Dai et al., 2009); Pan and Yang (2010) published a thorough review of transfer learning. In econometrics, however, the Type II Tobit model does not include specific independence assumptions between the decision and outcome variables, which corresponds to case 4. Since the problem definitions and thus the assumptions differ, the methods of machine learning and econometrics differ as well.

2.3 Final Remarks

Historically, economists have dealt with relatively small amounts of data, but that is changing as more detailed data becomes available (Einav et al., 2013). Nowadays we often have several gigabytes of data, or several million observations. So the use of new algorithms from the field of machine learning, like support vector machines, neural networks and random forests, has become popular (Mullainathan and Spiess, 2017), because they can efficiently find complex patterns in large datasets.

Nevertheless, machine learning seems to be primarily concerned with prediction (Varian, 2014). In contrast, econometricians, statisticians, and data mining specialists are generally looking for insights that can be extracted from the data. Consequently, we have seen differences in approaching selection bias in the previous sections: econometricians often use the linear regression model to acquire insights from the data, so sample bias is corrected in such a way that linear regression remains possible. The machine learning solution to selection problems is more data-driven and focused on prediction; methods involve collecting large amounts of new data and merging those with existing (labeled) datasets to improve prediction (Zadrozny, 2004; Dai et al., 2009). Not surprisingly, the assumptions in most sample selection problems and methods differ substantially between econometrics and machine learning.

Although this sharp contrast between machine learning and econometrics has not led to fruitful collaboration between the two fields so far, I propose new improvements on sample selection models by using machine learning techniques. The idea is to use the prediction capabilities of machine learning techniques to improve a part of the sample selection model, without losing the ability to do valid inference. In the following chapter I describe, analyse, and optimize these new methods.


Chapter 3

The Model

In this chapter two new methods for solving sample selection bias are introduced and tested. In Section 3.1 the models are described. In Section 3.2 they are analysed under different assumptions, such as the number of parameters and different kinds of distributions. In Section 3.3 the implications of the results for the models are discussed. A third method, which estimates the complete Tobit Type II model with a neural network, was eventually not included in this part of the thesis; in Appendix A that method is explained and discussed.

3.1 New Methods

The new models are based on the conditional expectation of (2.2):

\[ E(y_{2i}^{*} \mid x_{2i}, y_{1i}^{*} > 0) = x_{2i}'\beta_2 + E(\epsilon_2 \mid \epsilon_1 > -x_{1i}'\beta_1). \tag{3.1} \]

An alternative way to express this is to write the last term as a nonlinear function g(·) of x1i and β1 (Cameron and Trivedi, 2005):

\[ E(y_{2i}^{*} \mid x_{2i}, y_{1i}^{*} > 0) = x_{2i}'\beta_2 + g(x_{1i}, \beta_1), \tag{3.2} \]

then g(x1i, β1) can be nonparametrically approximated, as explained in Chapter 2. In the first method, like Cosslett's method, dummies are used to approximate the nonlinear function g(·). However, in contrast with Cosslett's method, the dummies are not constructed from probit estimates of the decision equation (2.1), but from neural network estimates. In the second method, g(x1i, β1) is directly approximated with a neural network.

3.1.1 Cosslett’s Neural Network

The first method is based on the semiparametric estimation method of Cosslett (1991). The decision equation (2.1) is estimated with probit and subsequently dummies are constructed and added to the outcome equation (2.2; see Section 2.1.2). In the new method, the decision equation is estimated with a neural network, dummies are constructed and added to the outcome equation:

\[ y_{2i} = x_{2i}'\beta_2 + \sum_{m=1}^{M} b_m D_{im}\big(\phi(x_{1i}, \hat{w}_k)\big) + \epsilon_{2i}^{C}, \tag{3.3} \]

with φ(·) the neural network, with input x1i and k weights wk, estimated for the binary decision to participate or not (2.1). So the dummies Dim are based on estimates of the neural network. Under the assumption that a neural network might predict the decision more accurately than probit, the dummies might better approximate the true g(·) and therefore improve the consistency of the estimated parameters β2. This assumption will be tested in the next section.

A neural network is defined as a network of nonlinear functions (Bishop, 1995). These nonlinear functions calculate an output variable given some input and weights. When the output variable is known, the weights can be optimized such that they minimize an error function. For a binary decision task, the cross-entropy error function is used (Bishop, 2006):

\[ E = -\sum_{i=1}^{n} \big\{ t_i \ln(y_{1i}) + (1 - t_i)\ln(1 - y_{1i}) \big\}, \tag{3.4} \]

where ti is the observed decision to participate: ti is 1 if y2i of (2.3) is observed and 0 otherwise. y1i = y1i(xi, wk) denotes the predicted probability of participation given input x1i and weights wk. For regression, a commonly used error function is the sum of squared errors. For a more detailed explanation of neural networks, their parameters and performance, see numerous textbooks and articles (Abraham, 2005; Yegnanarayana, 2009; Bishop, 2006; Zeng, 1999).

3.1.2 Two-Stage Neural Network

The second method approximates g(·) directly with a neural network. Since neural networks have the ability to find any kind of pattern (Mullainathan and Spiess, 2017), they might approximate g(·) reasonably well given x1i. First, OLS is performed on the outcome equation (2.3):

\[ y_{2i} = x_{2i}'\hat{\beta}_2 + \hat{\epsilon}_{2i}. \tag{3.5} \]

As seen in Section 2.1, these results will be biased, since in (2.5) E(ε2 | ε1 > −x'1iβ1) ≠ 0. From this OLS the residuals ε̂2i are taken, and a neural network φ(·) estimates ε̂2i based on x1 and weights wk:

\[ \hat{\epsilon}_{2i} = \phi(x_{1i}, w_k) + \xi_{2i}, \tag{3.6} \]

where ξ2i is the error or misfit between the neural network estimate and ε̂2i. In the next step, the neural network with estimated weights φ(x1i, ŵk) is added to the original OLS equation (2.3):

\[ y_{2i} = x_{2i}'\beta_2 + \phi(x_{1i}, \hat{w}_k) + \xi_{2i}. \tag{3.7} \]

This equation is similar to the equation derived by Amemiya (1985, p. 386) for generalized Type II Tobit models:

\[ y_{2i} = x_{2i}'\beta_2 + \sigma_{12}\sigma_1^{-2}\lambda(x_{1i}, \beta_1) + \xi_{2i}, \tag{3.8} \]

where the inverse Mill's ratio λ(x1i, β1), the covariance σ12 and the variance σ1² are used to approximate the nonlinear function g(·). Amemiya (1985) argues that β2 can be consistently estimated, as long as y*1i is normally distributed and ξ2i is distributed independently of y*1i. In (3.7), φ(x1i, ŵk) approximates σ12σ1⁻²λ(x1i, β1), but without the inverse Mill's ratio, which depends on normality of y*1i. Consequently, it is assumed that β2 is reasonably approximated if ξ2i is distributed independently of x1i.
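A compact sketch of the 2NN procedure with the keras interface used in the thesis; X1 (decision-equation regressors) and X2 (outcome-equation regressors) are assumed to be numeric matrices for the selected observations, and the layer sizes follow the settings given later in Section 3.2.1.

library(keras)

# Step 1: OLS on the outcome equation (3.5) and its residuals.
ols   <- lm(y2 ~ X2)
e_hat <- residuals(ols)

# Step 2: approximate the residuals as a nonlinear function of x1 with a small network (3.6).
nn <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "tanh", input_shape = ncol(X1),
              kernel_regularizer = regularizer_l2(0.05)) %>%
  layer_dense(units = 1, activation = "tanh")
nn %>% compile(loss = "mse", optimizer = "sgd", metrics = "mae")
nn %>% fit(X1, e_hat, epochs = 30, batch_size = 130, validation_split = 0.2)

# Step 3: add the fitted nonlinear term to the outcome equation (3.7) and re-estimate by OLS.
g_hat   <- as.numeric(predict(nn, X1))
fit_2nn <- lm(y2 ~ X2 + g_hat)
summary(fit_2nn)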

In the next section these two methods are tested and compared with older semiparametric estimation techniques.

3.2 Monte Carlo Analysis

This section is in four parts. Firstly, the settings for the simulation study are described. Secondly, it is investigated under which conditions a neural network performs better than a probit at the binary prediction of whether a person participates or not. In the third part the new methods are compared to Heckman's model and other semiparametric estimation techniques, for a variety of conditions. In the final part implications for the analysis of the empirical dataset are presented.

3.2.1 Set-up

This simulation study assesses the performance of the models before applying them to the MEPS (2018) dataset. Therefore, the settings of the data-generating process (DGP) should mirror medical expenditure datasets. Although a thorough review of the dataset is given in the next chapter, some of its characteristics are presented here.

In the MEPS (2018) dataset, we wish to predict medical expenditure, which is zero for around 21% of the observations. Furthermore, different signal-to-noise ratios (SNRs) have been used in previous simulation studies of medical expenditure. Malehi, Pourmothahari and Angali (2015) used SNRs of 1 to 2. Lu, Zhou, Naylor and Kirkpatrick (2017) used SNRs of 1, 3 and 5. Basu and Manning (2009) reviewed the literature on estimated medical expenditures and concluded that most models have an R² of 30%, which corresponds approximately to an SNR of 0.45.

From this, the following models and parameters are initially used for the Monte Carlo simulation:

\[ y_1^{*} = \alpha_0 + \alpha_1 x + \alpha_2 z + u \tag{3.9} \]

\[ y_2^{*} = \beta_0 + \beta_1 x + \beta_2 q + v \tag{3.10} \]

\[ y_2 = \begin{cases} y_2^{*} & \text{for } y_1^{*} > 0 \\ 0 & \text{for } y_1^{*} < 0, \end{cases} \tag{3.11} \]

with x, z and q vectors of N random numbers from a N(0, 1) distribution. The sample size N is 10000 in the estimation of the first step, and 1000 in the second step, due to computation time. α1 = √0.5, α2 = √0.5, β1 = 2 and β2 = 1, so that the SNR of (3.9) is 1 and that of (3.10) is 3.33. α0 is the 0.2 quantile of (3.9), so that 80% of the data is above 0. The error terms u and v are correlated by:

\[ v = \rho u + \varepsilon, \tag{3.12} \]

with ε independently normally distributed, ρ the correlation, and the error terms u and v standardized so that they have mean 0 and variance 1.
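A sketch of the baseline DGP in R; the rule used here for α0 (minus the 0.2 quantile of the right-hand side, so that 80% of y*1 is positive) is one reading of the description above, and β0 is simply set to 0 in this sketch.

# Sketch of the baseline DGP (3.9)-(3.12).
set.seed(1)
N   <- 10000
rho <- 0.7

x <- rnorm(N); z <- rnorm(N); q <- rnorm(N)
u <- rnorm(N)
v <- rho * u + rnorm(N)
u <- as.numeric(scale(u)); v <- as.numeric(scale(v))   # mean 0, variance 1, as in (3.12)

signal1 <- sqrt(0.5) * x + sqrt(0.5) * z               # SNR of the decision equation is 1
alpha0  <- -quantile(signal1 + u, 0.2)                 # 80% of y1* ends up above 0
y1_star <- alpha0 + signal1 + u
y2_star <- 2 * x + 1 * q + v                           # beta1 = 2, beta2 = 1, beta0 = 0

y1 <- as.integer(y1_star > 0)                          # participation indicator
y2 <- ifelse(y1 == 1, y2_star, 0)                      # observed outcome, as in (3.11)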

Regarding the hyperparameters of the neural network: for both methods with a neural network there are 3 layers: 5 input nodes, 8 hidden nodes in the second layer and 1 output node. Input and hidden layers have tanh activation functions. For Cosslett's Neural Network (COSSNN) a sigmoid was used as output function, and tanh for the Two-Stage Neural Network (2NN). By trial and error, 30 epochs seemed to be a good trade-off between computation time and acceptable accuracy, since the cost function did not seem to change much when more epochs were added. A validation split of 0.2 was used, and a batch size of 130. The loss function for COSSNN was binary cross-entropy, with Adam's stochastic gradient descent (Kingma and Ba, 2014) as optimizer and accuracy as metric. For 2NN the loss function was mean squared error, the optimizer stochastic gradient descent and the metric mean absolute error. To prevent overfitting, an L2-regularizer is added. Initially it was set to 0.05 by hand, obtained by trial and error, but in the next section it will also be compared with a value selected by cross-validation. Furthermore, the built-in validation split function, set to 0.2, is used to further reduce overfitting; in this case that means the last 20% of the data was used to validate the neural network on accuracy in every epoch.
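Written out with the keras interface, the COSSNN first-step network with the hyperparameters listed above looks as follows; X1 is assumed to be a numeric matrix with the five input variables and y1 the 0/1 participation indicator.

library(keras)

# First-step classifier for COSSNN: 5 inputs, 8 tanh hidden units, sigmoid output,
# L2 penalty of 0.05, binary cross-entropy loss and the Adam optimizer.
net <- keras_model_sequential() %>%
  layer_dense(units = 8, activation = "tanh", input_shape = 5,
              kernel_regularizer = regularizer_l2(0.05)) %>%
  layer_dense(units = 1, activation = "sigmoid")

net %>% compile(loss = "binary_crossentropy", optimizer = "adam", metrics = "accuracy")
net %>% fit(X1, y1, epochs = 30, batch_size = 130, validation_split = 0.2)

# Predicted participation probabilities, later cut into M = 25 dummy groups as in (3.3).
p_hat <- as.numeric(predict(net, X1))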

For both Cosslett-type methods the number of dummies was set to M = 25 (Hussinger, 2008). Analyses were done with R 3.4.3 on a computer with an i7-8700 CPU and 16 GB RAM. For Heckman's method the R package sampleSelection was used. The neural networks were built with the Python library Keras from R, using the Anaconda environment.

3.2.2 First-Step Estimation

For the neural network to perform better than the probit in Cosslett's model, we first consider under which conditions the neural network predicts participation better than a probit. Since the probit is estimated with maximum likelihood, it is the most efficient and consistent estimator under correct specification of the data generating process (Cameron and Trivedi, 2005). Therefore, we introduce misspecification in two ways: by changing the distribution of u and by making the relationship between the regressors and y*1 nonlinear.

For the distribution of u, three alternatives to the normal are considered: a centered chi-squared distribution, a t distribution and a skewed t distribution. The chi-squared distribution is asymmetric, the t distribution has fatter tails than the normal distribution, and the skewed t distribution combines the two, as proposed by Hansen (1994). For the t and skewed t distributions 5 degrees of freedom were used, for the chi-squared 3. All distributions are standardized to have mean 0 and standard deviation 1. Nonlinearity is introduced by changing (3.9) to:

\[ y_1^{*} = \alpha_0 + \alpha_1 x^2 + \alpha_2 z^2 + u \tag{3.13} \]

\[ y_1^{*} = \alpha_0 + \alpha_1 x^2 + \alpha_2 z^2 + \alpha_3 xz + u, \tag{3.14} \]

with x and z from a N(0, 1) distribution, α1 = √0.5, α2 = √0.5, α3 = 0.5, and the variance of u changed in such a way that the SNR remains 1 for both (3.13) and (3.14). See Figures 3.1a and 3.1b for an illustration of the relationship between the regressors and y*1 in (3.9) and (3.14), respectively. y1 is set to 0 if y*1 < 0 and 1 if y*1 > 0. The sample size is N = 10000, split into a training set of 8000 and a test set of 2000. Probit and neural network predictions are compared by the area under the ROC curve (AUC), which is a common metric to compare models on a binary prediction task (Hanley and McNeil, 1983). Results are presented in Table 3.1.
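The AUC comparison can be reproduced along the following lines; the pROC package is used here for the AUC (an assumption of this sketch, the thesis does not name a package), and x, z and y1 are the simulated vectors from the DGP sketch above.

library(pROC)

# Compare probit and neural network on the 2000 held-out observations by AUC.
train <- 1:8000
test  <- 8001:10000

probit   <- glm(y1 ~ x + z, family = binomial(link = "probit"), subset = train)
p_probit <- predict(probit, newdata = data.frame(x = x[test], z = z[test]), type = "response")
auc_probit <- auc(roc(y1[test], p_probit))

# p_nn would come from a network fit on the training rows, as in the earlier keras sketch:
# auc_nn <- auc(roc(y1[test], p_nn))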

For the linear case both methods score well with error terms drawn from the different distributions, with AUCs around 0.9. However, for the nonlinear case with and without cross products, the neural network has higher AUC scores than probit. To see whether this also holds for other kinds of nonlinear relationships, the following equations are tested as well:

\[ y_1^{*} = \alpha_0 - (\alpha_1 x^2 + \alpha_2 z^2 + \alpha_3 xz) + u \tag{3.15} \]

\[ y_1^{*} = \alpha_0 + \alpha_1 e^{x} + \alpha_2 e^{z} + u\sqrt{e(e-1)}. \tag{3.16} \]

(3.15) has a downward parabola relationship, which might change the results since the quadratic relationship below the threshold y*1 = 0 might be harder to estimate (Figure 3.1c). In (3.16) the regressors follow a lognormal distribution (Figures 3.1d and 3.1e); u is multiplied by √(e(e − 1)) to keep the SNR at 1. Results are presented in Table 3.2.

For a downward parabola relationship the neural network has higher AUCs than for an upward parabola relationship. For probit the AUC is around 0.5, which is equivalent to a flip of a coin. The lognormal relationship was, however, reasonably estimated by both probit and the neural network, with AUCs between 0.700 and 0.800 for the different distributions and no large difference between the two. Adding a cross term α3xz to the lognormal equation (3.16) with bivariate normally distributed error terms did not change the result much: for probit the AUC was 0.681 and for the neural network 0.679. So the lognormal case might be relatively insensitive to other nonlinear relationships in the equation. The relatively good performance of lognormal regressors for both probit and the neural network can be explained by looking at the Taylor series of f(x) = e^x around the point a = 0:

\[ e^{x} \approx \frac{x^0}{0!} + \frac{x^1}{1!} + \frac{x^2}{2!} + \ldots = \sum_{k=0}^{\infty} \frac{x^k}{k!}. \tag{3.17} \]

The Taylor series shows that the lognormal regressors can be reasonably approximated with linear terms or with squared terms around the point a = 0. Since x has mean 0, we would indeed expect that probit with linear terms reasonably approximates e^x. The neural network would then be slightly more accurate than the probit, since its capability to find nonlinear relationships would make the approximation of e^x quadratic. So, the Taylor series explains the observed AUC results for both probit and the neural network when estimating lognormal regressors.

According to Richards and Doyle (2000) a neural network in a binary prediction task is sensitive to the SNR: a lower SNR significantly reduces the accuracy of the neural network. So far, we have used an SNR of 1, or equivalently an R² of 50%. As stated earlier, for models estimating medical expenditure the R² is around 30%, which is approximately an SNR of 0.43. So it is interesting to test the predictive ability of the neural network for lower SNRs. Starting from (3.14), the SNR is dropped to 0.5 and 0.25 by changing the variance of u. Results are presented in Table 3.3. An SNR of 0.5 indeed shows lower AUCs for the neural network than an SNR of 1, and an SNR of 0.25 shows lower AUCs than both an SNR of 0.5 and 1. From this it follows that lowering the SNR lowers the accuracy of the neural network. However, the AUC was generally still larger than that of probit in all three cases. For an SNR of 0.43 in the health expenditure dataset, the neural network is therefore still expected to outperform probit when nonlinearities are present.

It is also important to consider overfitting when using the neural network. So far, two penalties were used: the L2-regularizer and a validation split. These penalties prevent overfitting, but result in bias, so changing them might give a better trade-off between overfitting and bias. The L2-regularizer was initially set to 0.05, but different values can be considered by means of a grid search. In addition, a built-in feature of Keras called validation split, set to 0.2, was used when tuning the model; it means the last 20% of the input is used to validate the model. Another common approach is cross-validation, where validation is done by splitting the input into small samples, and the accuracy of the model with each sample as validation set is calculated and averaged. In this case we use stratified cross-validation, since it takes the imbalance in the dataset into account when selecting the small samples. The results of using cross-validation or a validation split, for normally distributed error terms and the upward parabola relationship (3.14), are presented in Table 3.4 and illustrated in Figure 3.1f. The differences between validation split and cross-validation are very small for most values of the L2-regularizer. Furthermore, the AUC did not differ much across values of the L2-regularizer as long as it remained below 0.20.
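The comparison in Table 3.4 amounts to a small grid search; a sketch of the validation-split variant is given below (the stratified cross-validation variant repeats the same fit over stratified folds). X1_train, y1_train, X1_test and y1_test are assumed to exist, and keras and pROC to be loaded as in the earlier sketches.

# Grid search over the L2 penalty, evaluating the AUC on the held-out test set.
l2_grid <- c(0, 0.001, 0.005, 0.01, 0.05, 0.10, 0.15, 0.20, 0.50)

auc_by_l2 <- sapply(l2_grid, function(l2) {
  net <- keras_model_sequential() %>%
    layer_dense(units = 8, activation = "tanh", input_shape = ncol(X1_train),
                kernel_regularizer = regularizer_l2(l2)) %>%
    layer_dense(units = 1, activation = "sigmoid")
  net %>% compile(loss = "binary_crossentropy", optimizer = "adam", metrics = "accuracy")
  net %>% fit(X1_train, y1_train, epochs = 30, batch_size = 130,
              validation_split = 0.2, verbose = 0)
  as.numeric(auc(roc(y1_test, as.numeric(predict(net, X1_test)))))
})
names(auc_by_l2) <- l2_grid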

Finally, the input of the neural network is studied. Nonlinear regressors are added as input to the neural network, but not to the probit. The reason is that the neural network should perform equally well without the nonlinear regressors, since adding hidden layers can substitute for the nonlinear relationship (Cybenko, 1989; Hornik, 1991). To save the time of tuning every model, the nonlinear regressors are simply added as input, but it is important to check that tuning the neural network without nonlinear regressors would indeed result in an AUC equal to that of a neural network with nonlinear regressors. For normally distributed error terms and an upward parabola relationship (3.14) between the regressors and y*1, the AUC of the neural network is 0.788 and that of probit 0.513 (Table 3.1). The input of the neural network is changed to only x and z, so not including squares and cross terms. The network is changed so that the input layer has 2 nodes, the first hidden layer 5 nodes, the second hidden layer 5 nodes, the third hidden layer 5 nodes and the output layer 1 node. The resulting AUC was 0.764, which might increase to 0.788 by further tuning the model. So, tuning the model would have the same effect as adding nonlinear regressors, which justifies the procedure of adding nonlinear regressors to the neural network but not to the probit.

                 Linear           Nonlinear        Nonlinear+Cross
                 Probit   NN      Probit   NN      Probit   NN
1. Normal        0.849    0.850   0.512    0.801   0.513    0.788
2. Chi-squared   0.926    0.926   0.478    0.864   0.485    0.852
3. t(5)          0.924    0.924   0.499    0.874   0.522    0.792
4. Skewed t(5)   0.890    0.889   0.501    0.835   0.510    0.825

Table 3.1: AUC of probit and neural network under different distributions of u and different nonlinear relationships between y*1 and the regressors.¹

                 Parabola Up      Parabola Down    Lognormal regressors
                 Probit   NN      Probit   NN      Probit   NN
1. Normal        0.513    0.788   0.544    0.927   0.694    0.706
2. Chi-squared   0.485    0.852   0.500    0.976   0.801    0.816
3. t(5)          0.522    0.792   0.490    0.926   0.700    0.715
4. Skewed t(5)   0.510    0.825   0.479    0.958   0.757    0.777

Table 3.2: AUC of probit and neural network under different distributions of u and different nonlinear relationships between y*1 and the regressors.²

                 SNR=1            SNR=0.5          SNR=0.25
                 Probit   NN      Probit   NN      Probit   NN
1. Normal        0.513    0.788   0.508    0.696   0.514    0.606
2. Chi-squared   0.485    0.852   0.482    0.763   0.483    0.647
3. t(5)          0.522    0.799   0.521    0.716   0.519    0.652
4. Skewed t(5)   0.510    0.825   0.522    0.729   0.525    0.632

Table 3.3: AUC of probit and neural network under different distributions of u and different SNRs.

¹ Linear as in (3.9), Nonlinear as in (3.13) and Nonlinear+Cross as in (3.14). SNR=1 and N=10000.

² Parabola Up as in (3.14), Parabola Down as in (3.15) and Lognormal regressors as in (3.16). SNR=1 and N=10000.

L2   0      0.001  0.005  0.01   0.05   0.10   0.15   0.20   0.50
VS   0.791  0.792  0.792  0.792  0.787  0.785  0.784  0.784  0.706
CV   0.791  0.792  0.792  0.792  0.789  0.787  0.784  0.782  0.654

Table 3.4: AUC under different values of the L2-regularizer (L2) for validation split (VS) and stratified cross-validation (CV).

[Figure 3.1: (a) Linear relationship. (b) Nonlinear + Cross relationship. (c) Parabola Up relationship. (d) Lognormal relationship. (e) Lognormal relationship, different angle. (f) AUC of CV and VS for different L2 values. Panels (a)-(e) show the different relationships between x, z and y*1; panel (f) shows the AUC for different values of the L2-regularizer (L2) for validation split (VS) and stratified cross-validation (CV).]


3.2.3 Second-Step Estimation

Having studied the conditions under which a neural network outperforms a probit in the first step, we now study the implications for the full two-step model, where the outcome equation is estimated under selection bias. Four kinds of manipulations are analysed. Firstly, different correlations between the decision and outcome equations (3.9 and 3.10). Secondly, distributional assumptions are tested by looking at four different distributions for the error terms. Thirdly, we change the SNR of step 1, step 2 or both. Finally, nonlinearity in step 1 and in both steps is studied.

For OLS, Heckman's method (OLSH), Cosslett's method with probit (COSS), Cosslett's method with a neural network (COSSNN) and the two-stage neural network (2NN), estimates will be presented and can be compared with the true parameter values β1 = 2 and β2 = 1. The presented estimates are the average β̃ of the estimates β̂ over S = 100 samples, and the standard error is defined as the standard deviation of the estimates over the 100 samples. Since OLS is nested in all methods, an LR-test can be performed to test whether the extra parameters jointly contribute significantly to the model. We test at a significance level of α = 0.05 and count how often the H0 that the extra parameters do not jointly contribute to OLS is rejected in the S = 100 samples. Note that this is equivalent to performing an F-test for joint significance of the extra parameters.
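Since OLS is nested in each corrected model, the LR-test can be computed directly from two lm fits; the sketch below uses the Cosslett dummies as an example of the extra parameters, with illustrative variable names.

# LR-test of the extra bias-correction terms against plain OLS on the selected sample.
ols_fit  <- lm(y2 ~ x + q, subset = y1 == 1)
coss_fit <- lm(y2 ~ x + q + groups, subset = y1 == 1)    # OLS plus the correction terms

lr      <- as.numeric(2 * (logLik(coss_fit) - logLik(ols_fit)))
df      <- attr(logLik(coss_fit), "df") - attr(logLik(ols_fit), "df")
p_value <- pchisq(lr, df = df, lower.tail = FALSE)

# The equivalent F-test for joint significance of the extra parameters:
anova(ols_fit, coss_fit)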

The different methods are first tested for different correlations between the residuals, and therefore different amounts of bias, between the decision and outcome equations. Results are presented in Table 3.5. OLS shows more bias when ρ becomes larger, as would be expected, since a higher ρ means more dependence between the decision and outcome equations and therefore more selection bias. All methods except 2NN seem to correct reasonably for this bias, with the same standard error. 2NN does not seem to correct for the bias, and its bias tends to be related to the bias of OLS. In Table 3.9 the LR-test statistics can be found. Without bias (ρ = 0) the extra parameters do not contribute significantly to OLS, and with bias they do, as was to be expected, since without bias OLS gives consistent estimators. The 2NN method also contributes significant parameters but has biased results. This might be because the error is a function of x1, so the network predicts it significantly while not changing the estimates for x2.

Keeping ρ = 0.7, we now change the bivariate distribution of the error terms. From Table 3.6 we can conclude that the bias remains constant for OLS, which might be the result of the SNR of 3.33, which is relatively high. OLSH and COSS seem to correct for the bias, but the COSSNN bias is already a few percentage points higher. The 2NN estimates show bias, which again looks related to the OLS bias. Looking at Table 3.9, the LR-test statistics show that OLSH performs better than the other methods for all distributions. COSS seems to perform better when the error terms are t and skewed t distributed. 2NN has parameters that contribute significantly in around 50 of the 100 samples. Nevertheless, the bias of 2NN presented in Table 3.6 seems to make the LR-test irrelevant.

For the neural networks to work well, the SNR is an important measure, as we saw in the previous section. Therefore, different combinations of SNRs in the two equations are tested. Results are presented in Table 3.7. SNR1 is defined as the SNR of the decision equation (3.9) and SNR2 as the SNR of the outcome equation (3.10). The OLS estimates show bias for all combinations of SNR, but the bias is lower when SNR1 and SNR2 both equal 10. This might be because the bias is part of the error term (3.1), so a smaller error implies a smaller bias. In addition, SNR1=1 and SNR2=1 seems to give biased estimates for all methods. This might be the result of a relatively small signal of the parameters in the outcome equation, so that the methods can correct less for the bias; this might also explain the biased results for SNR1=3.33 and SNR2=1. For SNR1=1 and SNR2=3.33, OLSH, COSS and COSSNN seem to correct for the bias best, so the signal in the outcome equation is strong enough compared to the amount of selection bias. The 2NN method gives more bias than the other methods for all SNR combinations; note that its bias again seems correlated with the bias in the OLS estimates. Table 3.9 shows the LR-tests, which confirm that SNR1=1 and SNR2=3.33 most often gives unbiased estimates, because out of 100 samples the LR-test with the null hypothesis that the extra parameters do not jointly contribute significantly is rejected 100 times.

Next to the SNR, nonlinearities also played an important part in the performance of the neural network in the previous section. We study the consequences of either making the first step nonlinear (3.18), or both steps (3.18 and 3.19):

\[ y_1^{*} = \alpha_1 x^2 + \alpha_2 z^2 + \alpha_3 xz + u \tag{3.18} \]

\[ y_2^{*} = \beta_0 + \beta_1 x^2 + \beta_2 q^2 + \beta_3 xq + v, \tag{3.19} \]

with α and β chosen in such a way that SNR1 remains 3.33 and SNR2 remains 1. Figure 3.2 illustrates the nonlinearity between x, q and y*2. Table 3.8 presents the results for nonlinearity in the first step and in both steps. Compared to the linear case, nonlinearity in the first step leads to biased estimates for OLSH, while COSS, COSSNN and 2NN give estimates not more than 1 standard deviation from the true parameter values; 2NN and COSSNN seem to be more efficient than COSS. Nonlinearity in the first and second step leads to estimates far from the true parameter values for all methods. One reason might be that all methods are variants of OLS, so that nonlinearity in the second step is estimated badly, since the extra parameters do not correct the OLS in the second equation when the outcome equation is nonlinearly misspecified. The LR-test results in Table 3.9 confirm that with nonlinearity in the first step, COSSNN and 2NN work better. However, with both steps nonlinear, all estimated parameters seem biased (Table 3.8), so for this case the results of the LR-test in Table 3.9 are probably not reliable indicators for comparing OLS with the other bias-correcting methods.

Compared to the older methods, the two new methods do not seem to perform well when no nonlinearity is present. For the two Cosslett-type methods it is interesting to see how the accuracy of the first-step prediction relates to the second-step estimation. Taking one sample from the analysis with nonlinearity in steps 1+2 of Table 3.8, high prediction accuracy did not seem to increase the chance of rejection in the LR-test. The AUC of COSS was 0.532 and that of COSSNN 0.572, but the p-value of the LR-test was 0.4087 for COSS and 0.636 for COSSNN. So in this sample a higher AUC did not result in a lower p-value in the LR-test. Another measure could be Σ(y*1 − ŷ*1)², which would not be usable in empirical datasets because y*1 is not observed, but which might better predict performance in the second step. For probit this metric was 148.73; for the neural network it was 174.14. So probit had a lower residual sum of squares than the neural network. From this we might conclude that the AUC is an insufficient metric to predict which method has a lower p-value in the second-step LR-test.

An assumption made in Section 3.1.2 about the 2NN method is that if the regressors are uncorrelated with the ξ's, the estimates should approximate the true β1 and β2 reasonably well. This does not seem to hold, since ρ̂x,ξ = 0.0229 and ρ̂z,ξ = 0.0017 when ρ = 0.7, the error terms are bivariate normally distributed and SNR1=1 and SNR2=3.33: the regressors and ξ do not seem correlated, and still the results show a bias. However, the 2NN method seems to estimate β2 reasonably well compared to its estimates of β1. It might be that 2NN only estimates well if the regressors in the decision and outcome equations (3.9 and 3.10) are unique. Therefore (3.10) is modified so that the decision and outcome equations have no regressors in common:

\[ y_2^{*} = \beta_0 + \beta_1 w + \beta_2 q + v, \tag{3.20} \]

with w a vector of N random numbers from a N(0, 1) distribution. Now estimating with error terms drawn from a standardized bivariate normal distribution with ρ = 0.7, SNR1=1 and SNR2=3.33 led to β̃1 = 1.994 and β̃2 = 1.003, which is a clear improvement on β̃1 = 1.925 and β̃2 = 1.000 when there was overlap in regressors between the decision and outcome equations (Table 3.5). So this might explain the poor performance of the 2NN method.

Another surprising result is that the parameter β2, whose regressor q is only present in the outcome equation (3.10) and not in the decision equation (3.9), is estimated well by all analysed methods under all analysed conditions. Only the regressor x with parameter β1, which is present in both the decision and outcome equations, gave biased estimates when performing OLS on the outcome equation. This is surprising, since truncation and misspecification of the distribution of the error terms should make all estimates inconsistent (Heij et al., 2004). One explanation might be that the SNR2 of 3.33 is too high for the bias in the residuals to have an effect on estimates of regressors that are only present in the outcome equation. This hypothesis might be confirmed by looking at Table 3.7, since those results show that lowering SNR2 to 1 made the bias in β2 larger for all methods.


ρ           OLS            OLSH           COSS           COSSNN         2NN
0     β̃1   2.000 (0.12)   2.000 (0.13)   1.999 (0.14)   1.999 (0.14)   2.000 (0.13)
      β̃2   1.001 (0.12)   1.001 (0.12)   1.001 (0.12)   1.001 (0.12)   1.001 (0.12)
0.3   β̃1   1.969 (0.12)   1.998 (0.13)   1.998 (0.15)   1.997 (0.14)   1.967 (0.17)
      β̃2   0.998 (0.11)   0.998 (0.11)   0.998 (0.11)   0.998 (0.11)   0.998 (0.11)
0.7   β̃1   1.932 (0.12)   2.002 (0.12)   2.001 (0.13)   2.001 (0.13)   1.925 (0.20)
      β̃2   1.000 (0.11)   1.000 (0.11)   1.000 (0.11)   1.000 (0.11)   1.000 (0.11)
0.95  β̃1   1.908 (0.12)   2.001 (0.14)   1.999 (0.15)   1.999 (0.15)   1.894 (0.26)
      β̃2   1.002 (0.11)   1.002 (0.11)   1.002 (0.11)   1.002 (0.11)   1.002 (0.11)

Table 3.5: Estimated parameters of the outcome equation for different ρ's and methods.⁴

Distribution        OLS            OLSH           COSS           COSSNN         2NN
Normal       β̃1    1.932 (0.12)   2.002 (0.12)   2.001 (0.13)   2.001 (0.13)   1.925 (0.20)
             β̃2    1.000 (0.11)   1.000 (0.11)   1.000 (0.11)   1.000 (0.11)   1.000 (0.11)
Chi-squared  β̃1    1.943 (0.09)   2.009 (0.10)   2.003 (0.11)   1.991 (0.30)   1.934 (0.10)
             β̃2    0.998 (0.10)   0.999 (0.10)   1.000 (0.10)   0.998 (0.14)   0.997 (0.12)
t(5)         β̃1    1.940 (0.05)   2.004 (0.05)   1.999 (0.05)   1.990 (0.07)   1.923 (0.05)
             β̃2    1.003 (0.04)   1.004 (0.04)   1.004 (0.04)   1.003 (0.04)   1.004 (0.06)
Skewed t(5)  β̃1    1.937 (0.04)   2.012 (0.06)   2.007 (0.06)   1.992 (0.07)   1.935 (0.07)
             β̃2    1.003 (0.05)   1.002 (0.05)   1.001 (0.04)   1.002 (0.05)   1.002 (0.05)

Table 3.6: Estimated parameters of the outcome equation for residuals drawn from different standardized distributions.⁵

⁴ Estimated values β̃ are the average of the estimates β̂ over S=100 samples, with the standard error between parentheses. N=1000. α1 = √0.5, α2 = √0.5, α3 = 0.5, β1 = 2, β2 = 1. Error terms from a standardized bivariate normal distribution. The variance of the residuals is manipulated so that SNR1=1 and SNR2=3.33.

⁵ Estimated values β̃ are the average of the estimates β̂ over S=100 samples, with the standard error between parentheses. N=1000. α1 = √0.5, α2 = √0.5, α3 = 0.5, β1 = 2, β2 = 1. ρ=0.7. SNR1=1 and SNR2=3.33. Error terms drawn from the indicated standardized distributions.

SNR                        OLS            OLSH           COSS           COSSNN         2NN
SNR1=1, SNR2=3.33   β̃1    1.932 (0.12)   2.002 (0.12)   2.001 (0.13)   2.001 (0.13)   1.925 (0.20)
                    β̃2    1.000 (0.11)   1.000 (0.11)   1.000 (0.11)   1.000 (0.11)   1.000 (0.11)
SNR1=3.33, SNR2=1   β̃1    1.915 (0.14)   1.992 (0.14)   1.975 (0.15)   1.965 (0.18)   1.908 (0.16)
                    β̃2    0.978 (0.13)   0.979 (0.13)   0.979 (0.13)   0.979 (0.13)   0.979 (0.13)
SNR1=1, SNR2=1      β̃1    1.762 (0.14)   1.983 (0.15)   1.978 (0.15)   1.984 (0.23)   1.732 (0.21)
                    β̃2    0.981 (0.12)   0.979 (0.12)   0.979 (0.13)   0.979 (0.13)   0.980 (0.12)
SNR1=10, SNR2=10    β̃1    1.984 (0.02)   1.998 (0.02)   1.995 (0.03)   1.993 (0.03)   1.986 (0.03)
                    β̃2    0.996 (0.02)   0.996 (0.02)   0.997 (0.02)   0.996 (0.02)   0.996 (0.02)

Table 3.7: Estimated parameters of the outcome equation for different combinations of SNR in the decision and outcome equations.⁶

                           OLS            OLSH            COSS           COSSNN         2NN
Linear              β̃1    1.932 (0.12)   2.002 (0.12)    2.001 (0.13)   2.001 (0.13)   1.925 (0.20)
                    β̃2    1.000 (0.11)   1.000 (0.11)    1.000 (0.11)   1.000 (0.11)   1.000 (0.11)
Nonlinear step 1    β̃1    1.921 (0.06)   2.040 (0.33)    2.002 (0.13)   1.996 (0.05)   2.004 (0.05)
                    β̃2    0.991 (0.04)   0.991 (0.37)    0.991 (0.04)   0.991 (0.04)   0.991 (0.04)
Nonlinear step 1+2  β̃1   -0.008 (0.25)   0.992 (14.68)  -0.086 (0.85)   0.003 (0.36)   0.003 (0.34)
                    β̃2    0.004 (0.14)   0.007 (0.14)    0.001 (0.14)  -0.009 (0.13)  -0.009 (0.12)

Table 3.8: Estimated parameters of the outcome equation for different combinations of nonlinearity in the decision and outcome equations.⁷

⁶ Estimated values β̃ are the average of the estimates β̂ over S=100 samples, with the standard error between parentheses. N=1000. α1 = √0.5, α2 = √0.5, α3 = 0.5, β1 = 2, β2 = 1. ρ=0.7. Error terms drawn from a standardized bivariate normal distribution. SNR1 is the SNR of the decision equation and SNR2 that of the outcome equation.

⁷ Estimated values β̃ are the average of the estimates β̂ over S=100 samples, with the standard error between parentheses. N=1000. α1 = √0.5, α2 = √0.5, α3 = 0.5, β1 = 2, β2 = 1. ρ=0.7. Error terms drawn from a standardized bivariate normal distribution. Nonlinear step 1 as specified in (3.18) and nonlinear step 1+2 as specified in (3.18) and (3.19), with α1 = 1.5, α2 = √2.5, α3 = √2.5 and β1 = 2, β2 = 1, β3 = 1. Now u is divided by 15 and v by 4, so that the SNRs are unchanged.


Parameters                Heckman   Cosslett   CosslettNN   2NN
ρ
  0                           7         1           2        38
  0.3                       100        99         100        99
  0.7                       100       100         100       100
  0.95                      100       100         100       100
Distribution
  Normal                    100       100         100       100
  Chi-squared                58        17          13        49
  t                          97        55          30        68
  Skewed t                  100        78          57        61
SNR
  SNR1=1, SNR2=3.33         100       100         100       100
  SNR1=3.33, SNR2=1          77        35           6        48
  SNR1=1, SNR2=1             99        63          32        86
  SNR1=10, SNR2=10           78        39          20        12
Nonlinearity
  Not                       100       100         100       100
  First step nonlinear        6        36          51        77
  Both steps nonlinear       27        80         100       100

Table 3.9: How often the LR-test rejects H0 of the extra parameters not being jointly significant, out of S = 100 samples, for the different manipulations.


3.3 Conclusion

This chapter was in two parts. First we tested the binary prediction task, to participate or not, with probit and a neural network. Misspecification of the distribution of the error terms did not have an effect on the AUC of either probit or the neural network. Introducing nonlinearities between the regressors and the outcome variable resulted in a high AUC for the neural network, but a low AUC for probit. This result was weaker when the nonlinearity was specified by regressors drawn from a lognormal distribution, since both methods then achieved AUCs above 0.7. In addition, changing the SNR did have a large effect on the neural network: lowering the SNR resulted in a lower AUC. Changing hyperparameters like the L2-regularizer, or changing the way the neural network is validated, did not have a strong effect on the AUC.

In the second part, analyses were performed on both steps of Heckman's two-step estimation. OLSH, COSS and COSSNN performed well when changing ρ or the distribution of the error terms. Nonlinearity in the first step resulted in good estimates by COSSNN and COSS, while OLSH gave biased estimates. In all analyses 2NN did not perform well, with estimated parameters often close to the parameters estimated by OLS, which were biased. In other words, when the OLS estimates became biased, the 2NN estimates became biased as well, in a similar order of magnitude. In subsequent analyses we found that for 2NN to obtain less biased estimates, regressors of the decision equation should not be included in the outcome equation.

In the next chapter the empirical dataset is described. Afterwards models from this chapter are applied and discussed.


Chapter 4

Empirical Data

In this chapter we give a description of the dataset used for the empirical analysis. By applying the methods to an empirical dataset, we hope that the differences between the methods found in the Monte Carlo simulation also appear in a real-world setting. In the first section the dataset is described. In the second section the settings for the analyses are presented.

4.1 MEPS

The Medical Expenditure Panel Survey (MEPS, 2018) is an ongoing, nationally representative survey of the U.S. civilian non-institutionalized population, started in 1996 by the U.S. Department of Health and Human Services (MEPS, 2018). Surveys of households, employers and medical providers are conducted to collect information on respondents' health status, health care expenditures, demographic and socioeconomic characteristics, employment, access to care and satisfaction with care.

Data were collected from the MEPS website1; at the time of completing this thesis, datasets between 1996 and 2016 were available. Datasets of 2010 and earlier lack essential variables, including the majority of the health status variables, and are therefore dropped. We use the datasets from 2011 until 2016, for which 180,871 observations are available.

We model annual health expenditures. An overview of the dependent and independent variables used for the analysis, including their coding in the original dataset, is shown in Table 4.1. Some basic descriptive statistics for a selection of the variables can be found in Table 4.2. The regressors can be grouped into health insurance variables, health status variables and socioeconomic characteristics. Missing observations are imputed by means of maximum likelihood estimation. We focus on modeling two forms of annual individual health expenditure: total (TOTEXP) and paid out-of-pocket (TOTSLF). For both variables an econometric model needs to take account of two complications. First, health expenditures are zero for a large part of the population: 21.1% for TOTEXP and 38.1% for TOTSLF. Secondly, TOTEXP and TOTSLF are very right-skewed: for TOTEXP the mean is $3839, which is much larger than the median of

1 https://meps.ahrq.gov/data_stats/download_data_files, extracted on 05/10/2018.


$557 (see Figure 4.1). Likewise, for TOTSLF the mean is $436 and the median $40 (Figure 4.1). The reasons for analysing both TOTEXP and TOTSLF are twofold. Firstly, the share of people with positive expenditure differs between TOTEXP and TOTSLF. This difference in imbalance might affect the models differently, since models might be more biased when the imbalance is larger. Secondly, the estimated parameters might differ between the two, which might also affect the models in different ways. The parameters might differ for two reasons. Firstly, people who expect expenses might decide to insure themselves, which would give a positive TOTEXP but no observed TOTSLF. Secondly, people without insurance might not go to the doctor when they get sick. TOTSLF is thus zero both for people who avoid the doctor and for people who are insured, whereas TOTEXP is zero only in the first case. In other words, positive TOTEXP might depend on different factors than positive TOTSLF.


These complications are well accounted for by sample selection models (Cameron and Trivedi, 2005). This dataset is therefore a good test case for the new methods, and allows us to compare their inference and accuracy with the older methods.

Table 4.2 shows that overall 78.9% of respondents have positive TOTEXP and 61.9% positive TOTSLF. If the gap between these two percentages is large for a certain regressor category, relatively few people in that category pay anything out of pocket, and that group is therefore better insured. Vice versa, if the gap is small, a large part of the group pays the health expenditures themselves. For instance, people with excellent general health less often have positive health expenditures, but when they do, they more often pay out of pocket, possibly because they decided not to insure themselves given their good health. Note that the percentage with positive TOTSLF can never be larger than the percentage with positive TOTEXP, since anyone who pays health expenditures out of pocket automatically has positive total health expenditure.

Some additional manipulations are performed on the data. The natural logarithm of the health expenditures is taken (Figure 4.1). For the decision equation two dummies are added: for each dependent variable the dummy is zero if the dependent variable equals 0 and one if it is larger than 0. With these dummies we separately compare the prediction of the decision equation by a probit and by a neural network, as in Section 3.2. Furthermore, the squared terms of BMI, Age and FAMINC are added to the neural network, as well as the cross terms between these three variables. These squared and cross terms are normalized as input for the neural network. This makes the total number of regressors 104 for the models without a neural network and 110 for the methods based on a neural network. Furthermore, most categorical variables had, besides the categories yes and no, additional categories such as not ascertained, refused

(31)

to answer and inapplicable. These additional categories are all recoded to the category unknown (see Table 4.1). Since only a small percentage of the observations fall into these categories, we do not expect this recoding to significantly influence the results. A sketch of these preprocessing steps is given below.
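A minimal sketch of the preprocessing described above, assuming the pooled MEPS extract has been loaded into a pandas DataFrame named df with columns TOTEXP, TOTSLF, BMINDX31, AGE16X and FAMINC; these names and the exact transformation details are assumptions, not the thesis's own code:

```python
# Sketch of the preprocessing steps described above; df is a hypothetical
# pandas DataFrame holding the pooled MEPS data. Column names are assumptions.
import numpy as np
import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Participation dummies for the decision equation: 1 if expenditure > 0.
    out["D_TOTEXP"] = (out["TOTEXP"] > 0).astype(int)
    out["D_TOTSLF"] = (out["TOTSLF"] > 0).astype(int)
    # Log-transform positive expenditures; zeros stay missing because the
    # outcome equation is only estimated on the selected (positive) sample.
    out["ln_TOTEXP"] = np.where(out["TOTEXP"] > 0, np.log(out["TOTEXP"]), np.nan)
    out["ln_TOTSLF"] = np.where(out["TOTSLF"] > 0, np.log(out["TOTSLF"]), np.nan)
    # Squared and cross terms of the three numeric regressors (neural network only).
    numeric = ["BMINDX31", "AGE16X", "FAMINC"]
    for i, a in enumerate(numeric):
        out[f"{a}_sq"] = out[a] ** 2
        for b in numeric[i + 1:]:
            out[f"{a}_x_{b}"] = out[a] * out[b]
    # Standardize the added terms (mean 0, standard deviation 1).
    added = [c for c in out.columns if c.endswith("_sq") or "_x_" in c]
    out[added] = (out[added] - out[added].mean()) / out[added].std()
    return out
```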

Figure 4.1: Histograms of TOTEXP, TOTSLF, ln(TOTEXP) and ln(TOTSLF). Panels: (a) TOTEXP, (b) TOTSLF, (c) ln(TOTEXP), (d) ln(TOTSLF).2

2 For ln(TOTSLF), the bin at zero is not a mass point of zeros but of small values around zero. Although ln(TOTEXP) is still slightly skewed to the right and ln(TOTSLF) slightly skewed to the left, we consider the log transformation sufficient to deal with the outliers in both dependent variables.


Variable                          Source variable    Description                                        Type
Dependent variables
Total expenditure                 TOTEXP**           Total health expenditures                          Numeric
Total expenditure out-of-pocket   TOTSLF**           Total health expenditure paid by client self       Numeric
Independent variables
1. Health Insurance
Medicaid                          MCDEV              Did ever have Medicaid                             Binary
NeedHI                            ADINSA42           Do not need health insurance                       Categorical
PrivateHI                         PRVEV              Ever had private insurance                         Binary
Insured                           INSCOV             Insurance coverage indicator                       Categorical
2. Health Status
Mental Health                     MNHLTH31           Perceived mental health status                     Categorical
Physical Health                   RTHLTH31           Perceived health status                            Categorical
Pregnant                          PREGNT31           Pregnancy status                                   Categorical(a)
Smoker                            ADSMOK42           Smoking status                                     Categorical(a)
Limitation 1                      WLKLIM31           Limitation in physical functioning                 Categorical(a)
Limitation 2                      ACTLIM31           Any limitation in work or school                   Categorical(a)
Limitation 3                      UNABLE31           Completely unable to do activities                 Categorical(a)
BMI                               BMINDX31           Body Mass Index                                    Numeric
High blood pressure               HIBPDX             High blood pressure diagnosis                      Categorical(a)
Coronary heart disease            CHDDX              Coronary heart disease diagnosis                   Categorical(a)
Angina                            ANGIDX             Angina diagnosis                                   Categorical(a)
Heart attack                      MIDX               Heart attack (MI) diagnosis                        Categorical(a)
Orthodontist                      DVORTH16           Did visit orthodontist                             Binary
Stroke                            STRKDX             Stroke diagnosis                                   Categorical(a)
Emphysema                         EMPHDX             Emphysema diagnosis                                Categorical(a)
High cholesterol                  CHOLDX             High cholesterol diagnosis                         Categorical(a)
Diabetes                          DIABDX             Diabetes diagnosis                                 Categorical(a)
Arthritis                         ARTHDX             Arthritis diagnosis                                Categorical(a)
Asthma                            ASTHDX             Asthma diagnosis                                   Categorical(a)
ADHD                              ADHDADDX           ADHD diagnosis                                     Categorical(a)
Bronchitis                        CHBRON31           Chronic bronchitis                                 Categorical(a)
Cancer                            CANCERDX           Cancer diagnosis                                   Categorical(a)
Feet                              DSFB**53           Had feet checked                                   Categorical(a)
Eye                               DSEY**53           Dilated eye exam                                   Categorical(a)
Flu vaccination                   DSFL**53           Flu vaccination                                    Categorical(a)
General Health                    ADGENH42           Self-reported general health indicator             Categorical(a)
3. Socioeconomic Status & Demographics
White                             RACEV1X            Race: white or not white                           Binary
Sex                               SEX                Sex                                                Binary
Married                           MARRY**X           Marital status                                     Categorical
Age                               AGE16X             Age                                                Numeric
Region                            REGION31           Census region                                      Categorical
Occupation                        OCCCAT31           Occupation group                                   Categorical
Language                          LANGHM42           Language spoken at home is English                 Categorical
Student                           FTSTU**X           Is currently a student                             Categorical
Military                          ACTDY31            Is currently in the military                       Binary
Wage reported                     WAGIMP16           Willingness to report wage                         Categorical
Family income                     FAMINC**           Household total income                             Numeric
Doctor visit                      HAVEUS42           Goes to a doctor when sick                         Categorical(a)
Employed                          EMPST**            Is currently employed                              Categorical(a)

Table 4.1: Description of the variables, MEPS (2018). Categorical(a) means there are three categories: yes, no or unknown. Other categorical variables have variable-specific categories. ** indicates a year-dependent variable name.


                              N       %    %TOTEXP>0  %TOTSLF>0
ALL                        180871   100.0    78.9       61.9
Year
  2012                      38974    21.5    76.5       62.6
  2013                      36940    20.4    78.8       62.3
  2014                      34875    19.3    79.3       61.5
  2015                      35427    19.6    80.4       62.1
  2016                      34655    19.2    80.0       61.8
Dependent variables
Expenditure
  0                        142779    78.9     0          0
  >0                        38092    21.1   100         55.2
Expenditure self
  0                         68965    61.9    44.8        0
  >0                       111906    38.1    55.2      100
Independent variables
1. Health Insurance
Medicaid
  1 Yes                     56658    31.3    83.0       41.5
  2 No                     124213    68.7    77.1       71.2
NeedHI
  1 Disagree strongly       66636    36.8    86.7       79.7
  2 Disagree somewhat       17418     9.6    75.7       67.1
  3 Uncertain               81399    45.0    76.2       47.4
  4 Agree somewhat          10945     6.1    66.3       57.7
  5 Agree strongly           4473     2.5    57.6       49.7
PrivateHI
  1 Yes                     90306    49.9    84.6       77.1
  2 No                      90565    50.1    73.3       46.6
Insured
  1 Any private             92975    51.4    84.8       77.0
  2 Public only             61105    33.8    84.3       47.4
  3 Uninsured               26791    14.8    46.5       42.4
2. Health Status
Mental Health
  0 Unknown                  6191     3.4    56.2       32.1
  1 Excellent               78150    43.2    77.5       57.8
  2 Very good               45728    25.3    80.2       65.5
  3 Good                    39110    21.6    80.8       66.0
  4 Fair                     9407     5.2    89.1       76.4
  5 Poor                     2285     1.3    91.9       78.8
Physical Health
  0 Unknown                  6153     3.4    56.1       32.0
  1 Excellent               58848    32.5    75.2       52.4
  2 Very good               50495    27.9    79.6       64.4
  3 Good                    43570    24.1    80.8       66.4
  4 Fair                    16967     9.4    88.7       79.0
  5 Poor                     4838     2.7    95.4       87.7
Pregnant
  0 Unknown                145077    80.2    79.0       60.9
  1 Yes                      2113     1.2    96.2       75.0
  2 No                      33681    18.6    77.4       65.3
Smoker
  0 Unknown                 70551    39.0    77.4       45.6
  1 Yes                     17573     9.7    76.7       67.6
  2 No                      92747    51.3    80.5       73.1
BMI
  <18.5                      4761     2.6    75.4       66.6
  18.5-24.9                 94886    52.5    78.7       53.0
  25-29.9                   41812    23.1    77.1       69.4
  >30                       39412    21.8    82.0       74.6
General Health
  0 Unknown                 69852    38.6    77.5       45.4
  1 Excellent               20928    11.6    68.3       58.7
  2 Very good               36597    20.2    79.3       71.9
  3 Good                    35045    19.4    82.2       74.8
  4 Fair                    15298     8.5    88.6       81.7
  5 Poor                     3151     1.7    95.7       90.0
3. Socioeconomic Status & Demographics
Doctor Visit
  0 Unknown                  7387     4.1    47.9       34.2
  1 Yes                    133107    73.6    88.0       68.7
  2 No                      40377    22.3    54.6       44.3
Sex
  1 Male                    86504    47.8    73.9       56.5
  2 Female                  94367    52.2    83.5       66.8
Race
  1 White                  121726    67.3    79.8       77.1
  2 Not white               59145    32.7    77.1       22.9
Married
  0 Unknown                  5626     3.1    56.9       31.7
  1 Married                 60256    33.3    82.3       76.2
  2 Widowed                  7558     4.2    93.3       89.4
  3 Divorced                18095    10.0    83.4       76.7
  4 Never married           89336    49.4    75.9       48.8
Family Income
  0-8999                    18317    10.1    77.1       45.8
  9000-21499                28968    16.0    77.7       53.0
  21499-42499               41799    23.1    75.8       56.1
  >42500                    91807    50.8    81.1       70.5
Wage Reported
  1 Original               136831    75.7    81.6       61.5
  2 Estimated               38433    21.2    72.0       64.3
  3 Imputed                  5607     3.1    60.9       55.3

Table 4.2: Descriptive statistics (counts, shares, and percentages with positive TOTEXP and TOTSLF) for a selection of the variables, MEPS (2018).



4.2 Set-up Empirical Analysis

In this section the set-up of the analysis of the MEPS (2018) dataset is described. As in the simulation study, the first-step analyses are performed first, for both TOTEXP and TOTSLF. Subsequently, the two-step models are used to estimate the health expenditures TOTEXP and TOTSLF.

As in the simulation study of Chapter 3, the performance of the two Cosslett models (COSS and COSSNN) depends to a large extent on the ability to predict participation, or in this case, the ability to discriminate between positive and zero health expenditures. A binary prediction task is therefore performed with both a probit and a neural network.

For the first-step estimation the following settings are used. For the probit, all variables except the quadratic and cross terms are included, a total of 104 variables. For the neural network, the squared and cross terms of the three numeric variables BMI, AGE and FAMINC are added. The neural network has a first layer of 110 nodes with tanh activation, a second layer of 13 nodes with tanh activation, a third layer of 5 nodes with tanh activation and an output layer of 1 node with sigmoid activation. Training uses 15 epochs. Every layer has an L2 kernel regularizer of 0.001. The loss function is binary cross-entropy, the optimizer is Adam's stochastic gradient descent and the reported metric is accuracy. All numeric variables are scaled to have mean 0 and standard deviation 1. A sketch of this set-up is given below.
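A minimal sketch of this first-step network in Keras, mirroring the settings above; the arrays X (n × 110 regressors) and y (participation dummy) are hypothetical placeholders, and this is not the exact code used for the thesis:

```python
# Sketch of the first-step neural network described above (not the exact
# implementation used in the thesis). X and y are hypothetical placeholders.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

def build_first_step_model(n_inputs: int = 110) -> tf.keras.Model:
    model = tf.keras.Sequential([
        layers.Input(shape=(n_inputs,)),
        layers.Dense(110, activation="tanh", kernel_regularizer=regularizers.l2(0.001)),
        layers.Dense(13, activation="tanh", kernel_regularizer=regularizers.l2(0.001)),
        layers.Dense(5, activation="tanh", kernel_regularizer=regularizers.l2(0.001)),
        layers.Dense(1, activation="sigmoid", kernel_regularizer=regularizers.l2(0.001)),
    ])
    # Binary cross-entropy loss, Adam optimizer, accuracy as the reported metric.
    model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
    return model

# model = build_first_step_model()
# model.fit(X, y, epochs=15, batch_size=130, validation_split=0.2)
```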

For the second-step estimation we compare OLS, OLSH, COSS, COSSNN and 2NN on AIC; see the previous chapter for a description of these methods. For COSS and COSSNN, 25 dummies are used. For COSS the dummies are based on the probit of the first-step estimation, for COSSNN on the neural network of the first-step estimation. The neural network of 2NN uses 15 epochs, an L2-regularizer of 0.001, the same number of layers and nodes per layer as the neural network of COSSNN, and tanh activation functions in all layers. Its loss function is the mean squared error, the optimizer is stochastic gradient descent and the reported metric is the mean absolute error. To obtain asymptotically correct standard errors, a bootstrap with R = 100 replications is performed. Although R = 500 or R = 1000 is more common, this would be computationally infeasible for this dataset. For both neural networks a validation split of 0.2 and a batch size of 130 are used. A sketch of the Cosslett-type second step is given below.
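A minimal sketch of how the Cosslett-type second step could be implemented: the first-step fitted probabilities (from the probit for COSS, or the neural network for COSSNN) are cut into 25 bins, the bin dummies approximate the selection-bias function, and the outcome equation is estimated by OLS on the selected observations. The binning by quantiles and the names p_hat, X2, y and selected are assumptions, not the thesis's exact implementation:

```python
# Sketch of a Cosslett-style second step: approximate the selection-bias
# function with 25 interval dummies based on first-step fitted probabilities.
# p_hat, X2, y and selected are hypothetical names; quantile binning is an assumption.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def cosslett_second_step(y, X2, p_hat, selected, n_bins=25):
    """OLS of y on X2 plus bin dummies of p_hat, using the selected observations only."""
    bins = pd.qcut(p_hat, q=n_bins, labels=False, duplicates="drop")
    dummies = pd.get_dummies(bins, prefix="bin", drop_first=True, dtype=float)
    X = np.column_stack([X2, dummies.to_numpy()])
    X = sm.add_constant(X)
    return sm.OLS(y[selected], X[selected]).fit()
```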

Results of the model performance are presented in Table 5.2. Estimated parameter values are in Appendix B. Caution is needed when interpreting the estimated model parameters. Normally, taking the log would change the interpretation of the estimated parameters to percentage changes, but this is not valid when truncation is present. For the Tobit-2 model,

\[
\frac{\partial E(y_2 \mid x,\, y_1^{*} > 0)}{\partial x}
  = \gamma_2 - \sigma_{12}\,\lambda(x'\gamma_1)\bigl(x'\gamma_1 + \lambda(x'\gamma_1)\bigr)\gamma_1, \tag{4.1}
\]

where x is the union of x_1 and x_2, x_1'\beta_1 is rewritten as x'\gamma_1 and x_2'\beta_2 as x'\gamma_2, and \lambda(\cdot) denotes the inverse Mills ratio. See Cameron and Trivedi (2005).
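For completeness, a short derivation sketch of (4.1) under the standard Tobit-2 assumptions (bivariate normal errors, so that the bias function is the inverse Mills ratio):

```latex
% Derivation sketch of (4.1), assuming bivariate normal errors (Tobit-2).
\begin{align*}
E(y_2 \mid x,\, y_1^{*} > 0) &= x'\gamma_2 + \sigma_{12}\,\lambda(x'\gamma_1),
  \qquad \lambda(z) = \frac{\phi(z)}{\Phi(z)},\\
\frac{\partial \lambda(z)}{\partial z} &= -\lambda(z)\bigl(z + \lambda(z)\bigr),\\
\frac{\partial E(y_2 \mid x,\, y_1^{*} > 0)}{\partial x}
  &= \gamma_2 + \sigma_{12}\,\lambda'(x'\gamma_1)\,\gamma_1
   = \gamma_2 - \sigma_{12}\,\lambda(x'\gamma_1)\bigl(x'\gamma_1 + \lambda(x'\gamma_1)\bigr)\gamma_1.
\end{align*}
```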
