Misspecification in binary response models : an analysis of specification issues that arise in binary response modelling, using data on female labour force participation in the UK

MISSPECIFICATION IN BINARY RESPONSE MODELS

An analysis of specification issues that arise in binary response modelling, using data on female labour force participation in the UK

Morena Bastiaansen 10725792

Abstract

In the current paper, various kinds of misspecification that arise in binary response models are studied. The aim of this study is to effectively describe the sources, consequences and solutions to these specification issues. This is done by applying different combinations of models and estimation methods to 2006 data on female labour force participation in the UK.

Under the supervision of Dr. Eleni Aristodemou

Faculty of Economics and Business, University of Amsterdam

May 2018

Statement of Originality

This document is written by Student Morena Bastiaansen who declares to take full responsibility for the contents of this document.

I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it.

The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

Contents

1 Introduction

2 Theoretical Framework
2.1 Female labour force participation
2.2 Misspecification in Binary Response Models
2.2.1 Binary Response Models
2.2.2 Linear Probability Model for Binary Response
2.2.3 Index Models for Binary Response: Probit and Logit
2.2.4 Specification Issues

3 Data and Research Methods
3.1 Female labour force participation in the UK
3.2 Description of the data set
3.3 Research methods
3.3.1 Models
3.3.2 Estimation methods

4 Results
4.1 Linear probability model
4.1.1 LPM using ordinary least squares
4.1.2 LPM using weighted least squares
4.2 Probit
4.3 Logit

1 Introduction

An American citizen votes for Trump or not, a person uses Twitter or not, a woman has a job or not, a student attends a lecture or not. Those are all examples of binary choices. In economics as well as in other social sciences and in fields like medicine, many situations can be modeled as binary response models. In these models, the dependent variable of interest can only take on two possible values. These values are usually denoted by 1 and 0, “yes” and “no” or “true” and “false”.

A limited dependent variable is a variable whose range of possible values is restricted in some particular way. Because of the restriction on the possible outcomes of the dependent variable, binary response models fall within the definition of limited dependent variable models. The restriction on the dependent variable requires stronger assumptions on the parameters that need to be estimated. These stronger assumptions can cause great econometric challenges, and therefore can lead to various specification errors. When a model is not correctly specified, it fails to account for the relationship between the explanatory and the response variables. There are many different sources of misspecification. Some examples are omitted variables, heteroscedasticity of the errors and endogeneity of the regressors.

In econometrics, an endogeneity problem occurs when an explanatory variable is correlated with the error term (Heij et al., 2004). This leads to inconsistency of the estimators when the most common estimation method, ordinary least squares, is used. Consistency is a desirable property of an estimator. An estimator is said to be consistent if it converges in probability to the true value of the parameter being estimated as the sample size increases.

Lewbel, Dong and Yang (2012) discuss endogeneity in their paper. They summarise the advantages and disadvantages of some simple, practical ways to deal with the specification error of endogenous regressors in empirical binary choice models. They do this by comparing four different kinds of estimators: linear probability model estimators, control functions, maximum likelihood estimators and special regressor methods.

Klein and Spady (1993) propose a different estimator for binary response models. They formulate a semiparametric, “likelihood based” estimator under minimal distributional assumptions and evaluate the estimator by comparing it with other estimators.

In the current paper, various kinds of misspecification that arise in binary response models are studied. The aim of this study is to effectively describe the sources, consequences and solutions to these specification issues. This is done by applying different combinations of models (linear probability, probit and logit) and estimation methods to 2006 data on female labour force participation in the UK. Comparisons are made between these different estimators in terms of their consistency and efficiency. Finally, the best performing estimator in these areas is determined.

Section 2 contains a theoretical background as well as a more detailed explanation on binary response models. In Section 3, the used data set and the applied research methods for this paper are described. The fourth section holds an overview and discussion of the most important results. Finally, in section 5, a conclusion is summarised and topics for further research are discussed.

2 Theoretical Framework

2.1 Female labour force participation

Much research has been conducted that examines the determinants of the female participation rate. Because of social and behavioural changes over the past few decades, it might be of great interest to examine the factors underlying female participation and how these have changed over time. This section presents a theoretical framework for the study on the determinants of female labour force participation in the UK in 2006.

Jan Dirk Vlasblom and Joop J. Schippers (2004) study increases in female labour force participation in Europe. In their paper, they state that low educational levels and the effect of children are recognized as the most important factors for low female participation rates and that, over the last few decades, female labour supply in Europe has shown a large increase. They describe that, during the sixties and seventies, the average working woman was a young woman without children, but that this is much different nowadays. They explain that this may be caused by changes in, for example, the level of education or fertility. However, they also explain that the increase may be due to changes in behaviour, as influenced by the social and institutional context. In their study, they use data on West Germany, Spain, France, Italy, the Netherlands and the UK to compare female labour market participation and its determinants in 1992 and in 1999. Vlasblom and Schippers use the explanatory variables age, educational level and the number and age of the children of the respondent to estimate a model that explains female labour supply in these countries in 1992 and 1999. Their results show that especially the participation rates of married women increased over this decade. Their conclusion describes how, besides participation behaviour, some characteristics of the population also changed. To illustrate this, they explain that, in all countries, there has been an upward shift in education and a postponement of family formation. They also conclude that more women remain childless and that women who do have children tend to have fewer. As expected, their results show that education has a positive effect on labour market participation and that age as well as the number of children has a negative effect. Finally, their main conclusion is that increases in female participation rates are primarily explained by the changing effects of age on female labour supply. This can be interpreted as behavioural differences between generations.

Jaumotte (2003) discusses the past trends and determinants of female labour force participation in OECD countries. She also describes how female participation has increased strongly. Her paper states that surveys show preferences for the traditional male breadwinner model to be lower among couples with small children, and that they point to a large potential for increasing female labour supply. She reviews the market failures and distortions that might hinder female participation and provides new empirical evidence on their effects. She concludes that child benefits create an income effect that has a negative effect on female labour supply. The paper also concludes that increasing the availability of part time work as well as the provision of paid parental leave tends to have a positive effect on female participation. It is explained that the provision of parental leave as well as the supply of affordable childcare tends to stimulate full time participation more than part time participation.

This paper focuses on full time female participation. The determinants of full time female labour force participation in the UK in 2006 are analysed. This is done by applying binary response models to panel data on almost 70,000 UK women.

2.2 Misspecification in Binary Response Models

2.2.1 Binary Response Models

To analyse which explanatory variables have a significant effect on a woman's decision whether or not to work full time, the variable $FULLTIME_i$ is chosen as the dependent variable. This variable can take on two possible values: 1 if respondent $i$ worked a full time job in 2006, and 0 if respondent $i$ did not work a full time job in 2006:

$$FULLTIME_i = \begin{cases} 1 & \text{if respondent } i \text{ works full time} \\ 0 & \text{if respondent } i \text{ does not work full time} \end{cases}$$

Because of this binary dependent variable, the model is called a binary response model. A binary response model is a certain type of discrete response model. Wooldridge (2002) explains that in discrete response models, the variable to be explained, y, is a random variable taking on a finite number of outcomes. In linear models, y is called the response variable and x = (x1, x2, ..., xk) is the vector of explanatory variables or regressors. In binary response models, Wooldridge states that the interest lies primarily in the response probability, which is the probability that y takes on the value of 1, given the values of the regressors:

p(x) = P (y = 1|x) = P (y = 1|x1, x2, ..., xk) (2.1)

The response probability in this case would be the probability that a female in the UK in 2006 had a full time job, given her age, education, number of children and other variables that are used as regressors. When researching which effect regressor $x_j$ has on the explained variable, y, the partial effect of $x_j$ on the response probability is of great interest. This is given by the partial derivative of (2.1) with respect to $x_j$:

$$\frac{\partial P(y = 1|x)}{\partial x_j} = \frac{\partial p(x)}{\partial x_j} \qquad (2.2)$$

Wooldridge explains that modelling binary response data poses challenges because of the restriction on the dependent variable. The next two sections discuss which kinds of models are appropriate to model binary response data.

2.2.2 Linear Probability Model for Binary Response

Wooldridge states that the linear probability model (LPM) for binary response y is specified as

P (y = 1|x) = β0 + β1x1+ β2x2+ ... + βkxk (2.3)

Assuming that x1 is not functionally related to the other regressors, it can be derived that β1 = ∂P(y = 1|x)/∂x1. This means that β1 is the change in the probability of y being equal to 1 given a one-unit increase in x1, holding the other regressors fixed.

The fact that we are working with probabilities results in an extra challenge when applying a binary response model. This comes from the fact that probabilities must lie between 0 and 1. Wooldridge explains that, unless the range of the x vector is severely restricted, the LPM cannot form an accurate description of the response probability P(y = 1|x). He states that, for given values of the population parameters βj, there would usually be feasible values of x1, ..., xk such that the response probability β0 + β1x1 + ... + βkxk lies outside the unit interval, and this forms an issue. This is the reason that alternative models are needed to correctly model binary response.
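The unit-interval problem can be illustrated with a minimal sketch. The ages and full time indicators below are invented for illustration and are not taken from the LFS data; the helper `ols_simple` is a hypothetical name for the textbook closed-form OLS solution with one regressor.

```python
# Minimal sketch: a one-regressor linear probability model fitted by OLS.
# The data are invented for illustration; they are NOT the LFS data.

def ols_simple(x, y):
    """Return (intercept, slope) of the OLS regression of y on a constant and x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Hypothetical ages and full time indicators (1 = works full time).
age = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
fulltime = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

b0, b1 = ols_simple(age, fulltime)

# Fitted "probabilities" are linear in age, so nothing keeps them inside [0, 1].
fitted = [b0 + b1 * a for a in age]
out_of_range = [(a, round(p, 3)) for a, p in zip(age, fitted) if p < 0 or p > 1]

print("intercept, slope:", b0, b1)
print("fitted values outside the unit interval:", out_of_range)
```

In this toy sample the fitted values for the youngest and the oldest respondent fall outside the unit interval, which is exactly the issue that motivates the index models discussed later.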

Wooldridge also discusses appropriate estimation techniques for linear probability models. He states that the ordinary least squares, or OLS (see section 3.3.2), regression of y on 1, x1, x2, ..., xk produces consistent, unbiased estimators of the βj. He also states that heteroscedasticity is present unless all of the slope coefficients β1, ..., βk are zero. Wooldridge suggests using heteroscedasticity-robust standard errors or weighted least squares (WLS) as potential solutions to this problem. Weighted least squares is another estimation method that is explained in section 3.3.2. Wooldridge explains that this method is asymptotically more efficient than OLS. He also states that, when WLS is used, all other testing can be done using F-statistics or LM statistics.
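The reason heteroscedasticity is unavoidable in the LPM can be made explicit with a short derivation. It relies only on the fact that y is a Bernoulli variable with success probability p(x) as in (2.3); the fitted values $\hat{y}_i$ and the weights $w_i$ are notation introduced here for illustration.

$$\mathrm{Var}(y|x) = p(x)\,[1 - p(x)] = (\beta_0 + \beta_1 x_1 + ... + \beta_k x_k)\,[1 - (\beta_0 + \beta_1 x_1 + ... + \beta_k x_k)]$$

This variance depends on x unless $\beta_1 = ... = \beta_k = 0$. A natural set of WLS weights is therefore $w_i = 1/[\hat{y}_i(1 - \hat{y}_i)]$, where $\hat{y}_i$ is the OLS fitted value. These weights are only defined when $0 < \hat{y}_i < 1$, so fitted values outside the unit interval need some adjustment before WLS can be applied.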

2.2.3 Index Models for Binary Response: Probit and Logit

Wooldridge states that, to be able to restrict the range of possible outcomes of the response probability, index models can be used instead of linear models. He presents the form of the binary response model in this particular case:

P (y = 1|x) = G(xβ) = p(x) (2.4)

We can easily see that, for the previously discussed linear probability model, G(z) = z is the identity function. For index models it is assumed that G(.) takes on values in the open unit interval: 0 < G(z) < 1 for all z. Wooldridge describes that an index model restricts the way in which the response probability depends on x and that the model is called an index model because p(x) is a function of x only through the index xβ = β1 + β2x2 + ... + βkxk. He explains that the index is mapped to the response probability by the function G.

In most applications G is a cumulative distribution function, and Wooldridge notes that its specific form can sometimes be derived from an underlying economic model. He shows that, more generally, those index models can be derived from an underlying latent variable model:

y∗ = xβ + e, y = 1[y∗ > 0] (2.5)
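The link between the latent variable model (2.5) and a response probability can be checked by simulation. The sketch below assumes a standard normal error e and a single, hypothetical value of the index xβ; under these assumptions the share of simulated outcomes with y = 1 should be close to the standard normal cdf evaluated at the index.

```python
import math
import random

def normal_cdf(z):
    """Standard normal cdf, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

random.seed(0)   # fixed seed so the sketch is reproducible
index = 0.5      # hypothetical value of x*beta, chosen for illustration
n = 20000

# y = 1[y* > 0] with y* = x*beta + e and e drawn from N(0, 1) (an assumption).
draws = [1 if index + random.gauss(0.0, 1.0) > 0 else 0 for _ in range(n)]
share = sum(draws) / n

print("simulated P(y = 1):", share)
print("normal cdf at the index:", normal_cdf(index))
```

Replacing the normal draw with a logistic one would, by the same argument, reproduce the logistic cdf instead.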

In this paper, two cases of index models are discussed and applied: probit and logit. Wooldridge shows that the probit model is the special case of an index model (2.4) with:

$$G(z) = \Phi(z) \equiv \int_{-\infty}^{z} \phi(v)\,dv \qquad (2.6)$$

with $\phi(z)$ the standard normal density:

$$\phi(z) = (2\pi)^{-1/2} \exp(-z^2/2)$$

Wooldridge also presents another special case of equation (2.4), which is the logit model with:

G(z) = Λ(z) = exp(z)/[1 + exp(z)] (2.7)

It can be concluded that the logit model arises from model (2.5) where e follows the standard logistic distribution.
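The two link functions and the partial effects they imply can be sketched in a few lines. For an index model, applying the chain rule to (2.4) gives ∂p(x)/∂xj = g(xβ)βj, where g is the derivative of G. The function names and the coefficient value below are hypothetical and serve only as an illustration.

```python
import math

def probit_cdf(z):
    """Phi(z): standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_pdf(z):
    """phi(z): standard normal density, the derivative of Phi."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def logit_cdf(z):
    """Lambda(z) = exp(z) / (1 + exp(z)), written in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-z))

def logit_pdf(z):
    """Derivative of Lambda: Lambda(z) * (1 - Lambda(z))."""
    p = logit_cdf(z)
    return p * (1.0 - p)

def partial_effect(pdf, index, beta_j):
    """Partial effect of x_j on P(y = 1 | x) in an index model: g(x*beta) * beta_j."""
    return pdf(index) * beta_j

# Both links map any index value into the open unit interval.
for z in (-3.0, 0.0, 3.0):
    print(z, probit_cdf(z), logit_cdf(z))

# Hypothetical coefficient of 0.5, evaluated at an index of zero.
print(partial_effect(probit_pdf, 0.0, 0.5), partial_effect(logit_pdf, 0.0, 0.5))
```

Unlike in the LPM, the partial effect is not constant: it depends on the value of the index at which it is evaluated and is largest where g peaks, at an index of zero.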

Wooldridge explains how the unknown parameters in a binary response index model can be estimated by the method of maximum likelihood (see section 3.3.2).

This paper studies specification issues that arise in binary response modelling, when the models mentioned above are applied. The strong assumptions that are needed because of the restricted dependent variable can cause these specification issues. The kinds of misspecification are studied as well as their causes, consequences and solutions. The next section discusses previous studies on misspecification in binary response models.

2.2.4 Specification Issues

There are several specification issues that can arise when applying binary response models, like probit and logit, to economic data. In this paper, LPM as well as both probit and logit models are applied to 2006 data on female labour force participation in the UK. The kinds of misspecification that arise in this process are analysed and the different estimators are compared. Wooldridge sums up and discusses three different kinds of specification issues that can arise in binary response modelling, focussing on probit models.

Firstly he discusses neglected heterogeneity, which is the problem of incorrectly leaving out variables when those left out variables are independent of the included explanatory variables. The model of interest is presented by:

P (y = 1|x, c) = Φ(xβ + γc) (2.8)

where x is 1 × K with x1 ≡ 1 and c is a scalar. If c is correlated with x, Wooldridge concludes, this causes endogeneity. The problem of endogeneity is explained below. If c is in some way dependent on x, which might be the case when, for example, Var(c|x) depends on x, then omission of c has serious consequences: the estimates will be inconsistent and this affects the validity of the results. But even if c and x are independent, the variance will be affected. This leads to inefficiency, lowering the quality of the estimator and therefore also affecting the validity of the results. Wooldridge explains how, in probit analysis, neglected heterogeneity is a much more serious problem than in linear models. This is because, even if c and x are independent, the probit coefficients are inconsistent.
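The inconsistency under independence can be seen from a short derivation. It assumes, in addition to (2.8), that c is independent of x and normally distributed with mean zero and variance $\tau^2$; the symbols $\tau^2$ and $\sigma^2$ are notation introduced here. Writing the latent variable model as $y^* = x\beta + \gamma c + e$ with $e \sim N(0, 1)$, the composite error $\gamma c + e$ is normal with variance $\sigma^2 = 1 + \gamma^2\tau^2$, so that

$$P(y = 1|x) = P(\gamma c + e > -x\beta \mid x) = \Phi(x\beta/\sigma), \qquad \sigma^2 = 1 + \gamma^2\tau^2$$

A probit of y on x therefore estimates $\beta/\sigma$ rather than $\beta$. Since $\sigma > 1$ whenever $\gamma \neq 0$, the estimated coefficients are scaled towards zero, while their signs and relative sizes are preserved.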

The specification issue of endogenous explanatory variables is also described by Wooldridge. This is the case when one of the explanatory variables is correlated with the error term in the latent variable model. Endogeneity can have serious consequences for the results. When one or more of the regressors is endogenous, this can have the consequence that the parameter estimates are no longer consistent, affecting the validity of the results (Heij et al, 2004, p. 398).

The third specification issue that Wooldridge discusses is that of heteroscedasticity. This is the situation where the variances of the error terms are not constant, which can lead to inefficiency (Heij et al, 2004, p.320). In binary response models, however, heteroscedasticity also leads to inconsistency.

Heckman (1979) discusses yet another form of misspecification, which is sample selection bias. He states in his paper that this specification issue results from non-randomly selected samples to investigate behavioural relationships. This is problematic, because biased estimators also affect the validity of the results (Heij et al, 2004).

Several studies have been carried out on misspecification in binary response models. Lewbel, Dong and Yang (2012) state that, in binary response models, regressors may be endogenous or mismeasured and errors are likely to be heteroscedastic. This paper studies the problem of misspecification in binary response models by applying both probit and logit models to data on female labour force participation. The specification issues, mentioned above, that arise in this process are analysed as well as their causes, consequences and solutions.

3 Data and Research Methods

3.1 Female labour force participation in the UK

This section contains a description of the data as well as the research methods used for this study. To be able to analyse specification issues in binary response models, a binary dependent variable is needed. The binary dependent variable used in this study is full time female labour force participation in the UK (see section 2.2.1). Factors that determine whether a female works full time or not include age, number of children and education, as described in the section "Theoretical framework". Therefore, an appropriate dataset would be one containing information on a sample of adult females concerning the previously mentioned factors and the intended dependent variable: whether the female works full time or not. Using different binary response models in combination with different estimation techniques, these data can be used to analyse the arising specification issues and their consequences.

3.2 Description of the data set

The data used for this study are retrieved from the UK Data Archive. This online data archive belongs to the UK Data Service. The UK Data Service provides access to high-quality local, regional, national and international social and economic data to meet the data needs of students, researchers and teachers from all sectors. The collection includes extensive UK government-sponsored surveys, UK census data, international aggregate, qualitative data and business data.

The data set considered for this study is indicated by study number 8161 and title "Female Labor Supply, Human Capital and Welfare Reform: Data from the Labour Force Survey, 1993-2006, and the British Household Panel Survey, 1991-2008". It contains two datafiles: the British Household Panel Survey (BHPS) datafile and the Labour Force Survey (LFS) datafile. This study uses the LFS file. This file contains 1153107 cases and 101 variables concerning women in the UK. The variables vary from age to marital status and number of children.

Firstly, the data are filtered in such a way that only the observations from 2006 remain. The observations where the respondent is already retired are then dropped. To be able to make use of the desired binary response models, the dependent variable is created as a dummy variable, taking on the value "1" for all cases where the respondent works full time and the value "0" for all cases where the respondent does not work full time. The latter means that the respondent works either part time or not at all. What remains is a data set with 72689 observations and all possibly relevant variables.
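The preparation steps described above can be sketched as follows. The field names `year`, `retired` and `status`, as well as the records themselves, are hypothetical and are not the actual LFS variable names or data; the study itself performed these steps in Stata.

```python
# Minimal sketch of the sample selection and dummy construction.
# Field names and records are hypothetical, not the real LFS variables.
records = [
    {"year": 2006, "retired": False, "status": "full_time"},
    {"year": 2006, "retired": False, "status": "part_time"},
    {"year": 2005, "retired": False, "status": "full_time"},
    {"year": 2006, "retired": True, "status": "not_working"},
    {"year": 2006, "retired": False, "status": "not_working"},
]

# Keep only observations from 2006 and drop retired respondents.
sample = [r for r in records if r["year"] == 2006 and not r["retired"]]

# FULLTIME = 1 for full time work, 0 for part time work or no work at all.
fulltime = [1 if r["status"] == "full_time" else 0 for r in sample]

print(len(sample), fulltime)
```

Note that part time workers and non-workers end up in the same category, which matches the definition of the dependent variable used in this study.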

The next step is then to determine which independent variables to include in the model. When determining the explanatory variables, the theory is taken into account. The previous section discusses which factors are considered to play a role in female labour force participation in other literature. Based on this theoretical background the following list of variables is chosen for this study:

• AGE_i:
The age of respondent i at the time of the interview.

• KIDS2_i:
The number of kids, younger than 2, of respondent i at the time of the interview.

• KIDS4_i:
The number of kids, between 2 and 4, of respondent i at the time of the interview.

• KIDS9_i:
The number of kids, between 5 and 9, of respondent i at the time of the interview.

• KIDS15_i:
The number of kids, between 10 and 15, of respondent i at the time of the interview.

• DEGREE_i:
Value = 1 if respondent i has a graduate degree at the time of the interview.
Value = 0 if respondent i does not have a graduate degree at the time of the interview.

• COUPLE_i:
Value = 1 if respondent i is living together with their partner at the time of the interview.
Value = 0 if respondent i is not living together with their partner at the time of the interview.

• SINGLEMOM_i:
Value = 1 if respondent i is a single mother at the time of the interview.
Value = 0 if respondent i is not a single mother at the time of the interview.

• WAGE_i:
Gross weekly earnings of respondent i in British pounds.

In the following subsection, the different applied models and estimation techniques are discussed. For each model, it is described which of these possibly relevant variables are used.

3.3 Research methods

This section lists which models are applied to the data and which estimation methods are used to estimate the parameters.

3.3.1 Models

The first model applied in this study is the linear probability model (LPM). As described in the previous section, "theoretical framework", Wooldridge (2002) states that this model is specified as:

P (y = 1|x) = β0+ β1x1+ β2x2+ ... + βkxk (3.1)

Here, P (y = 1|x) denotes the probability that the dependent variable, y, takes on the value "1". In our case this would mean, the probability that a female in the data set worked full time in 2006. The explanatory variables used for this model are stated in the previous paragraph.

In paragraph 2.2.2 it is explained how the linear probability model does not restrict the response probability to the unit interval and therefore poses an extra challenge when it is applied to binary response data. The second model applied in this study is therefore the probit model. This index model resolves the previously mentioned problem, as described in paragraph 2.2.3.

The third model, and second index model, that is applied in this study is the logit model. The way the logit model differs from the probit model is also described in paragraph 2.2.3.

3.3.2 Estimation methods

There are many different estimation techniques to estimate the parameters in the models listed above. Every technique has its own advantages and disadvantages, also depending on which model it is applied to. This section lists which estimation methods are applied to which models in this study.

• Ordinary least squares (OLS):

In a linear regression model, ordinary least squares, or linear least squares, is the most common method for estimating the unknown parameters. Heij et al. (2004, p.80) explain how this method involves minimising the sum of squared residuals.

Several assumptions are required for this method to yield unbiased, efficient and consistent estimators (Heij et al., 2004, pp. 92-93). Therefore, sometimes OLS is not appropriate. When, for example, trying to estimate the parameters of a non-linear model, other estimation techniques are required.

• Weighted least squares (WLS):

Weighted least squares is another estimation technique. When heteroscedasticity (see paragraph 2.2.4) is present, this estimation method yields a more efficient estimator than OLS (Heij et al., 2004). This means that the assumption of homoscedasticity that is required for OLS can be dropped. When using WLS, instead of minimising the residual sum of squares, the weighted sum of squares is minimised:

$$\sum_i w_i (y_i - x_i\beta)^2 \qquad (3.3)$$

where the $w_i$ are called the weights. With WLS, the weights are chosen inversely proportional to the error variance of observation i, so that noisier observations count less. With OLS, $w_i = 1$ for every value of i.

• Maximum likelihood (ML):

A third estimation method is the method of maximum likelihood. Heij et al. explain that a model consists of a set $\{f_\theta; \theta \in \Theta\}$ of joint probability distributions for $y_1, ..., y_n$. They state that, for the given observations, the distribution gives a certain value $f_\theta(y_1, ..., y_n)$ for every value of $\theta$. The value of $\theta$ for which this probability is maximal is the maximum likelihood estimate. This estimate is derived by maximising the likelihood function:

$$L(\theta) = f_\theta(y_1, ..., y_n), \quad \theta \in \Theta \qquad (3.4)$$

Wooldridge states that the density of $y_i$ given $x_i$ can be written as:

$$f(y|x_i; \beta) = [G(x_i\beta)]^{y}\,[1 - G(x_i\beta)]^{1-y}, \quad y = 0, 1$$

where G(.) is the same G(.) as in (2.4).

Because the logarithm is a strictly increasing function, it does not matter whether the likelihood function or its logarithm, the log-likelihood, is maximised. Taking the logarithm of the function can simplify the maximisation process:

$$\ell_i(\beta) = y_i \log[G(x_i\beta)] + (1 - y_i)\log[1 - G(x_i\beta)]$$
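To make the maximum likelihood method concrete, the following sketch maximises the logit log-likelihood, the sum over i of yi log Λ(xiβ) + (1 − yi) log[1 − Λ(xiβ)], with Newton-Raphson steps. It is an illustration only: the data are invented, the model has a constant and one regressor, and the function names are hypothetical. The study itself relies on Stata's built-in routines.

```python
import math

def logistic(z):
    """Lambda(z) = exp(z) / (1 + exp(z)), written in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(beta, x, y):
    """Logit log-likelihood for a model with a constant and one regressor."""
    b0, b1 = beta
    total = 0.0
    for xi, yi in zip(x, y):
        p = logistic(b0 + b1 * xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total

def fit_logit(x, y, iterations=25):
    """Maximise the log-likelihood with Newton-Raphson, starting from zero."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = logistic(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p            # score with respect to the constant
            g1 += (yi - p) * xi     # score with respect to the slope
            h00 += w                # elements of the negative Hessian
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: beta_new = beta + (negative Hessian)^(-1) * score
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Invented, non-separable data: a centred regressor and a binary outcome.
x = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
y = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

estimate = fit_logit(x, y)
print("estimates:", estimate)
print("log-likelihood at estimate:", log_likelihood(estimate, x, y))
print("log-likelihood at zero:", log_likelihood((0.0, 0.0), x, y))
```

Because the logit log-likelihood is globally concave, Newton-Raphson typically converges in a handful of iterations when the data are not perfectly separable. A probit version would replace `logistic` by the standard normal cdf and adjust the score and Hessian accordingly.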

4 Results

In this section the results of the estimated regressions are discussed. All Stata outputs, as well as the Stata do-file used for this study, can be found in the appendix.

4.1 Linear probability model

The first model that is applied is the linear probability model. The following model is estimated using Stata:

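$$P(FULLTIME = 1|x_1, ..., x_9) = \beta_1 + \beta_2 x_1 + \beta_3 x_2 + \beta_4 x_3 + \beta_5 x_4 + \beta_6 x_5 + \beta_7 x_6 + \beta_8 x_7 + \beta_9 x_8 + \beta_{10} x_9 \qquad (4.1)$$

with

$$x_1 = AGE,\ x_2 = KIDS2,\ x_3 = KIDS4,\ x_4 = KIDS9,\ x_5 = KIDS15,\ x_6 = DEGREE,\ x_7 = COUPLE,\ x_8 = SINGLEMOM \text{ and } x_9 = WAGE. \qquad (4.2)$$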

The estimated coefficients as well as the results of performed tests on heteroscedasticity and endogeneity are discussed.

4.1.1 LPM using ordinary least squares

Firstly LPM is estimated by the estimation method of OLS (see paragraph 3.3.2). Stata gives the following results for the estimation of the coefficients:

Figure 1: Estimated coefficients in LPM regression using OLS

The estimated coefficients all have the expected sign. The literature investigated for this study showed that women are more likely to work full time when they are young and do not yet have children. The estimated coefficient for the variable AGE is indeed negative, as are the estimated coefficients for the variables KIDS2, KIDS4, KIDS9 and KIDS15. The dummy variable SINGLEMOM has a negative sign. This could be explained by the fact that a single mother does not have a partner with whom to share the supervision of the children. The dummy variable COUPLE also has a negative sign. This is also as expected, as a single woman does not have the opportunity to rely on the income of a partner. The positive signs of the estimated coefficients for DEGREE and WAGE are also as expected: there is a lot of literature that shows that a higher degree as well as a higher market value has a positive effect on the labour force participation of females.

The results of the F-test show that the explanatory variables are jointly significant at the 1% significance level. Also, the p-values of the t-statistics show that every explanatory variable is individually significant. The R-squared is 28%, which is relatively low.

The null hypothesis of the test for heteroscedasticity (see section 2.2.4) is that there is no heteroscedasticity. The p-value of 0.0838 shows that the null hypothesis is rejected at the 10% significance level, although not at the 5% level. There is therefore some, but not strong, evidence for the existence of heteroscedasticity in this model.

The independent variable SINGLEMOM might be endogenous, as it can be expected that this variable might be correlated with one or more of the variables KIDS2, KIDS4, KIDS9 or KIDS15, since for a single mother it is certain that she has one or more children. Therefore, a test for endogeneity is performed. The null hypothesis of the test is that the variable SINGLEMOM is exogenous. The results of this test show that there is not sufficient evidence to reject this null hypothesis.

4.1.2 LPM using weighted least squares

Because of the presumable existence of heteroscedasticity in the model it is also estimated using the estimation method of WLS (see section 3.3.2). Stata gives the following results for the estimation of the coefficients:

Figure 2: Estimated coefficients in LPM regression using WLS

The weights chosen for this regression are the absolute values of the residuals. This choice was made because it resulted in the highest R-squared by far: 49%, which is much higher than the R-squared using OLS.

The signs of the estimated coefficients are all the same as when using OLS. The values of the coefficients do not show any notable changes from the ones using OLS, except maybe for those of KIDS9 and KIDS15 which are somewhat higher when using WLS.

The results of the F-test show that the explanatory variables are jointly significant at the 1% significance level. Also, the p-values of the t-statistics show that every explanatory variable is individually significant.

The results of the test for heteroscedasticity show that there is not sufficient statistical evidence to reject the null hypothesis of no heteroscedasticity. This is as expected, as the literature studied in section 2 suggested WLS as an estimation method to correct for heteroscedasticity.

The results of the test for endogeneity again show that there is not sufficient significant evidence to reject the null hypothesis that SIN GLEM OM is exogenous.
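For comparison, the textbook two-step (feasible) WLS procedure for a linear probability model can be sketched as follows. This is not the weighting scheme used above, which is based on the absolute values of the residuals: the sketch instead uses weights equal to the inverse of the estimated Bernoulli variance. The data, the function names and the clipping bounds of 0.01 and 0.99 are assumptions made purely for illustration.

```python
# Minimal sketch of two-step (feasible) WLS for a linear probability model.
# Data, names and clipping bounds are illustrative assumptions only.

def wls_simple(x, y, w):
    """Weighted least squares of y on a constant and x with weights w."""
    sw = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def clip(p, low=0.01, high=0.99):
    """Force a fitted value into (0, 1) so that p * (1 - p) stays positive."""
    return min(max(p, low), high)

age = [20, 25, 30, 35, 40, 45, 50, 55, 60, 65]
fulltime = [1, 1, 1, 1, 0, 1, 0, 0, 0, 0]

# Step 1: OLS, which is WLS with unit weights.
b0_ols, b1_ols = wls_simple(age, fulltime, [1.0] * len(age))

# Step 2: estimate Var(y|x) = p(1 - p) from the clipped OLS fitted values.
fitted = [clip(b0_ols + b1_ols * a) for a in age]
weights = [1.0 / (p * (1.0 - p)) for p in fitted]

# Step 3: re-estimate with weights inversely proportional to the variance.
b0_wls, b1_wls = wls_simple(age, fulltime, weights)

print("OLS:", b0_ols, b1_ols)
print("WLS:", b0_wls, b1_wls)
```

The clipping step is needed precisely because LPM fitted values can fall outside the unit interval, in which case the estimated variance would be negative and the weight undefined.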

4.2 Probit

The second model that is applied is the probit model:

P (F U LLT IM E = 1|x1, ..., x9) = Φ(xβ) (4.3)

with Φ(.) as in (2.6), x1, ..., x9 as in (4.2) and

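$$x\beta = \beta_1 + \beta_2 x_1 + \beta_3 x_2 + \beta_4 x_3 + \beta_5 x_4 + \beta_6 x_5 + \beta_7 x_6 + \beta_8 x_7 + \beta_9 x_8 + \beta_{10} x_9. \qquad (4.4)$$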

The probit model is estimated using maximum likelihood (see paragraph 3.3.2). Stata gives the following output for the probit regression using ML:

Figure 3: Estimated coefficients in probit regression using ML

The estimated constant is almost twice as high as the OLS estimate in the linear probability model, and the same holds for every other coefficient. The signs of the estimated coefficients are all equal to those obtained with the linear probability model.
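When comparing coefficient magnitudes across the two models, it may help to write out the effect of a continuous regressor $x_j$ on the response probability. The expressions below are standard results rather than output of this study, and $\phi(\cdot)$, the standard normal density, is notation introduced here for illustration.

```latex
\text{LPM:}\quad
\frac{\partial P(\mathit{FULLTIME} = 1 \mid x)}{\partial x_j} = \beta_j ,
\qquad
\text{probit:}\quad
\frac{\partial P(\mathit{FULLTIME} = 1 \mid x)}{\partial x_j} = \phi(x\beta)\,\beta_j ,
\qquad
\phi(x\beta) \le \phi(0) = \frac{1}{\sqrt{2\pi}} \approx 0.399 .
```

Because the scaling factor $\phi(x\beta)$ never exceeds roughly 0.4, a probit coefficient has to be at least about 2.5 times as large as an LPM coefficient to imply the same effect on the probability, so larger probit coefficients do not by themselves indicate larger effects.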

The p-values show that every explanatory variable is significant. The p-value of the LR test statistic is equal to zero. The null hypothesis of this test is that all regression coefficients in the model other than the constant are equal to zero, which can be rejected based on the low p-value. Testing for joint significance also results in a p-value of zero, from which it can be concluded that the explanatory variables are jointly significant.
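A minimal sketch of how such an LR statistic and its p-value are computed is given below, assuming the comparison is between the full model and a constant-only model, which gives 8 restrictions for the slope coefficients $\beta_2, \ldots, \beta_9$ in (4.4). The log-likelihood values are hypothetical placeholders, not the values from Figure 3, and the helper `chi2_sf_even_df` uses a closed form that is valid only for an even number of degrees of freedom.

```python
from math import exp, factorial


def chi2_sf_even_df(x, df):
    """P(chi-square with df degrees of freedom > x), exact for even df only."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form requires a positive even df")
    half = x / 2.0
    return exp(-half) * sum(half ** k / factorial(k) for k in range(df // 2))


def lr_test(loglik_restricted, loglik_unrestricted, df):
    """Likelihood ratio statistic 2 * (lnL_u - lnL_r) and its p-value."""
    stat = 2.0 * (loglik_unrestricted - loglik_restricted)
    return stat, chi2_sf_even_df(stat, df)


# Hypothetical log-likelihoods: constant-only model versus full model.
stat, p_value = lr_test(-1000.0, -950.0, df=8)
print("LR statistic:", stat)
print("p-value:", p_value)
```

With a statistic of this size the p-value is far below any precision a regression table displays, which is how a reported p-value of zero should be read.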

The LR-statistic is used to test for heteroscedasticity. The null hypothesis of this test is homoscedasticity and the p-value is equal to zero. Therefore it can be concluded that there is sufficient statistical evidence for the presence of heteroscedasticity in the model.

To test for endogeneity, the Stata command ivprobit is used. The results give no statistical evidence of endogeneity in the model.

4.3 Logit

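The third model that is applied is the logit model. Just like the probit model, it is an index model:

$$P(\mathit{FULLTIME} = 1 \mid x_1, \ldots, x_9) = \Lambda(x\beta)$$

with $\Lambda(\cdot)$ as in (2.7), $x_1, \ldots, x_9$ as in (4.2) and $x\beta$ as in (4.4).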

Stata gives the following output for the logit regression.

Figure 4: Estimated coefficients in logit regression using ML

The estimated coefficients are higher than those estimated in the probit model and the LPM. The p-values of the individual coefficients as well as the p-value of the LR-statistic lead to the same conclusions as for the probit regression, explained above.
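Part of the difference in magnitude between logit and probit coefficients comes from the different scale of the two distributions: the standard logistic distribution has a larger variance than the standard normal. The sketch below, which is an illustration and not part of the Stata analysis, compares the two CDFs on a grid; the rescaling constant 1.702 is a commonly used approximation and the function names are hypothetical.

```python
from math import exp
from statistics import NormalDist


def logistic_cdf(z):
    """Standard logistic CDF, the link function of the logit model."""
    return 1.0 / (1.0 + exp(-z))


def max_gap(scale, lo=-4.0, hi=4.0, steps=801):
    """Largest absolute difference between Phi(z) and Lambda(scale * z) on a grid."""
    std_normal = NormalDist()
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(abs(std_normal.cdf(z) - logistic_cdf(scale * z)) for z in grid)


gap_unscaled = max_gap(1.0)    # same index fed into both CDFs
gap_scaled = max_gap(1.702)    # logistic index stretched by about 1.7
print("max gap without rescaling:", gap_unscaled)
print("max gap with rescaling:   ", gap_scaled)
```

Once the logistic index is stretched by a factor of roughly 1.7, the two curves nearly coincide, so logit coefficients that are larger than probit coefficients by about that factor describe almost the same response probabilities.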

Due to the very limited possibilities in Stata to test for the specification issues of heteroscedasticity and endogeneity in logit models, there is not much to say about misspecification in this model. However, the literature studied shows that the results of probit models and logit models are generally almost equal.


5 Conclusion

The aim of this study is to effectively describe the sources and consequences of, and solutions to, specification issues that arise in binary response modelling. To do this, data on female labour force participation is used to estimate a linear probability model, a probit model and a logit model. All combinations of models and estimation methods that are used for this study show significant effects of the explanatory variables on the response variable.

The linear probability model is estimated using both OLS and WLS. The signs of the estimated coefficients are in line with expectations. The test for endogeneity shows that there is not sufficient statistical evidence for the endogeneity of the variable SINGLEMOM. This is the case both when OLS and when WLS is applied. Testing does show statistical evidence for the presence of heteroscedasticity in the LPM when OLS is used. As explained in section 2.2.4, heteroscedasticity can result in inefficiency and inconsistency of the estimator, which is very undesirable. The results of the test when applying WLS show that there is not sufficient statistical evidence for heteroscedasticity in this case. Therefore, WLS is preferred to OLS as an estimation method for this specific model.

However, the literature studied in section 2 shows that a linear model like the LPM is not the most appropriate model for a binary response, because of the restriction on the dependent variable. Models that are more appropriate are index models, like probit and logit. The probit model is the second model that is applied to the data. The model is estimated by maximum likelihood and shows the same signs for the estimated coefficients as in the LPM. However, the values of the estimated coefficients are notably higher than in the LPM. This could be explained by the fact that this model is far more appropriate to model binary response data, as explained in section 2, although it should be noted that probit coefficients are measured on the scale of the index xβ rather than on the probability scale, so their magnitudes are not directly comparable with those of the LPM. Testing shows that there is again sufficient statistical evidence for the presence of heteroscedasticity, but not for endogeneity of any of the independent variables.

The estimated coefficients for the logit model are higher than those estimated in the probit model. Because of the specification of the logit model, the possibilities to test for misspecification in this type of model are very limited. Because the insights into the probit model are much more comprehensive, this model is preferred to the logit model.

Because of what the literature studied in section 2 showed and because of the arguments mentioned above, the probit model estimated by maximum likelihood seems like the best method to model these binary response data.

For further investigation it could be of great interest to improve this probit model so that it accounts for the heteroscedasticity. It could also be very interesting to test for the other specification issues that are discussed in section 2.2.4. Furthermore, possibilities to test for specification issues in logit models could be investigated and other index models could also be taken into account.


References

Costa Dias, M. and Shaw, J. (2017), Female Labor Supply, Human Capital and Welfare Reform: Data from the Labour Force Survey, 1993-2006, and the British Household Panel Survey, 1991-2008. [data collection]. Office for National Statistics, University of Essex. Institute for Social and Economic Research, [original data producer(s)]. UK Data Service. SN: 8161, http://doi.org/10.5255/UKDA-SN-8161-1

Davidson, R. and MacKinnon, J.G. (1984), "Convenient specification tests for logit and probit models", Journal of Econometrics, 25, 241-262.

Heckman, J.J. (1979), "Sample Selection Bias as a Specification Error", Econometrica, 47 (1), 153-161.

Heij, C., De Boer, P., Franses, P.H., Kloek, T., Van Dijk, H.K., et al. (2004), Econometric Methods with Applications in Business and Economics, Oxford University Press.

Jaumotte, F. (2003), "Female Labour Force Participation: Past Trends and Main Determinants in OECD Countries", OECD Economics Department Working Papers, No. 376, OECD Publishing.

Lewbel, A., Dong, Y. and Yang, T.T. (2012), "Comparing features of convenient estimators for binary choice models with endogenous regressors", Canadian Journal of Economics/Revue canadienne d'économique, 45 (3), 809-829.

Vlasblom, J.D. and Schippers, J.J. (2004), "Increases in Female Labour Force Participation in Europe: Similarities and Differences", European Journal of Population, 20, 375-392.


A Appendix


Figure 5: Stata output of LPM regression using OLS
