The effect of health on income using self assessed health status


Academic year: 2021


Master Thesis Econometrics

The effect of health on income using self assessed health status

by

Benjamin van Arum

10371400
April 24, 2017
First supervisor: Prof. Dr. J.F. Kiviet
Second supervisor: Dr. J.C.M. van Ophem


Abstract

In this study, the relationship between health and income is investigated, using a self assessed health status variable as a proxy for health. Using a large panel dataset, a dynamic model has been constructed in which endogeneity of regressors has been taken into account. Using Arellano-Bond GMM estimation methods, it has been established that this relationship is dynamic, although previous research mostly used a static model. All model specification tests regarding autocorrelation and instrument validity show satisfactory results; nevertheless, the effect of health on income is not found to be significantly positive.

Keywords: Self Assessed Health Status, Income, Dynamic Model, Arellano-Bond GMM, Panel Data.


Statement of originality

This document is written by Benjamin van Arum who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

Preface

All good things come to an end. I would like to thank professor Jan Kiviet for his support and assistance during the whole process. I feel very lucky to have had a supervisor who always responded quickly to questions and gave very useful feedback. From his lectures about panel data in the beginning until the submission of this thesis, it was really stimulating to see his passion for econometrics, and it has been an honour that he voluntarily supervised this study.


Contents

1 Introduction
2 Previous literature
  2.1 The causal effect of income on health
  2.2 The causal effect of health on income
  2.3 Self assessed health status
    2.3.1 Reliability
  2.4 Added value to the available literature
3 Data
  3.1 Data source
  3.2 Subsample
  3.3 Pointwise missing values
4 Methodology and model choice
  4.1 Methods
    4.1.1 Arellano-Bond
    4.1.2 Tests aiding the Arellano-Bond estimation procedure
    4.1.3 One-step and two-step GMM
    4.1.4 Blundell-Bond
  4.2 Model
    4.2.1 Variables
    4.2.2 Classification
    4.2.3 Elimination of lags
5 Results
  5.1 Main results
  5.2 One-step Arellano-Bond GMM
    5.2.1 Changes of the SAHS variable
    5.2.2 Smaller instrument set
    5.2.3 Less strict constraint on sufficient non-zero observations
6 Conclusion

1 Introduction

The mutual relationship between income and health has been examined many times. A much debated topic is the direction of causality in the health income relationship; see for example Smith (1999). Most previous studies argue that income affects health far more than health affects income. Since this position is much debated, it is interesting to examine the opposite direction: the causal effect of health on income. This causal effect will be estimated with a dynamic model using a panel dataset that has not previously been used for this subject, which can lead to new insights. The study may be of interest to less healthy (or unhealthy) people, because it can be motivating for them if it turns out that better health indeed improves income.

The research question of this study, in its broadest sense, is: to what extent does health play a role in the income that is earned? This main question will be answered using econometric panel techniques on a model in which the dependent variable is gross wage (per hour) from employment and the health indicating variable is a self assessed health status (SAHS). There are many factors that affect working people's earnings, such as education, age and work experience. In as far as they are available, such variables will be included in the model as control variables. The model will be estimated using the LISS panel dataset¹, which consists of 7 waves, from 2007 to 2013. There are many advantages of using panel data instead of cross sectional data. For example, when there is endogeneity in the model, instruments are needed to perform a two stage least squares estimation, and a panel dataset makes it possible to use internal instruments, such as lagged variables. It is clear that examining the effect of health on income comes with endogeneity, since income affects health, but health affects income too². Furthermore, panel data gives the possibility to build a dynamic model instead of a static model. There are many more advantages of using a panel dataset. The panel dataset will be described comprehensively in the data section, which is section 3 of this study.

¹ Which is collected by CentERdata (Tilburg University, The Netherlands) through its MESS project funded by the Netherlands Organization for Scientific Research. More information about the LISS panel can be found at: www.lissdata.nl.

² For example: rich people can afford better (more expensive) medical care, which can lead to better health. On the other hand, healthy people will be absent from work less often.

Section 2 describes previous literature on the subject: the mutual relationship between income and health. In this section, the use of the main health indicating variable, SAHS, will be substantiated and criticised. Section 3 presents the dataset and, in particular, the operations that were carried out to make analysing the data possible³. Section 4 sets out the methodology and the models that will be estimated. Furthermore, the robustness checks will be illustrated. In section 5, the results are shown and discussed. Section 6 provides a conclusion.

2 Previous literature

There is a vast amount of literature on the relationship mentioned in the introduction. Some studies use a static model to analyse the relationship, while others consider a dynamic model the best approach. In this section, previous studies on the health income relationship will be described. What kind of regression methods are used? Did the author(s) use longitudinal or cross sectional data? How did they cope with endogeneity? What kind of model did they use and, in particular, which variables did they include? Those questions will be answered for each study described in this section⁴.

It is clear that investigating this relationship comes with endogeneity, since income affects health, but health affects income at the same time. As already stated in the introduction, most previous studies investigated the causal effect of income on health; therefore, some of those studies will be discussed in the first subsection. The causal effect of health on income, on which this study focuses, will be discussed in the second subsection. The third subsection addresses the reliability of the main explanatory variable, SAHS. The last subsection clarifies the added value of this study to the existing literature.

2.1 The causal effect of income on health

One of the first studies on the causal effect of income on health was done by Ettner (1996), who used cross sectional data and a static model. Her study is frequently cited in later work on this subject. To deal with endogeneity, she carried out analyses using instrumental variable (IV) regressions. The variable income is implemented as the total income earned by the household. She found positive significant effects of income on health indicating variables, i.e. higher income leads to better health. Income shows a positive significant effect in both the OLS and IV regressions on, for example, the health indicating variable SAHS, which will be used in this study too. The instrumental variables that Ettner used are the state unemployment rate, work experience, parental education and spouse characteristics.


Contrary to Ettner, Meer et al. (2003) estimated the effect of wealth on health using panel data of 4 waves (1984, 1989, 1994 and 1999) with a dynamic model. The instrument that they used for wealth is inheritance. The basic model that the authors used can be found in (1), where $H_t$ is an indicator for whether the individual is healthy (healthy = 1) in year $t$; $\Delta W$ is the change in wealth from year $t-5$ to year $t$; $F(\cdot)$ is the cumulative normal distribution; and $X_t$ contains age, household wealth at the beginning of the 5 year period, education, sex, race and region. They find that the wealth variable ($\Delta W$) and the health variable at time $t-5$ ($H_{t-5}$) show an insignificant positive effect on health at time $t$ after IV regression. Their conclusion is that the effect of wealth on health is small and that it is not caused by short-run changes in wealth.

$$\Pr(H_t = 1) = F\left[\beta_0 + \beta_1 \Delta W + \beta_2 H_{t-5} + \beta_3 X_t\right] \tag{1}$$
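To make the structure of (1) concrete, the following minimal Python sketch evaluates the probit probability for one hypothetical individual. The function name, the coefficient values and the covariate values are all assumptions chosen for illustration; they are not estimates from Meer et al. (2003).

```python
import math


def probit_probability(beta0, beta1, beta2, beta3, delta_w, h_lag, x):
    """Pr(H_t = 1) = F(beta0 + beta1*dW + beta2*H_{t-5} + beta3'X_t),
    with F the standard normal cumulative distribution function."""
    index = beta0 + beta1 * delta_w + beta2 * h_lag
    index += sum(b * xk for b, xk in zip(beta3, x))
    # Standard normal CDF expressed through the error function.
    return 0.5 * (1.0 + math.erf(index / math.sqrt(2.0)))


# Hypothetical values, chosen only to illustrate the mechanics.
p = probit_probability(
    beta0=-0.5, beta1=0.02, beta2=1.0, beta3=[0.01, 0.1],
    delta_w=10.0,   # change in wealth (arbitrary units)
    h_lag=1,        # healthy five years earlier
    x=[45.0, 1.0],  # two illustrative controls
)
print(round(p, 3))
```

Because $F$ is nonlinear, the effect of a change in $\Delta W$ on the probability depends on the values of the other regressors, which is why such models are usually summarised through marginal effects.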

Since it is known that income is endogenous because of the simultaneous effect, an alternative approach is to consider a kind of income that is not directly affected by health. Lindahl (2005) tried to estimate the effect of income on health and mortality using lottery prizes as an exogenous source of variation in income⁵. Lindahl found significant positive effects of income on health using a dynamic model. The main dependent variable in his analyses is a standardized index of bad health. First he used OLS and afterwards probit with IV. See (2), where $H_{i81}$ denotes measures of poor health in year 1981 for individual $i$; $\bar{L}_{i81,13}$ is the average lottery prize over the years 1969-1981 for individual $i$; $\bar{L}_{i81,15}$ is the average lottery prize over the years 1967-1981 for individual $i$; $X_{it}$ is a vector of control variables; and $\varepsilon$ and $\nu$ are random error terms.

$$\begin{aligned}
H_{i81} &= \alpha + \beta \log(\bar{L}_{i81,15}) + \theta' X_{it} + \varepsilon_{i81} \\
\log(\bar{L}_{i81,15}) &= \pi_0 + \pi_1 \log(\bar{L}_{i81,13}) + \tau' X_{it} + \nu_{i81}
\end{aligned} \tag{2}$$
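The two equations in (2) have the form of a two stage least squares setup. The sketch below illustrates the mechanics on simulated data, using only the Python standard library. It is a simplified illustration under stated assumptions, not Lindahl's procedure: the control variables are omitted, all parameter values are invented, and the function names are hypothetical.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simple_ols(y, x):
    """Intercept and slope from regressing y on a constant and x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


def two_stage_least_squares(y, endog, instrument):
    # First stage: project the endogenous regressor on the instrument.
    pi0, pi1 = simple_ols(endog, instrument)
    fitted = [pi0 + pi1 * z for z in instrument]
    # Second stage: regress the outcome on the first-stage fitted values.
    return simple_ols(y, fitted)


random.seed(0)
true_beta = -0.4  # assumed effect of log income on the bad-health index
z_list, x_list, y_list = [], [], []
for _ in range(20000):
    z = random.gauss(0, 1)       # instrument, e.g. log average prize (13 years)
    common = random.gauss(0, 1)  # unobserved factor entering both equations
    x = 0.5 + 0.8 * z + common + random.gauss(0, 1)        # endogenous regressor
    y = 1.0 + true_beta * x + common + random.gauss(0, 1)  # outcome
    z_list.append(z)
    x_list.append(x)
    y_list.append(y)

beta_ols = simple_ols(y_list, x_list)[1]
beta_2sls = two_stage_least_squares(y_list, x_list, z_list)[1]
print(f"OLS: {beta_ols:.3f}  2SLS: {beta_2sls:.3f}  true: {true_beta}")
```

In this simulation the OLS slope is biased because the regressor and the error share the unobserved factor, whereas the two stage estimate is close to the assumed value. The sketch only returns coefficients; standard errors taken naively from the second stage would not be valid.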

It may be clear that a wide range of instrumental variables has been used to examine the aforementioned relationship. Fichera and Savage (2015) used rainfall⁶ as an instrumental variable for income to determine the effect of income on health for the population of Tanzania. The authors correctly mention that if health depends directly on rainfall, the latter is not a good instrument. While they are aware of the arguments against using rainfall as an instrument, they also find arguments that substantiate this choice. In the end, the authors support the use of rainfall as an instrument with various tests, which are described in detail in their study.

⁵ This kind of income is exogenous to health since health does not have an effect on winning a lottery prize.

The first stage regression can be found in (3) and the second stage in (4), where $H_{i(j)t}$ represents any of the health outcomes⁷; $X_{i(j)t}$ includes control variables such as age and gender; $D_{jt}$ is the distance from village $j$ to the closest health center; $\mathit{lnincome}_{i(j)t}$ denotes the logarithm of the real income per capita; $\eta_{i(j)}$ and $\theta_{i(j)}$ are the fixed effects; $v_{i(j)t}$ and $\varepsilon_{i(j)t}$ are the error terms; and $z_{i(j)t-1}$ denotes the rainfall in millimetres from $t-2$ until $t-1$. For all variables, $i$ represents the individual living in village $j$ at time $t$.

$$\mathit{lnincome}_{i(j)t} = \alpha z_{i(j)t-1} + \theta_{i(j)} + v_{i(j)t} \tag{3}$$

$$H_{i(j)t} = \beta\, \mathit{lnincome}_{i(j)t} + \gamma X_{i(j)t} + \delta D_{jt} + \eta_{i(j)} + \varepsilon_{i(j)t} \tag{4}$$

Although Fichera and Savage did not find any significant relationship between BMI and income, they do find that an increase of 10% in income leads to a significant decrease of 2% in the number of illnesses, which again confirms that income and health are positively related.

2.2 The causal effect of health on income

In this subsection, the relevant part of the literature in which the causal effect of health on income is investigated will be discussed. Wu (2003) investigated what kind of influence health events have on the economic status of married couples. Wu made use of 2 waves of data, but he considered a static model. He regressed the change of wealth⁸ between those 2 waves on two variables containing husband and wife health shocks⁹ respectively. Control variables that are used in this study are age, race, education, initial health status (based on SAHS) and forced retirement due to poor health. In line with his hypothesis, he found that health shocks have a strongly negative effect on couples' wealth. An interesting finding was that the effect for women was significantly larger than for men. Furthermore, Wu did not use IV regression since he assumes health shocks to be exogenous explanatory variables. Because outliers in the upper tail of the data would make ordinary least squares coefficients imprecise, quantile regressions are used to analyse the data.

⁷ BMI, self reported illness, weight-for-height ratio and height-for-age ratio for children under the age of 6.

⁸ $W_t - W_{t-1}$.

⁹ Health shock in the sense that respondents suddenly suffer from for example cancer,

Another study on the effect of health shocks of either husband or wife on labor supply was done by Charles (1999). He finds, similar to Wu, that a health shock suffered by the wife has significantly more impact on the labor supply of the couple than a health shock suffered by the husband. This could possibly be caused by differences in the behaviour of women and men. Husbands are often the main breadwinner of the couple, so when the wife gets ill, the husband probably reduces his regular working time. Women are more often part time workers, so they may increase their working time if the husband gets ill.

One author who repeatedly did research on the health income relation is Smith. In Smith and Kington (1997), he addresses some of the aforementioned matters arising in these kinds of studies¹⁰. Wu's aforementioned empirical study was motivated by Smith (1998). Smith analysed the same data as Wu but did not use regression analyses; his research was based on analysing patterns of health shocks and wealth of respondents across waves. Wu revised this study by performing regression analyses and taking the difference between men and women into account. Smith concludes that studies in which only the causal effect from wealth (or, to some extent similarly, income) on health is taken into account are really missing the point, because he finds that the reverse causality is at least as important. He draws the same conclusion in his study of the subsequent year (Smith, 1999).

2.3 Self assessed health status

This subsection first describes the self assessed health status and reports how the SAHS is measured. The second part of this subsection pays attention to the reliability of using self assessed health status as health indicating variable. The author of this study is aware of the fact that the SAHS might look too simple to do research with. Therefore, to address this concern, attention is paid to the reliability of SAHS as a proxy for the real healthiness of respondents.

¹⁰ The discussion about the direction of causality for example. Smith's study is not an

The SAHS is in fact constructed very easily. The status is obtained by asking the respondent the following question: "How would you describe your health, generally speaking?" The respondent has to choose between 5 answers: poor, moderate, good, very good and excellent.
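For use in a regression, the five answers have to be turned into an ordinal variable. The following minimal Python sketch shows one way to do this. The direction of the coding (1 = poor up to 5 = excellent) and all names in the code are assumptions made for illustration; other studies may code the scale in the opposite direction.

```python
# Answer categories in ascending order of health (assumed coding direction).
SAHS_LEVELS = ["poor", "moderate", "good", "very good", "excellent"]
SAHS_CODE = {label: rank for rank, label in enumerate(SAHS_LEVELS, start=1)}


def encode_sahs(answer):
    """Map a survey answer to its ordinal code; None if missing or unrecognised."""
    if answer is None:
        return None
    return SAHS_CODE.get(answer.strip().lower())


answers = ["Good", "excellent", " poor ", None, "don't know"]
codes = [encode_sahs(a) for a in answers]
print(codes)  # [3, 5, 1, None, None]
```

Treating the codes as numbers implies equal distances between adjacent categories; entering the categories as separate dummy variables would avoid that assumption.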

2.3.1 Reliability

Since SAHS will play an important role in this study, it is necessary to describe some literature that criticizes or supports the use of SAHS as a proxy for real health status. One could devote a separate study to this issue, but in this study it is assumed that SAHS is a good indicator for the actual healthfulness of respondents. Why this assumption can be made will be argued below. Note that, to make fair comparisons, only studies in which the SAHS is constructed in exactly the same way as in this study are taken into account.

There are several studies that examined the relationship between SAHS and mortality. For example, Idler and Kasl (1995) found that the correlation between SAHS and subsequent mortality is strong and significant. Bopp et al. (2012) found an almost linear relation between SAHS and the hazard rate. Their methodology will be described briefly. People assessing their health status as excellent¹¹ were given hazard rate 1. This rate of 1 serves as the reference for comparisons with the mortality of other self assessed health statuses. So, for example, people with hazard rate 10 have a probability of dying in a specific time period that is 10 times higher than that of people with hazard rate 1. They found that people with SAHS of 3 and 5 (good and poor respectively) have a hazard rate of 1.41 and 2.85 respectively. This indicates that people who assign worse health to themselves are more likely to pass away than people who assign better health to themselves. In this study, Swiss men and women who were 16 years or older in 1977 were followed until 2008. Tamayo-Fonseca et al. (2013) found similar results and, just as for Bopp et al., the results remained significant after adding covariates such as education, age and gender.

Many more studies investigating the relation between SAHS and mortality found similar results. Appels et al. (1996) investigated the same matter with respondents from the Dutch city of Rotterdam, which may be of interest for this particular study since data from the Dutch population will be used for the analyses. Appels et al. found, not surprisingly, approximately the same results as most studies on the SAHS-mortality relation: a low SAHS corresponds to a relatively high mortality rate.

Besides the research on the SAHS-mortality relation, there is some literature on the relationship between SAHS and other health measures. For example, Singh-Manoux et al. (2007) concluded that people with low SAHS are more likely to be absent from work than people with high SAHS. Idler and Kasl found that people with a low SAHS more often experience limitations in daily activities than people with a high SAHS. Miilunpalo et al. (1997) examined the predictive value of SAHS for health measures among the working-age population. They suggest that SAHS is a legitimate indicator for real health status for people of middle age.

There are also studies in which the SAHS is criticised. Crossley and Kennedy (2002) examined the reliability of SAHS as a proxy for real health. To obtain results, they compared respondents' answers to the SAHS question before and after a set of health related questions in a survey. Among their findings are that 28% of the respondents changed their SAHS after the health related questions and 3% of them changed their SAHS by more than 1 category¹². They suggest that this should be kept in mind when comparing different comprehensive surveys with the same SAHS, but that the SAHS, just like the SF-36¹³, is a relatively good indicator for the healthfulness of respondents, although it may be measured with error. The health related surveys from the LISS data are exactly the same for every year in the panel, so this comparison issue does not have to be dealt with in this study. Measurement errors in the SAHS will, however, have to be taken into account.

¹² For example, they assessed 4 before the questions and 2 after the questions, which is a change of 2 categories.

¹³ The SF-36 is a survey that consists of 36 questions regarding health. It is widely used in previous research for health purposes. See http://www.sf-36.org/tools/sf36.shtml for more information.

2.4 Added value to the available literature

Now that the relevant existing literature has been discussed, it may be clear that there is room for a new study on this subject with respect to the SAHS. The specific part on which this study focuses, namely the effect of SAHS on income, is barely covered in existing studies. Most of the studies investigate the effect of income on health, while the opposite will be examined in this study.

Nearly all of the studies lack robustness checks. Since robustness checks will be performed in this study, this adds value to the available literature. Moreover, in most studies it turned out that finding a strong and valid instrument set is very difficult. This will not be a problem in this study, since enough instruments can be constructed from lagged regressors. Furthermore, it will be argued why a dynamic model makes more sense than a static model when examining the aforementioned relationship.

3 Data

This section describes the empirical data and its source. In the first subsection, the data source will be described. The second subsection describes the subsample of the total dataset that is used to perform the analyses. The third subsection clarifies how gaps and missing values are dealt with.

3.1 Data source

The data used in this study are provided by the LISS database (Longitudinal Internet Studies for the Social sciences), which is collected by CentERdata¹⁴ (Tilburg University, The Netherlands). The data are a representative sample of the Dutch population. There are 10 core questionnaires and a background questionnaire. Health, schooling, personality and political views are examples of core questionnaires.

Across all waves, approximately 9000 individuals participated in at least 1 wave. In every participating household, the household head filled in the questionnaires, and sometimes children, partners or other household members filled in the questionnaires too. Eight waves of data, from 2007 to 2015, were obtained for this study. Note that 2007 to 2015 spans 9 years, but there are only 8 waves of data. The reason is that, contrary to the other waves, the last wave is not equidistant: no questionnaires were distributed to the households in 2014. Therefore, the eighth wave is left out of the data in this study¹⁵.

3.2 Subsample

The individuals that filled in the core questionnaires with health and work related questions are the observations that are kept in the sample. But first, it is essential to merge the 7 particular waves of data. Unfortunately, some observations will be lost through the merging process. In total 9510 individuals that participated in at least 1 of the 7 waves have been found.

To answer the research question, only a subset of this total sample is needed. To be more specific, the subset only consists of people that earn income from employment: individuals are kept in the sample if they work at least 2 days a week¹⁶. Furthermore, this study focuses on the population aged 15 to 70, since those individuals are considered to be the working population. Additionally, the minimum wage for a 15 year old is considered to be around 5 euro gross per hour, so this wage is used as lower bound for the hourly wage¹⁷. Taking those constraints into account, 4685 individuals remain in the sample.

¹⁴ For more information visit www.centerdata.nl.

Unfortunately, the regression technique that will be used in this study cannot cope with individuals whose participation in the questionnaires contains gaps¹⁸. For this reason, and taking into account the dynamic relationship, individuals are removed from the sample if they did not participate in at least 3 successive waves. This bound is set because at least 3 successive time points are needed to perform the analyses within an Arellano-Bond setting in which the lagged dependent variable is included as regressor. This will become clearer in the next section. After this elimination, 2002 individuals are left in the sample.

Notice that this bound, which is currently set to 3, can be adapted, but that it depends on the model specification with respect to the dynamics. The reason is that the programme used for analysing the data drops observations when not enough successive waves are available for a specific individual. When it is assumed that deeper lags affect the dependent variable, hourlywage, one could change the limit of successive waves.
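The selection rule on successive waves can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and the example participation patterns are hypothetical, and the code checks nothing more than the stated bound of at least 3 successive waves. How isolated waves outside the longest run are treated is not specified here and is left out of the sketch.

```python
def longest_consecutive_run(waves):
    """Length of the longest run of successive wave numbers."""
    longest = current = 0
    previous = None
    for wave in sorted(set(waves)):
        if previous is not None and wave == previous + 1:
            current += 1
        else:
            current = 1
        longest = max(longest, current)
        previous = wave
    return longest


def keep_individual(waves, min_successive=3):
    """True if the individual took part in enough successive waves."""
    return longest_consecutive_run(waves) >= min_successive


# Hypothetical participation patterns (wave numbers 1 to 7).
participation = {
    "A": [1, 2, 3, 4, 5, 6, 7],  # complete participation
    "B": [1, 3, 4, 5],           # gap after wave 1, waves 3-5 are successive
    "C": [1, 2, 4, 5, 7],        # never three successive waves
}
kept = [pid for pid, waves in participation.items() if keep_individual(waves)]
print(kept)  # ['A', 'B']
```

Raising `min_successive` corresponds to the case described above in which deeper lags are assumed to affect the dependent variable.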

3.3 Pointwise missing values

This subsection describes how pointwise missing values in some of the variables are dealt with. Missing values of the income variable are discussed first.

The variable hourlywage is calculated using information about the monthly gross income and the average actual working hours per month. The data provider developed a procedure for respondents who did not answer the question on their monthly gross income, or who entered "I don't know" or 0. Using this procedure, which is based on net income¹⁹, the missing values are estimated. If an observation still contained a missing value for the income variable after this procedure, the missing value is replaced by its predecessor or successor.

¹⁶ This study investigates the working population; in particular, the individual must at least work part-time.

¹⁷ This information is obtained from www.government.nl.

¹⁸ Meaning that their participation did not consist of only successive years, for example participation in waves 1, 3, 4 and 5 or 1, 2, 3, 5 and 7.

For all other variables, the missing values are replaced by the value of the successor/predecessor of the variable. When there were more than 2 successive missing values in some of the variables, the observation has been dropped.
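The construction of the hourly wage and the replacement of missing values by neighbouring values can be sketched as follows. This is a minimal sketch under assumptions: the function names are hypothetical, the predecessor is tried before the successor (the text does not state an order), and the data provider's own procedure based on net income is not reproduced.

```python
def hourly_wage(monthly_gross, monthly_hours):
    """Gross hourly wage; None when an input is missing or hours are not positive."""
    if monthly_gross is None or monthly_hours is None or monthly_hours <= 0:
        return None
    return monthly_gross / monthly_hours


def fill_from_neighbours(series, max_gap=2):
    """Fill None entries from the predecessor, otherwise from the successor.

    Returns None (the individual is dropped) when more than `max_gap`
    successive values are missing or when nothing is observed at all.
    """
    run = 0
    for value in series:
        run = run + 1 if value is None else 0
        if run > max_gap:
            return None
    if all(value is None for value in series):
        return None
    filled = list(series)
    for i in range(1, len(filled)):           # predecessor pass
        if filled[i] is None:
            filled[i] = filled[i - 1]
    for i in range(len(filled) - 2, -1, -1):  # successor pass for leading gaps
        if filled[i] is None:
            filled[i] = filled[i + 1]
    return filled


print(hourly_wage(3200.0, 160.0))                           # 20.0
print(fill_from_neighbours([12.5, None, None, 14.0]))       # [12.5, 12.5, 12.5, 14.0]
print(fill_from_neighbours([None, 10.0, 11.0]))             # [10.0, 10.0, 11.0]
print(fill_from_neighbours([9.0, None, None, None, 10.0]))  # None
```

Carrying values over from neighbouring waves reduces the variation in the first differenced data, which is worth keeping in mind when the differenced model is estimated.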

4 Methodology and model choice

This section describes the model and the methodology used to obtain the results. In the first subsection, the methodology will be explained. This first subsection provides some information about dynamics, Arellano-Bond, Blundell-Bond and several tests that have been used in this study. The second subsection contains a model description, in which the individual fixed effects within the model as well as the model equation(s) will be described.

4.1 Methods

This subsection provides information about the methodology that is used for the estimation of a dynamic model. The first part explains some advantages of using panel data as well as some basic theory on how Arellano and Bond (1991) applied GMM using internal instruments. The second part describes all tests that are used to indicate the validity of the models. Afterwards, one-step and two-step GMM are explained, and finally an explanation of the Blundell-Bond estimation will be given.

4.1.1 Arellano-Bond

One of the great advantages of panel data is the ability to deal with unobserved individual fixed effects (Roodman, 2009a). Those individual fixed effects, see $\eta_i$ in (5), are very hard to estimate. To deal with them, one could take first differences of the model so that $\eta_i$ cancels out. This is illustrated in (5) and (6) with a very basic dynamic model, where $y_{i,t}$ is the dependent variable, $x_{i,t}$ contains the exogenous regressors, $w_{i,t}$ includes predetermined regressors, $v_{i,t}$ includes endogenous regressors, $\tau_t$ are the time specific effects and $\varepsilon_{i,t}$ are the error terms.

$$\begin{aligned}
y_{i,t} ={}& \alpha y_{i,t-1} + x_{i,t}'\beta_1 + x_{i,t-1}'\beta_2 + w_{i,t}'\gamma_1 + w_{i,t-1}'\gamma_2 \\
& + v_{i,t}'\delta_1 + v_{i,t-1}'\delta_2 + \eta_i + \tau_t + \varepsilon_{i,t}
\end{aligned} \tag{5}$$

$$\begin{aligned}
\Delta y_{i,t} ={}& \alpha \Delta y_{i,t-1} + \Delta x_{i,t}'\beta_1 + \Delta x_{i,t-1}'\beta_2 + \Delta w_{i,t}'\gamma_1 + \Delta w_{i,t-1}'\gamma_2 \\
& + \Delta v_{i,t}'\delta_1 + \Delta v_{i,t-1}'\delta_2 + \Delta\tau_t + \Delta\varepsilon_{i,t}
\end{aligned} \tag{6}$$

A problem that arises here is that in (6), $\Delta y_{i,t-1}$ is correlated with $\Delta\varepsilon_{i,t}$, which calls for an instrumental variable approach. In the final model there will be more factors causing endogeneity, so that instrumental variable regression techniques are clearly needed. Endogeneity could additionally be caused by simultaneity, omitted variables and measurement errors.
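Both points, the cancelling of the fixed effect and the correlation between the lagged differenced dependent variable and the differenced error, can be checked in a small simulation. The sketch below uses a stripped-down version of (5) without any regressors other than the lagged dependent variable. The parameter values, the sample size and the function names are assumptions made for illustration only.

```python
import random


def simulate_panel(n, periods, alpha, seed=0):
    """Simulate y_it = alpha * y_i,t-1 + eta_i + eps_it for n individuals."""
    rng = random.Random(seed)
    panel = []
    for _ in range(n):
        eta = rng.gauss(0, 1)       # individual fixed effect
        y_prev = eta / (1 - alpha)  # start at the individual's long-run mean
        ys, eps = [], []
        for _ in range(periods):
            e = rng.gauss(0, 1)
            y_prev = alpha * y_prev + eta + e
            ys.append(y_prev)
            eps.append(e)
        panel.append((ys, eps))
    return panel


def covariance(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


alpha = 0.5
panel = simulate_panel(n=5000, periods=4, alpha=alpha)
t = 3  # last simulated period (0-based index)
d_y = [ys[t] - ys[t - 1] for ys, _ in panel]
d_y_lag = [ys[t - 1] - ys[t - 2] for ys, _ in panel]
d_eps = [eps[t] - eps[t - 1] for _, eps in panel]

# After differencing, eta_i has dropped out: d_y - alpha * d_y_lag equals d_eps.
max_gap = max(abs(dy - alpha * dl - de) for dy, dl, de in zip(d_y, d_y_lag, d_eps))
# The lagged difference of y still contains eps_{i,t-1}, as does d_eps.
cov = covariance(d_y_lag, d_eps)
print(f"largest deviation: {max_gap:.2e}, covariance: {cov:.3f}")
```

With unit error variance the covariance is close to minus one, because $\Delta y_{i,t-1}$ contains $\varepsilon_{i,t-1}$ with a positive sign and $\Delta\varepsilon_{i,t}$ contains it with a negative sign.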

A widely used methodology is to use internal lags of the regressors as instrumental variables. The internal instrument GMM technique called Arellano-Bond estimation is applied in this study. One very important assumption that has to be made is that there is no serial correlation in the error terms. Although there is plenty of literature that explains the Arellano-Bond estimator thoroughly, the Arellano-Bond approach will be explained briefly here, since this procedure is an important part of this study. For a more comprehensive explanation of Arellano-Bond estimators, the reader could consult the papers referred to in this section.

As already mentioned, internal lags of the regressors and the dependent variable can be used as instruments. Those instrumental variables need to be correlated with the regressors, but they may not be correlated with the error terms. Following the Arellano-Bond estimation procedure, the regressors are categorized as endogenous, predetermined or exogenous. In (7), (8) and (9) below, the moment conditions of the endogenous regressors $v_{i,s}$, the predetermined regressors $w_{i,s}$ and the exogenous regressors $x_{i,s}$ from (5) can be found.

$$E[v_{i,s}\,\Delta\varepsilon_{i,t}] = 0 \quad \text{for } t = 2, \ldots, T;\ s \le t-2 \tag{7}$$

$$E[w_{i,s}\,\Delta\varepsilon_{i,t}] = 0 \quad \text{for } t = 2, \ldots, T;\ s \le t-1 \tag{8}$$

$$E[x_{i,s}\,\Delta\varepsilon_{i,t}] = 0 \quad \text{for } t = 2, \ldots, T;\ \forall s \tag{9}$$

Internal instruments constructed from endogenous variables can only be used from lag 2 onwards, whereas predetermined variables are applicable as instruments from lag 1 onwards. For exogenous variables, the current as well as the lagged values are appropriate instruments. An instrument matrix constructed from the endogenous, predetermined and exogenous regressors that satisfy (7), (8) and (9) respectively satisfies the moment condition for the instruments by construction. The equations above show the orthogonality conditions²⁰ of the endogenous, predetermined and exogenous regressors respectively in a general Arellano-Bond setting.

²⁰ State uncorrelatedness between the particular internal instruments constructed from
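The lag rules implied by the three moment conditions can be written down as a small helper that lists, for the differenced equation at time t, the periods whose level of a regressor qualifies as an instrument. The function name and the convention that periods are numbered from 0 are assumptions made for this sketch.

```python
def valid_instrument_periods(t, classification, last_period, first_period=0):
    """Periods s for which the level of a regressor may serve as instrument
    for the first differenced equation at time t."""
    if classification == "endogenous":
        upper = t - 2        # s <= t - 2
    elif classification == "predetermined":
        upper = t - 1        # s <= t - 1
    elif classification == "exogenous":
        upper = last_period  # all s
    else:
        raise ValueError(f"unknown classification: {classification}")
    return list(range(first_period, upper + 1))


# Example: 7 waves numbered 0 to 6, differenced equation at t = 4.
for kind in ("endogenous", "predetermined", "exogenous"):
    print(kind, valid_instrument_periods(4, kind, last_period=6))
# endogenous [0, 1, 2]
# predetermined [0, 1, 2, 3]
# exogenous [0, 1, 2, 3, 4, 5, 6]
```

The number of available instruments grows with t for endogenous and predetermined regressors, which is one reason why the instrument set can become large in longer panels.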

4.1.2 Tests aiding the Arellano-Bond estimation procedure

Consider the transformed model equation below. In this study, several tests will be performed serving as indicators for the validity of this initial model. To illustrate those tests clearly, the equation below is used as auxiliary equation on which the tests would be applied.

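$$\Delta y_{i,t} = \alpha \Delta y_{i,t-1} + \Delta x_{i,t}'\beta_1 + \Delta v_{i,t}'\delta_1 + \Delta v_{i,t-1}'\delta_2 + \Delta\tau_t + \Delta\varepsilon_{i,t}$$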

In this model, all²¹ unlagged regressors are assumed to be endogenous. Additionally, all regressors, except age, age² and years of work experience, are included in the model with their current and first lagged realizations. The three variables just mentioned are only included with their contemporaneous realization, since otherwise extreme multicollinearity would occur in the model. Moreover, from the dependent variable, only the lagged realization is included as regressor. Because of the above, $\beta_2$, $\gamma_1$ and $\gamma_2$ are set to zero (see (6) in section 4.1.1).

The tests to start with are the tests for autocorrelation in the error terms. First, the transformed error terms exhibit first order autocorrelation. This autocorrelation must be present since both $\Delta\varepsilon_{i,t}$ and $\Delta\varepsilon_{i,t-1}$ share the term $\varepsilon_{i,t-1}$. Although the presence of first order autocorrelation in the transformed error terms holds by construction, it is verified by considering the p-value of the autocorrelation test. The null hypothesis of the test is: there is no first order autocorrelation in the (transformed) error terms. If this p-value is below 0.05, the null is rejected and first order autocorrelation in the error terms is assumed to be present in the model. This limit is chosen because a low significance level considerably decreases the chance of a type 1 error.
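The reason why first order autocorrelation holds by construction, while second order autocorrelation should be absent, can be made explicit in a short derivation. It assumes, as an illustration, that the untransformed errors $\varepsilon_{i,t}$ have mean zero, a constant variance $\sigma_\varepsilon^2$ and no serial correlation.

```latex
\begin{align*}
\operatorname{Cov}(\Delta\varepsilon_{i,t}, \Delta\varepsilon_{i,t-1})
  &= E\big[(\varepsilon_{i,t}-\varepsilon_{i,t-1})(\varepsilon_{i,t-1}-\varepsilon_{i,t-2})\big]
   = -E\big[\varepsilon_{i,t-1}^2\big] = -\sigma_\varepsilon^2, \\
\operatorname{Cov}(\Delta\varepsilon_{i,t}, \Delta\varepsilon_{i,t-2})
  &= E\big[(\varepsilon_{i,t}-\varepsilon_{i,t-1})(\varepsilon_{i,t-2}-\varepsilon_{i,t-3})\big] = 0.
\end{align*}
```

Since the variance of $\Delta\varepsilon_{i,t}$ equals $2\sigma_\varepsilon^2$ under these assumptions, the implied first order autocorrelation of the transformed errors is $-1/2$. A nonzero second order covariance can only arise when the untransformed errors are themselves serially correlated.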

In contrast to first-order autocorrelation, and as already stated in 4.1.1, second-order autocorrelation in the transformed error terms is unwanted, since it could be a sign of wrongfully omitted variables. The null hypothesis of this test is that there is no second-order autocorrelation in the (transformed) error terms. Absence of second-order autocorrelation is assumed if the p-value of this test is larger than 0.5. For the initial model, the model equation above, the p-value of this test is very high (between 0.8 and 0.9), so second-order autocorrelation is taken to be absent.

²¹ Except the lagged dependent variable and the regressors that are obviously exogenous, such as age, age2 and years of work experience ($x_{i,t}$).

The assumption that there is no second-order autocorrelation in the error terms is a very important one. The probability of a type 2 error, that is, of accepting the null while second-order autocorrelation is present, is therefore reduced by requiring a high p-value before the null is accepted, as explained in the previous paragraph.

Furthermore, the validity of the initial instrument set needs to be tested. The instrument set must be valid as a whole, and so must each subset of instruments separately. This is tested with the Hansen test of overidentifying restrictions and the incremental Hansen test respectively. The null hypothesis of the first test is validity of the whole instrument set; if its p-value is higher than 0.5, the instrument set is assumed to be valid. The null hypothesis of the incremental Hansen test is validity of a particular subset of instruments²², and the test is carried out for each regressor in turn. If the p-value is higher than 0.2 for all regressors, all instrument subsets are assumed to be valid separately as well. The limits for these p-values are motivated as in the previous two paragraphs: the author wants a model that is not too restrictive²³, but also wishes to limit the chance of assuming instruments to be valid when they are not.
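The decision rules of this subsection can be collected in one small helper. This is a minimal sketch: the function name `check_initial_model` is hypothetical, the AR(1), AR(2) and Hansen p-values in the example are the ones reported for the initial model in section 4.2.2, and the two incremental Hansen p-values are made up purely for illustration.

```python
def check_initial_model(p_ar1, p_ar2, p_hansen, p_incremental):
    """Apply the p-value limits used in this study to the initial model."""
    return {
        # First-order autocorrelation must be present in the differenced errors.
        "AR(1) present": p_ar1 < 0.05,
        # Second-order autocorrelation must be absent.
        "AR(2) absent": p_ar2 > 0.5,
        # Joint validity of the whole instrument set.
        "Hansen": p_hansen > 0.5,
        # Validity of the instrument subset of every regressor.
        "incremental Hansen": all(p > 0.2 for p in p_incremental),
    }


# The last argument is hypothetical; the thesis does not list these values here.
checks = check_initial_model(0.005, 0.831, 0.917, [0.35, 0.62])
model_passes = all(checks.values())
print(checks, model_passes)
```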

If the initial model (6) passes all the tests described above, adjusting the model can start. From now on, the real nature²⁴ of the regressors is examined using the incremental Hansen test. Since all²⁵ current regressors were assumed to be endogenous (see (7)), the first action is to add the first lags of a particular regressor as instruments to the instrument set, starting with the regressor for which the p-value of the incremental Hansen test is highest. All other instruments, besides those added to examine the nature of the regressor in question, are meanwhile assumed to be valid. If the p-value of the incremental Hansen test for the added instruments is higher than 0.3, these instruments are assumed to be valid and the regressor is presumed to be predetermined rather than endogenous. The limit of 0.3 is more cautious than the initial requirement described in the previous paragraph, namely a p-value of the incremental Hansen test above 0.2. This is intentional, since wrongly assuming that a regressor is predetermined instead of endogenous (or exogenous instead of predetermined) is a mistake that should not be made.

²² The lagged (and contemporaneous) realizations of one specific regressor.

²³ Meaning that the limits of the p-values should not be set too high, otherwise too few instruments will pass those tests.

²⁴ Endogenous, predetermined or exogenous.

²⁵

Secondly, the presumed predetermined regressors (see (8)) can be investigated, since they could be exogenous instead of predetermined. To check this, the contemporaneous realizations of a predetermined regressor²⁶ are added as instruments. If the p-value of the incremental Hansen test turns out to be higher than 0.3, the regressor is presumed exogenous (see (9)).

Finally, after all steps described above, lagged regressors are removed from the model if the corresponding regression coefficients are highly insignificant. This methodology yields a valid dynamic model that is estimated with the Arellano-Bond GMM procedure. For more detailed information about the tests described above, see Roodman (2009a).

4.1.3 One-step and two-step GMM

The GMM estimation described above can be carried out in two variants: with a one-step weighting matrix, or with a two-step weighting matrix that is built from the residuals of the first step. General technical details can be found in the GMM chapter of Cameron and Trivedi (2005). The aspects of the two procedures that matter for this study are explained below.
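For reference, the textbook form of the two variants can be sketched as follows. The notation is generic and not taken from this thesis: $Z_i$ is assumed to be the instrument matrix of individual $i$, $\Delta X_i$ stacks all differenced regressors, $\theta$ collects all coefficients, and $H$ is the tridiagonal matrix commonly used for the one-step weighting in the Arellano-Bond setting.

```latex
\begin{align*}
\hat\theta(W) &= \Big[\Big(\sum_i \Delta X_i' Z_i\Big) W \Big(\sum_i Z_i' \Delta X_i\Big)\Big]^{-1}
                 \Big(\sum_i \Delta X_i' Z_i\Big) W \Big(\sum_i Z_i' \Delta y_i\Big), \\
W_1 &= \Big(\sum_i Z_i' H Z_i\Big)^{-1}, \qquad
H = \begin{pmatrix} 2 & -1 & & \\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{pmatrix}, \\
W_2 &= \Big(\sum_i Z_i' \Delta\hat\varepsilon_i \Delta\hat\varepsilon_i' Z_i\Big)^{-1}.
\end{align*}
```

Here $\Delta\hat\varepsilon_i$ denotes the differenced residuals of the one-step estimate $\hat\theta(W_1)$, so the two-step estimator is $\hat\theta(W_2)$.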

²⁶ Starting again with the predetermined regressors of which the instrument set has the highest

Within this study, the standard errors of the one-step GMM estimator are made robust to heteroskedasticity, which yields consistent standard error estimates. The two-step GMM estimator is asymptotically efficient in theory, but in practice its estimated standard errors are often biased downwards. Windmeijer (2005) therefore proposed a finite-sample correction for the two-step standard errors; with this correction, two-step GMM showed lower bias and standard errors than one-step GMM. Kiviet et al. (2017), however, found that the finite-sample bias is higher for two-step GMM estimation. Therefore, both the one-step and the two-step GMM estimators are computed in this study, so that the estimation results can be compared.

4.1.4 Blundell-Bond

Next to Arellano-Bond GMM estimation, Blundell and Bond (1998) proposed a GMM estimator that makes additional assumptions compared to Arellano-Bond GMM estimation. Recall (5) below, a basic dynamic model.

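\[
y_{i,t} = \alpha y_{i,t-1} + x_{i,t}'\beta_1 + x_{i,t-1}'\beta_2 + w_{i,t}'\gamma_1 + w_{i,t-1}'\gamma_2 + v_{i,t}'\delta_1 + v_{i,t-1}'\delta_2 + \eta_i + \tau_t + \varepsilon_{i,t}
\]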

Contrary to Arellano-Bond, the model equation is in levels and the instruments are first-differenced lagged regressors. The additional assumption is that these transformed instruments are uncorrelated with the individual fixed effects. If Arellano-Bond passes all tests described in 4.1.2, Blundell-Bond two-step GMM estimation is performed. If this estimation passes the test of the additional assumptions, its results are examined. Otherwise, the extra instruments are assumed to be invalid, so that inspecting these outcomes would not make sense.

4.2 Model

This subsection contains a comprehensive model description. It includes the categorization of the regressors with respect to their nature, a variable description, the regressors that are added to the model and more. The first part points out which variables are included in the model. In the second part, the variables are categorized, and in the third part it is made clear which lagged regressors are removed from the model and which are not.

4.2.1 Variables

This section makes clear which regressors enter the model and why. Furthermore, all variables are described statistically, and it is explained how they are constructed and measured. All variables that are part of the model are explained in Table 5 of the Appendix.

The individual fixed effects $\eta_i$ from (6) are, as already noted, constant over time. Choices regarding adding regressors to the model are partially based on the number of changes of a variable. Since first differences are taken²⁷, a variable needs to show enough changes over time; otherwise the estimated regression coefficients would be based on very few observations, because the first difference of a variable that does not change is 0.
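Counting the changes of a variable amounts to counting its non-zero first differences within individuals. The sketch below assumes a hypothetical data layout (a dictionary from individual to a wave-ordered list of values, with `None` for a missing wave); the names `count_changes`, `has_enough_variation` and `toy_panel` are illustrative only. The limit of 310 is the one mentioned in this section.

```python
MIN_CHANGES = 310  # limit on the number of changes mentioned in this section


def count_changes(panel):
    """Count non-zero first differences over all individuals.

    panel: dict mapping an individual to a list of values ordered by wave,
    with None marking a missing wave (hypothetical layout).
    """
    changes = 0
    for values in panel.values():
        for previous, current in zip(values, values[1:]):
            if previous is None or current is None:
                continue  # a first difference needs two consecutive observations
            if current != previous:
                changes += 1
    return changes


def has_enough_variation(panel, minimum=MIN_CHANGES):
    """True if the variable changes often enough to be kept in the model."""
    return count_changes(panel) >= minimum


toy_panel = {
    "person_a": [0, 0, 1, 1, 0],
    "person_b": [1, 1, 1, 1, 1],
    "person_c": [0, None, 1, 0, 0],
}
print(count_changes(toy_panel), has_enough_variation(toy_panel))
```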

Variables such as schooling, the type of dwelling that the household inhabits, job characteristics and very specific diseases do not change much over time. Whether a variable is added to the model is based on its number of changes and on logical reasoning. Multicollinearity between the variables has been taken into account as well: some health or work characteristics variables show nearly the same patterns, so that including them jointly would cause multicollinearity.

The total number of changes per variable can be found in Table 6 of the Appendix, together with descriptive statistics. Variables that are not included in the model, because they do not change enough over time or cause multicollinearity, are not added to this table²⁸. In the initial model, the included variables must show enough²⁹ changes over time. Arguments that substantiate this are given below.

²⁷ With Arellano-Bond in the model equation and with Blundell-Bond in the instruments.

²⁸ The variables that are not added to the model because of multicollinearity or too few changes are reported in this section.

²⁹ The regressor workmentaldemanding is the variable that showed the fewest changes over time among the regressors that are included in the model. All variables omitted due to this restriction showed fewer than 310 changes over time.

Firstly, none of the variables omitted from the model as a consequence of this restriction is expected to have much explanatory power for the dependent variable. Although this is expected, it is also tested: a regression is performed in which the initially omitted variables are added to the model. The omitted variables are mostly health dummy variables that are 1 if an individual suffers from a specific disease and 0 otherwise.

Secondly, it is likely that those omitted health dummy variables cause multicollinearity when they are included in the model in addition to the other health variables. Lastly, the few non-zero observations (fewer than 310, whilst there could be at most 5399 observations) would lead to estimated coefficients that are based on too little information. One could argue that 310 non-zero observations out of 5399 is still relatively few. A higher limit (say 400), however, would lead to a model in which variables are omitted that are initially thought to have explanatory power for the dependent variable (for example workoutsideregularhours).

This paragraph describes some of the health characteristic variables that are not included in the model for the above mentioned reasons. The variable descriptions can be found in Table 6 in the Appendix as well. The health complaint variables that are included are joint, flu, headache and fatigue, which showed enough variation over time. Respondents were asked whether they frequently suffer from the mentioned complaints and answered with yes or no. Complaints that showed too little variation were breathing problems and heart problems. The variable regarding sleeping problems shows enough variation, but including it would cause multicollinearity because it behaves very much like the fatigue variable. Furthermore, none of the very specific diseases can be included in the model. Some of those diseases are Parkinson's disease, diabetes, Alzheimer's disease, (benign) cancer, arthritis, asthma, lung problems and gastric or duodenal ulcers. The variation within those diseases is very limited.

In the text below, the work characteristics variables are discussed. The variables that are included are workfitseducation, workfitsknowledge, workdirty, workdangerous, workphysicaldemanding, workmentaldemanding, workovertime, timepressure and workoutsideregularhours. All of these are dummy variables. A short description of them can be found in Table 5 of the Appendix. The variables show no multicollinearity and, just like the included health variables, enough variation over time. A great number of variables was not included in the model because of multicollinearity. Variables about working conditions often behave collinearly with workmentaldemanding or workphysicaldemanding. An example is a dummy variable that is 1 if the job of the respondent involves lifting heavy objects and 0 otherwise: it is strongly collinear with the included variable workphysicaldemanding.

Other variables that are included are SAHS, numberhh, yearsexperience, age, age2, physicalactivityheavy, physicalactivitymoderate, smoke and workhinder. The variable age2 is added since the relationship between income and age is assumed to be non-linear. In the next subsection, the regressors are classified as endogenous, predetermined or exogenous. Using this classification, a regression equation is constructed.

4.2.2 Classification

In the text below, the regression equation is discussed. First, the variables that have been discussed in section 4.2.1 are categorized. The regression equation can be found in (10). As already described in 4.1.2, the initial model (10) contains all³⁰ variables with their lagged and current realizations. The variable of major interest, SAHS, has been emphasized in (10). All unlagged variables are initially assumed to be endogenous. The categorization procedure then transforms model (10) into model (5), in which all variables are categorized as endogenous, predetermined or exogenous. The variables that take part in the categorization are all variables from Table 5 in the Appendix.

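\[
y_{i,t} = \alpha_0 y_{i,t-1} + \beta_0\,\mathrm{SAHS}_{i,t} + \beta_1\,\mathrm{SAHS}_{i,t-1} + \gamma_0' r_{i,t} + \gamma_1' r_{i,t-1} + \mu_i + \pi_t + \varepsilon_{i,t} \tag{10}
\]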

³⁰ Except for age, age2 and yearsexperience. Adding these variables lagged will cause extreme multicollinearity.

The model passes³¹ all tests regarding autocorrelation and the validity of the instruments, so the categorization technique described in 4.1.2 can be used. The p-values of the incremental Hansen tests can be found in columns 3 and 4 of Table 1. P-value 1 is used as indicator for possibly predetermined variables³² and p-value 2 as indicator for possibly exogenous variables. The second column states which regressor is tested and the first column gives the nature that is finally established for the regressor. The number in brackets behind each p-value is the number of degrees of freedom.

Table 1: Arellano-Bond 1-step GMM estimation output

| category | regressor | p-value 1 | p-value 2 |
|---|---|---|---|
| exogenous | physicalactivityheavy | 0.541 [5] | 0.786 [5] |
| | physicalactivitymoderate | 0.755 [5] | 0.842 [5] |
| | flu | 0.411 [5] | 0.741 [5] |
| | workfitseducation | 0.649 [5] | 0.537 [5] |
| | workdirty | 0.665 [5] | 0.598 [5] |
| | workdangerous | 0.690 [5] | 0.870 [5] |
| | workweekend | 0.901 [5] | 0.409 [5] |
| | workevening | 0.365 [5] | 0.405 [5] |
| predetermined | SAHS | 0.710 [10] | 0.150 [10] |
| | workhinder | 0.707 [5] | 0.273 [5] |
| | smoke | 0.800 [5] | 0.265 [5] |
| | joint | 0.432 [5] | 0.293 [5] |
| | fatigue | 0.777 [5] | 0.046 [5] |
| | workphysicaldemanding | 0.668 [5] | 0.267 [5] |
| | workovertime | 0.711 [5] | 0.088 [5] |
| | timepressure | 0.743 [5] | 0.225 [5] |
| | workoutsideregularhours | 0.958 [5] | 0.295 [5] |
| endogenous | numberhh | 0.295 [5] | - |
| | workfitsknowledge | 0.096 [5] | - |
| | workmentaldemanding | 0.195 [5] | - |

³¹ P-values: AR(1): 0.005, AR(2): 0.831, Hansen: 0.917.

³² As already has been stated: a p-value higher than 0.3 will interpret that a variable is

Note that the variables age, age2 and yearsexperience are not included in the table. Those regressors were assumed exogenous and passed the incremental Hansen test during all stages of the classification procedure.
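The link between the two p-value columns of Table 1 and the resulting category can be written as a simple rule. This is a simplified sketch: the function name `classify` is hypothetical, and the rule ignores the sequential nature of the procedure in section 4.1.2, where regressors are examined one at a time while the other instruments are held fixed. The p-values in the example are taken from Table 1.

```python
THRESHOLD = 0.3  # limit for the incremental Hansen test used in the classification


def classify(p_predetermined, p_exogenous=None, threshold=THRESHOLD):
    """Map the two incremental Hansen p-values of a regressor to its category."""
    if p_predetermined <= threshold:
        return "endogenous"      # lag-1 instruments are not accepted
    if p_exogenous is None or p_exogenous <= threshold:
        return "predetermined"   # contemporaneous instruments are not accepted
    return "exogenous"


# A few rows of Table 1: regressor -> (p-value 1, p-value 2).
table_1_rows = {
    "flu": (0.411, 0.741),
    "workevening": (0.365, 0.405),
    "SAHS": (0.710, 0.150),
    "fatigue": (0.777, 0.046),
    "numberhh": (0.295, None),
}
for name, (p1, p2) in table_1_rows.items():
    print(name, classify(p1, p2))
```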

Furthermore, the number of degrees of freedom is 5 for all tests in which an initially assumed endogenous dummy variable is tested for predeterminedness. This is illustrated using (11) below, which shows the instrument set that results from one predetermined regressor when at most 3 lags are used as instruments. In (11), $w_{i,t}$ is an internal instrument constructed from one particular predetermined regressor, where $i$ denotes the individual and $t$ the time point of the instrument. The total number of instruments is equal to the number of columns, which is 14.

\[
\left(\begin{array}{cccccccccccccc}
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
w_{i2} & w_{i1} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & w_{i3} & w_{i2} & w_{i1} & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & w_{i4} & w_{i3} & w_{i2} & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & w_{i5} & w_{i4} & w_{i3} & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & w_{i6} & w_{i5} & w_{i4}
\end{array}\right) \tag{11}
\]

The instruments that are tested in the incremental Hansen test are displayed in columns 1, 3, 6, 9 and 12. Therefore, the number of degrees of freedom is 5. If the null hypothesis of the incremental Hansen test is not rejected, the regressor is assumed to be predetermined and an instrument matrix such as (11) is constructed from it, so that $E[z_{i,t}'\Delta\varepsilon_{i,t}] = 0$ holds for $t = 2, ..., 7$, in which $z_{i,t}$ is the matrix illustrated in (11).
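The layout of (11) can be reproduced with a short sketch. It rests on a reading of (11) that is an assumption here: rows correspond to the equations for t = 2, ..., 7, the row for t = 2 carries no instruments, and every later row uses lags 1 to 3 as far as they exist. The function name `build_instrument_matrix` and the numeric values of `w` are illustrative only.

```python
def build_instrument_matrix(w, equation_periods=range(2, 8), first_usable=3,
                            min_lag=1, max_lag=3, first_period=1):
    """Block-diagonal instrument matrix for one regressor of one individual.

    w maps a period to the realization of the regressor in that period.
    Returns the matrix and the 1-based column numbers of the lag-1 instruments.
    """
    # Periods whose realizations serve as instruments for each equation.
    blocks = []
    for t in equation_periods:
        if t < first_usable:
            blocks.append([])  # no instruments available: row of zeros
        else:
            blocks.append([t - lag for lag in range(min_lag, max_lag + 1)
                           if t - lag >= first_period])

    n_cols = sum(len(block) for block in blocks)
    matrix, lag1_columns, offset = [], [], 0
    for t, block in zip(equation_periods, blocks):
        row = [0.0] * n_cols
        for j, s in enumerate(block):
            row[offset + j] = w[s]
            if t - s == 1:
                lag1_columns.append(offset + j + 1)  # 1-based column index
        offset += len(block)
        matrix.append(row)
    return matrix, lag1_columns


w = {s: float(10 * s) for s in range(1, 7)}  # made-up realizations w_i1..w_i6
Z, tested_columns = build_instrument_matrix(w)
print(len(Z), len(Z[0]), tested_columns)
```

With these settings the sketch gives 6 rows, 14 columns and lag-1 instruments in columns 1, 3, 6, 9 and 12, in line with the five degrees of freedom mentioned above.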

Now that the regressors have been classified, the Arellano-Bond GMM estimation is performed with a better³³ instrument set. For endogenous regressors, lags 2 and 3 are used as instruments. For predetermined regressors, the first, second and third lag are used (see (11)), whereas for exogenous regressors the contemporaneous realization and lags 1, 2 and 3 are used. Lag 3 has been chosen as limit since using too many instruments weakens the Hansen test of joint validity (Roodman, 2009b).

³³ Better in the sense that there are more strong instruments compared to the situation where

4.2.3 Elimination of lags

If the coefficient $\beta_2$ of the first lag of a particular regressor is highly insignificant and is not approximately equal to $-\alpha\beta_1$, in which $\beta_1$ is the coefficient of the contemporaneous realization of that regressor and $\alpha$ is the coefficient of the lagged dependent variable, then the lag of the regressor is removed from the model. Why this is done is explained in the next paragraph. At the end of this section, it is stated which lagged regressors are included in the model.

Notice that when the coefficient of a lagged regressor equals $-\alpha\beta_1$, as stated in the previous paragraph, the regressor has only a direct effect on the dependent variable. Consider the very basic regression equation in (12), in which $x_{i,t}$ is one particular regressor (not a vector).

\[
y_{i,t} = \alpha y_{i,t-1} + \beta_1 x_{i,t} + \beta_2 x_{i,t-1} + \eta_i + \tau_t + \varepsilon_{i,t} \tag{12}
\]

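Now, the total effect of $x_{i,t}$ on $y_{i,t}$ is equal to

\[
(\beta_1 + \alpha\beta_1 L + \alpha^2\beta_1 L^2 + \alpha^3\beta_1 L^3 + \dots)\,x_{i,t} + (\beta_2 L + \alpha\beta_2 L^2 + \alpha^2\beta_2 L^3 + \alpha^3\beta_2 L^4 + \dots)\,x_{i,t},
\]

since $x_{i,t-1}$ is the one period lagged realization of $x_{i,t}$, which can be rewritten as $Lx_{i,t}$. The geometric series formula can now be used if $|\alpha| < 1$. The total effect of $x_{i,t}$ on $y_{i,t}$ is then equal to

\[
\frac{\beta_1}{1-\alpha L} + \frac{\beta_2 L}{1-\alpha L} = \beta_1\,\frac{1 + \frac{\beta_2}{\beta_1}L}{1-\alpha L},
\]

which is equal to $\beta_1$ if $\beta_2 = -\alpha\beta_1$.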
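The cancellation can be checked numerically by tracing the effect of a one-off unit change in the regressor through model (12). The sketch below uses made-up coefficient values (they are not estimates from this study), and the function name `dynamic_multipliers` is hypothetical.

```python
def dynamic_multipliers(alpha, beta1, beta2, horizons):
    """Effect on y in periods t, t+1, ... of a one-off unit change in x at t.

    The coefficient on L^h in the series expansion is
    alpha**h * beta1 + alpha**(h - 1) * beta2 for h >= 1, and beta1 for h = 0.
    """
    effects = []
    for h in range(horizons):
        if h == 0:
            effects.append(beta1)
        else:
            effects.append(alpha ** (h - 1) * (alpha * beta1 + beta2))
    return effects


# Hypothetical values with beta2 = -alpha * beta1: only the direct effect remains.
direct_only = dynamic_multipliers(alpha=0.5, beta1=0.2, beta2=-0.1, horizons=5)

# Hypothetical values without the restriction: the effect dies out gradually
# and the multipliers sum to (beta1 + beta2) / (1 - alpha).
gradual = dynamic_multipliers(alpha=0.5, beta1=0.2, beta2=0.05, horizons=200)
print(direct_only, sum(gradual))
```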

Taking the above into account, the removal of lagged regressors can start. It begins with the lagged regressors whose coefficients are most insignificant and have the same sign as the coefficient of the contemporaneous realization of the same regressor. Continuing this procedure, at some point lagged realizations remain whose coefficients have signs opposite to those of the corresponding contemporaneous realizations. If the coefficient of such a lagged realization is highly insignificant, the lagged regressor is removed from the model as well. During all stages of this procedure, the autocorrelation tests and Hansen tests are satisfied, which indicates that the regressor classification is not rejected for the final model. The final model includes a regressor matrix $x_{i,t}$ that contains all regressors, while $x_{i,t-1}$ contains fatigue, workfitseducation,

5 Results

In this section, the results are reported. The first subsection presents the Arellano-Bond one- and two-step GMM estimation results and also discusses the Blundell-Bond estimation. In the second subsection, the robustness of the results is checked.

5.1 Main results

Table 2 reports the results of the Arellano-Bond one-step GMM estimation. Unfortunately, most coefficients are not significant. On the other hand, all tests regarding autocorrelation and validity of the instruments are satisfied, which indicates that the model is not rejected.

The first thing to notice is that the lagged dependent variable lnhourlywage has a coefficient that is significantly different from zero. An example illustrates how this coefficient can be interpreted: if an individual earned 10% more per hour last year, the hourly wage of this year will be approximately 2.58% higher. This is not the essence of a dynamic model, however. Fundamental is the long-term (total) effect, which can be calculated using the coefficient of the lagged dependent variable. That this coefficient is significantly different from zero implies that the regressors do not solely have a direct effect on hourlywage. An explanation can be found in the text below (12) in section 4.2.3. Additionally, the coefficients of age and age2 are significantly different from zero. Furthermore, the coefficient of yearsexperience shows that more years of experience in a job increase one's income.
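The 2.58% in this example is the usual first-order reading of a coefficient in a model with a logarithmic dependent variable and its own lag. A small sketch compares it with the exact implied change; the function name `wage_response` is hypothetical and the coefficient is the one reported for lnhourlywage in Table 2.

```python
import math

ALPHA = 0.25808  # coefficient of lagged lnhourlywage in Table 2


def wage_response(alpha, pct_increase_last_year):
    """Percentage change in this year's hourly wage implied by a given
    percentage increase in last year's hourly wage, other things equal."""
    log_change = math.log(1.0 + pct_increase_last_year / 100.0)
    exact = (math.exp(alpha * log_change) - 1.0) * 100.0
    approximate = alpha * pct_increase_last_year  # first-order approximation
    return exact, approximate


exact, approximate = wage_response(ALPHA, 10.0)
print(round(exact, 2), round(approximate, 2))
```

The exact figure is slightly below the approximation, which is why the example in the text is stated as approximate.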

The health characteristic variables show unfortunate results in the sense that most of their coefficients are not significantly different from zero. Notice that SAHS is divided into 3 groups such that there are enough respondents per group: poor (omitted), good and excellent. The group poor consists of respondents who assess their health as poor or moderate, and the group excellent of respondents who assess their health as very good or excellent. Although the coefficients for the health statuses good and excellent are both above zero, the 95% confidence intervals, computed from the robust standard errors in Table 2, range from approximately -0.09 to 0.18. Interestingly, the total effect of fatigue problems on hourlywage is negative, while the direct effect is positive (although insignificant). Moreover, the results show that often suffering from the flu has a negative effect on wage. The coefficients of the control variables that contain work characteristics are mostly not significantly different from zero. People with dangerous jobs earn less than other people. Additionally, individuals who work under time pressure earn a lower hourly wage than people who do not; this is even significant at the 1% level. Furthermore, although not significant, the direct effect of having a job that fits one's education is positive whereas the lagged effect is negative.
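Two of the quantities discussed above can be recomputed from Table 2. The sketch makes two assumptions that are not stated in the thesis: the confidence interval uses the normal critical value 1.96, and the total effect is read as the long-run multiplier $(\beta_1 + \beta_2)/(1 - \alpha)$, that is, the lag polynomial of section 4.2.3 evaluated at $L = 1$. The function names are hypothetical.

```python
ALPHA = 0.25808  # coefficient of lagged lnhourlywage in Table 2


def confidence_interval(coefficient, standard_error, z=1.96):
    """Normal-approximation confidence interval (z = 1.96 is an assumption)."""
    return coefficient - z * standard_error, coefficient + z * standard_error


def long_run_effect(alpha, beta_current, beta_lag=0.0):
    """Long-run multiplier (beta_current + beta_lag) / (1 - alpha)."""
    return (beta_current + beta_lag) / (1.0 - alpha)


# SAHS (good) and SAHS (excellent): coefficient and robust standard error.
ci_good = confidence_interval(0.04029, 0.06511)
ci_excellent = confidence_interval(0.05067, 0.06770)

# Fatigue: positive direct coefficient, negative lagged coefficient.
fatigue_total = long_run_effect(ALPHA, 0.00259, -0.02012)
sahs_good_total = long_run_effect(ALPHA, 0.04029)
print(ci_good, ci_excellent, fatigue_total, sahs_good_total)
```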

Table 2: Arellano-Bond 1-step GMM estimation output

| explanatory variable | lag | coefficient | robust standard error |
|---|---|---|---|
| lnhourlywage | 1 | 0.25808 | 0.05833∗∗∗ |
| numberhh | 0 | 0.00245 | 0.02273 |
| yearsexperience | 0 | 0.00404 | 0.00201∗ |
| age | 0 | 0.02677 | 0.01042∗∗∗ |
| age2 | 0 | −0.00033 | 0.00010∗∗∗ |
| SAHS (good) | 0 | 0.04029 | 0.06511 |
| SAHS (excellent) | 0 | 0.05067 | 0.06770 |
| physicalactivityheavy | 0 | −0.00108 | 0.00151 |
| physicalactivitymoderate | 0 | 0.00179 | 0.00114 |
| smoke | 0 | −0.01005 | 0.02749 |
| workhinder | 0 | 0.00626 | 0.00700 |
| joint | 0 | −0.02688 | 0.01384∗ |
| flu | 0 | −0.01441 | 0.00822 |
| flu | 1 | 0.00293 | 0.00856 |
| fatigue | 0 | 0.00259 | 0.01607 |
| fatigue | 1 | −0.02012 | 0.00948∗∗ |
| workfitseducation | 0 | 0.01073 | 0.00996 |
| workfitseducation | 1 | −0.00924 | 0.00786 |
| workfitsknowledge | 0 | 0.00356 | 0.02148 |
| workfitsknowledge | 1 | 0.01521 | 0.00867∗ |
| workdirty | 0 | 0.00962 | 0.00829 |
| workdangerous | 0 | −0.01828 | 0.00766∗∗ |
| workdangerous | 1 | 0.00881 | 0.00691 |
| workphysicaldemanding | 0 | 0.00494 | 0.01258 |
| workmentaldemanding | 0 | 0.05055 | 0.05339 |
| workmentaldemanding | 1 | −0.01052 | 0.01794 |
| workovertime | 0 | −0.00445 | 0.01070 |
| timepressure | 0 | −0.02274 | 0.00851∗∗∗ |
| workoutsideregularhours | 0 | 0.02401 | 0.02829 |
| period 2 | | −0.06794 | 0.03240∗ |
| period 3 | | −0.05354 | 0.03445 |
| period 4 | | −0.45924 | 0.02690∗ |
| period 5 | | −0.03232 | 0.01870∗ |
| period 6 | | 0.00071 | 0.00045 |
| period 7 | | 0.02172 | 0.01145∗ |

| | | | |
|---|---|---|---|
| number of observations | 5399 | p-value AR(1) | 0.000 |
| number of individuals | 2000 | p-value AR(2) | 0.746 |
| number of instruments | 348 | p-value Hansen | 0.829 |

∗, ∗∗ and ∗∗∗ denote significance at the 10%, 5% and 1% level respectively

Table 3 below reports the Arellano-Bond two-step GMM estimation results. The standard errors are Windmeijer corrected. The (incremental) Hansen tests and the autocorrelation tests give satisfactory results, so this model is not rejected either. Comparing the two estimation methods, most of the variables show more or less the same signs and levels of significance.

Table 3: Arellano-Bond 2-step GMM estimation output

| explanatory variable | lag | coefficient | standard error |
|---|---|---|---|
| lnhourlywage | 1 | 0.15407 | 0.05666∗∗∗ |
| numberhh | 0 | 0.01891 | 0.01909 |
| yearsexperience | 0 | 0.00325 | 0.00186∗ |
| age | 0 | 0.02555 | 0.00927∗∗∗ |
| age2 | 0 | −0.00032 | 0.00008∗∗∗ |
| SAHS (good) | 0 | 0.04393 | 0.05565 |
| SAHS (excellent) | 0 | 0.03657 | 0.05743 |
| physicalactivityheavy | 0 | −0.00028 | 0.00120 |
| physicalactivitymoderate | 0 | 0.00042 | 0.00085 |
| smoke | 0 | −0.01877 | 0.02294 |
| workhinder | 0 | 0.00892 | 0.00683 |
| joint | 0 | −0.00956 | 0.01297 |
| flu | 0 | −0.01310 | 0.00768∗ |
| flu | 1 | 0.00292 | 0.00672 |
| fatigue | 0 | 0.01149 | 0.01349 |
| fatigue | 1 | −0.00367 | 0.00770∗∗ |
| workfitseducation | 0 | 0.01242 | 0.00783 |
| workfitseducation | 1 | −0.00578 | 0.00610 |
| workfitsknowledge | 0 | −0.00036 | 0.01660 |
| workfitsknowledge | 1 | 0.00909 | 0.00621 |
| workdirty | 0 | −0.00099 | 0.00685 |
| workdangerous | 0 | −0.01296 | 0.00597∗∗ |
| workdangerous | 1 | 0.00111 | 0.00612 |
| workphysicaldemanding | 0 | −0.00545 | 0.01104 |
| workmentaldemanding | 0 | 0.00359 | 0.03390 |
| workmentaldemanding | 1 | 0.00343 | 0.01499 |
| workovertime | 0 | 0.00172 | 0.01028 |
| timepressure | 0 | −0.02152 | 0.00733∗∗∗ |
| workoutsideregularhours | 0 | 0.04060 | 0.02527 |
| period 2 | | −0.07242 | 0.03769∗ |
| period 3 | | −0.05479 | 0.03078∗ |
| period 4 | | −0.04346 | 0.02297∗ |
| period 5 | | −0.02996 | 0.01602∗ |
| period 6 | | −0.01673 | 0.00865∗ |
| period 7 | | 0.01957 | 0.01371 |

| | | | |
|---|---|---|---|
| number of observations | 5399 | p-value AR(1) | 0.000 |
| number of individuals | 2000 | p-value AR(2) | 0.490 |
| number of instruments | 348 | p-value Hansen | 0.829 |

∗, ∗∗ and ∗∗∗ denote significance at the 10%, 5% and 1% level respectively


step method is more efficient, the one-step estimation is preferred. Unreported results show that the Blundell-Bond estimation does not pass the additional Blundell-Bond assumptions. In other words, the assumption that the differenced instruments are uncorrelated with the individual fixed effects has been rejected.

5.2 One-step Arellano-Bond GMM

To assess the outcomes of this study, robustness checks are carried out. To check whether the final model (5) and the orthogonality conditions (7)-(9) are robust to changes, the Arellano-Bond one-step GMM estimation is repeated with some modifications. Arellano-Bond one-step GMM is considered the best performing estimation method in terms of the tests, so all checks are performed on this model.

5.2.1 Changes of the SAHS variable

Since the model has so far been estimated using three groups for the SAHS variable, namely poor, good and excellent, it might be interesting to perform the regression with only two groups, for example people who do not have excellent health and people who do. In this estimation, all tests regarding validity of the model are passed. Additionally, all coefficients are approximately the same, with roughly the same level of significance. The coefficient for people with excellent health is now equal to 0.01079 with a standard error of 0.01577. This may again indicate that, although not significantly, (very) healthy people earn more than people who are less healthy. The same regression is then performed with a different pair of groups: all people who rate their health higher than poor, and the people who rate their health as poor. The estimated coefficient of the dummy variable that is 1 for people assessing a health status higher than poor is 0.03529 with a standard error of 0.05579. Again, the tests are satisfied and the other coefficients do not change much.

Furthermore, SAHS was initially divided into 3 groups to prevent having too many regressors, since this harms the performance of the regression. The results obtained when using all SAHS categories could nevertheless be of interest. Using 5 categories for the SAHS variable leads to an interesting outcome: although this construction does not yield any significant coefficients either, differences between the categories are very hard to find. The coefficient for people with moderate health is equal to 0.03196 (standard error 0.06662) and the coefficients for good, very good and excellent are 0.04673 (0.06589), 0.01572 (0.06796) and 0.03517 (0.06984) respectively.
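The alternative groupings of the SAHS variable can be sketched as a recoding of the five answer categories into dummy variables. The names below (`sahs_dummies` and the scheme labels) are hypothetical. It is also an assumption of this sketch that both two-group splits are built on the three merged groups, so that "excellent" includes "very good" and "poor" includes "moderate"; the text does not state this explicitly.

```python
SAHS_LEVELS = ("poor", "moderate", "good", "very good", "excellent")

# Three-group coding used in the main model: poor is the omitted category.
THREE_GROUPS = {
    "poor": "poor",
    "moderate": "poor",
    "good": "good",
    "very good": "excellent",
    "excellent": "excellent",
}


def sahs_dummies(answer, scheme="three"):
    """Dummy variables for one SAHS answer under a given grouping scheme."""
    if answer not in SAHS_LEVELS:
        raise ValueError(f"unknown SAHS answer: {answer!r}")
    group = THREE_GROUPS[answer]
    if scheme == "three":
        return {"good": int(group == "good"),
                "excellent": int(group == "excellent")}
    if scheme == "excellent_vs_rest":  # assumed to use the merged top group
        return {"excellent": int(group == "excellent")}
    if scheme == "above_poor":         # assumed to use the merged bottom group
        return {"above_poor": int(group != "poor")}
    if scheme == "five":               # raw categories, poor omitted
        return {level: int(answer == level) for level in SAHS_LEVELS[1:]}
    raise ValueError(f"unknown scheme: {scheme!r}")


for scheme in ("three", "excellent_vs_rest", "above_poor", "five"):
    print(scheme, sahs_dummies("very good", scheme))
```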

Lastly, the Arellano-Bond one-step GMM estimation is carried out with only one health variable: SAHS. In this way, possible multicollinearity³⁴ of SAHS with other health variables that is not immediately visible may be avoided. Nevertheless, the results do not differ much from the previous results. The coefficients for people assessing their health as good and excellent are 0.03445 (0.06966) and 0.04713 (0.07145) respectively. Furthermore, the tests show satisfactory results³⁵. All coefficients of the other variables are, just as in the previous regressions with a different composition of the health variables, more or less the same as in the initial model.

5.2.2 Smaller instrument set

The estimation outputs that have been generated depend largely on the instrument set. Therefore, it might be interesting to re-estimate the model with a smaller instrument set. This is done by collapsing the instrument set and by reducing the number of lags that is used for the instruments³⁶.

Notice that for all unlagged regressors (endogenous, predetermined and exogenous), the third lagged realization is the deepest lag included in the instrument set of a particular regressor. This bound has been chosen since on the one hand the total number of instruments must not be too large (compared to the number of regressors and observations (Roodman, 2009b)) and on the other hand the instrument set must be strong enough.

³⁴ If the respondents are somehow able to rate their health very accurately, it is likely that there is multicollinearity between SAHS and all other health variables.

³⁵ The (incremental) Hansen test(s) and the tests regarding autocorrelation.

³⁶ Recall: in the previous regressions, an instrument set constructed from an endogenous regressor contained the second and third lagged realizations of the regressor, whilst the first, second and third lagged realizations were used to construct an instrument set from a predetermined variable. From an exogenous variable, the contemporaneous, the first lagged, the second lagged and the third lagged realizations are used to construct an instrument set from

Moreover, although the third lag seems a sensible limit given the considerations above, it is interesting to see how the model reacts to a different instrument set. Therefore, the same model is estimated using an instrument set in which the second lagged realization of any regressor is the limit. The outcome of this regression can be found in Table 4. Notice that the unreported outcomes of the incremental Hansen tests are satisfactory, so that reconsidering the categorization of the regressors does not seem necessary. Neither the Hansen test for the validity of the total instrument set nor the tests for autocorrelation give cause for concern.
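To illustrate how the lag limit drives the size of the instrument set, the sketch below counts the instrument columns that a single regressor contributes when the set is not collapsed. The first lag per regressor type follows footnote 36. The number of periods (seven) and the first period with a differenced equation (period 3) are assumptions made for illustration only, and the function and variable names are hypothetical.

```python
def count_columns(first_lag, last_lag, n_periods=7, first_equation=3):
    """Count uncollapsed GMM-style instrument columns for one regressor.

    Every combination of an equation period t and an available lag
    contributes its own column. A lag is available when the lagged
    observation t - lag falls inside the sample (period 1 or later).
    """
    total = 0
    for t in range(first_equation, n_periods + 1):
        for lag in range(first_lag, last_lag + 1):
            if t - lag >= 1:
                total += 1
    return total


# First lag used per regressor type, as described in footnote 36.
FIRST_LAG = {"endogenous": 2, "predetermined": 1, "exogenous": 0}

for limit in (3, 2):
    counts = {kind: count_columns(first, limit)
              for kind, first in FIRST_LAG.items()}
    print(f"lag limit {limit}: {counts}")
```

Under these assumptions a predetermined regressor contributes 14 columns with the third lag as limit and 10 columns with the second lag as limit. The counts only illustrate the mechanism; they are not meant to reproduce the number of instruments reported for the estimated model.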

It can be seen from Table 4 that the signs of all coefficients are the same as in the regression with the full instrument set, except for workdirty. This variable, however, is the most insignificant of all variables, both in the regression with the full instrument set and in this regression. The coefficient of the variable of major interest, SAHS, has increased in size.

Table 4: Arellano-Bond 1-step GMM using a smaller instrument set

explanatory variable       lag   coefficient  standard error
lnhourlywage               1      0.28834     0.06676***
numberhh                   0      0.01275     0.02550
yearsexperience            0      0.00412     0.00198**
age                        0      0.02396     0.01061***
age2                       0     −0.00031     0.00010***
SAHS (good)                0      0.05569     0.06981
SAHS (excellent)           0      0.06559     0.07217
physicalactivityheavy      0     −0.00087     0.00155
physicalactivitymoderate   0      0.00119     0.00115
smoke                      0     −0.00636     0.02898
workhinder                 0      0.00886     0.00801
joint                      0     −0.02843     0.01466*
flu                        0     −0.01315     0.00853
flu                        1      0.00365     0.00894
fatigue                    0      0.00517     0.01555
fatigue                    1     −0.01682     0.01055*
workfitseducation          0      0.00719     0.01196
workfitseducation          1     −0.00711     0.00782
workfitsknowledge          0      0.01293     0.02503
workfitsknowledge          1      0.01721     0.00867**
workdirty                  0     −0.00036     0.00854
workdangerous              0     −0.02196     0.00835***
workdangerous              1      0.00711     0.00724
workphysicaldemanding      0      0.00174     0.01342
workmentaldemanding        0      0.07656     0.06425
workmentaldemanding        1     −0.01481     0.018247
workovertime               0     −0.00573     0.01186
timepressure               0     −0.02408     0.00881***
workoutsideregularhours    0      0.01691     0.02993
period 2                         −0.04599     0.03289
period 3                         −0.03182     0.02492
period 4                         −0.02265     0.01748
period 5                         −0.01217     0.00964
period 6                         −0.00032     0.00025
period 7                          0.01900     0.01131

number of observations     5399     p-value AR(1)     0.000
number of individuals      2000     p-value AR(2)     0.641
number of instruments      260      p-value Hansen    0.710

*, ** and *** denote significance at the 10%, 5% and 1% level respectively.

Additionally, the instrument set can be reduced by collapsing it. Consider again the instrument matrix (11) constructed from a predetermined regressor. When collapsed, this matrix takes the form shown in (13) below, which results in a smaller instrument set.

\[
\begin{pmatrix}
0 & 0 & 0 \\
w_{i2} & w_{i1} & 0 \\
w_{i3} & w_{i2} & w_{i1} \\
w_{i4} & w_{i3} & w_{i2} \\
w_{i5} & w_{i4} & w_{i3} \\
w & w & w
\end{pmatrix}
\tag{13}
\]

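The difference between a collapsed and an uncollapsed instrument block can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: the function name, the hypothetical values in w, the seven periods and the choice of period 3 as the first period with a differenced equation are not taken from the thesis. With these assumptions the collapsed block shows the same shifting pattern as the rows with readable subscripts in (13).

```python
def instrument_matrix(w, lags=(1, 2, 3), collapse=False, first_equation=3):
    """Build the instrument block of one individual for one regressor.

    w[0] holds the observation for period 1, w[1] for period 2, etc.
    Rows correspond to periods 2..T; rows before `first_equation` stay
    zero because no differenced equation is assumed to exist there.
    """
    n_periods = len(w)
    periods = range(2, n_periods + 1)
    if collapse:
        # One column per lag, shared by all periods.
        columns = [(None, lag) for lag in lags]
    else:
        # One column per (period, lag) pair that lies inside the sample.
        columns = [(t, lag) for t in periods if t >= first_equation
                   for lag in lags if t - lag >= 1]
    rows = []
    for t in periods:
        row = []
        for col_t, lag in columns:
            usable = t >= first_equation and t - lag >= 1
            same_period = col_t is None or col_t == t
            row.append(w[t - lag - 1] if usable and same_period else 0)
        rows.append(row)
    return rows


w = [11, 12, 13, 14, 15, 16, 17]  # hypothetical values w_i1, ..., w_i7
for row in instrument_matrix(w, collapse=True):
    print(row)
```

Calling the function with collapse=False gives a block with one column for every usable combination of period and lag, which is why the uncollapsed instrument count grows with the number of periods while the collapsed count does not.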

The results of this regression are very different from those of the previous regression. This may be because some instrument sets are weaker than they were when more instruments were included (not collapsed). Moreover, the coefficients of the variable of major interest are negative for both the good and the excellent health group. However, the standard error (0.05976) for the "good" group is far larger than the coefficient (-0.00138), and the standard error (0.06271) for the "excellent" group is larger still relative to the coefficient (-0.00104). This regression therefore indicates that the coefficients corresponding to the SAHS are not significantly different from zero. In addition, the incremental Hansen test of the instrument subset constructed from the SAHS variable has a lower p-value than before (0.297).
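How small these coefficients are relative to their standard errors can be checked with a short calculation. The sketch below computes z-statistics and two-sided p-values for the two SAHS coefficients quoted above. It assumes a standard normal reference distribution, which is an asymptotic approximation, and the function name is hypothetical.

```python
import math


def z_and_p(coefficient, standard_error):
    """Return the z-statistic and the two-sided p-value under a
    standard normal reference distribution."""
    z = coefficient / standard_error
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Collapsed-instrument estimates of the SAHS coefficients quoted above.
estimates = {"good": (-0.00138, 0.05976), "excellent": (-0.00104, 0.06271)}
for group, (coef, se) in estimates.items():
    z, p = z_and_p(coef, se)
    print(f"SAHS ({group}): z = {z:.3f}, p = {p:.3f}")
```

Both z-statistics are very close to zero, so neither coefficient can be distinguished from zero at any conventional significance level.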

Additionally, the p-value of the AR(2) test, 0.278, is below the limit37 of 0.5. Other coefficients have also decreased or increased in magnitude, and some have a different sign than before. But since the test for second order autocorrelation is not passed, and since some p-values of the incremental Hansen tests were substantially lower than before, no conclusions will be drawn from this result. The author is aware that the outcomes of the (incremental) Hansen test(s) are strange. If the uncollapsed instrument set shows satisfactory results regarding validity, the collapsed instrument sets are expected to be valid too. In theory this must be the case, since the collapsed instrument sets impose fewer moment conditions.
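The argument can be made explicit in terms of moment conditions. The notation below is an assumption made for illustration and is not taken from the thesis: $\Delta\varepsilon_{it}$ denotes the first-differenced disturbance and $w_{i,t-l}$ the $l$-th lag of a predetermined regressor.

```latex
% Uncollapsed: one moment condition per period t and lag l
E\left[ w_{i,t-l}\,\Delta\varepsilon_{it} \right] = 0
\qquad \text{for each } t \text{ and each } l \in \{1,2,3\}.

% Collapsed: one moment condition per lag l, summed over the periods
E\left[ \sum_{t} w_{i,t-l}\,\Delta\varepsilon_{it} \right]
  = \sum_{t} E\left[ w_{i,t-l}\,\Delta\varepsilon_{it} \right] = 0
\qquad \text{for each } l \in \{1,2,3\}.
```

Each collapsed condition is a sum of uncollapsed conditions, so validity of the uncollapsed set implies validity of the collapsed set in the population. In a finite sample the Hansen statistic need not follow this ordering, since the two instrument sets lead to different estimates and weighting matrices.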

5.2.3 Less strict constraint on sufficient non-zero observations

Lastly, the model is examined when the constraint regarding the minimal number of changes is reduced to 100. The variables that are now included in the model are the health dummy variables38 breathproblems [148], stomachproblems [259], highbloodpressure [289], highcholesterol [197] and arthritis [147], and the work dummy variable managerfunction [168].
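The selection rule can be sketched as follows. The sketch assumes that a "change" is a switch in the value of a dummy variable between two consecutive observations of the same individual; this reading, the data layout and all names in the code are assumptions made for illustration and may differ from the definition used in the thesis.

```python
def count_changes(panel, variable):
    """Count switches of a dummy variable between consecutive
    observations of the same individual.

    `panel` maps an individual id to a list of per-period records
    (dicts), ordered by period.
    """
    changes = 0
    for records in panel.values():
        values = [record[variable] for record in records]
        changes += sum(1 for previous, current in zip(values, values[1:])
                       if previous != current)
    return changes


def select_variables(panel, variables, minimum_changes=100):
    """Keep the variables that change at least `minimum_changes` times."""
    return [v for v in variables
            if count_changes(panel, v) >= minimum_changes]


# Tiny hypothetical panel: two individuals, three periods each.
panel = {
    1: [{"arthritis": 0}, {"arthritis": 1}, {"arthritis": 1}],
    2: [{"arthritis": 0}, {"arthritis": 0}, {"arthritis": 0}],
}
print(count_changes(panel, "arthritis"))
```

Lowering minimum_changes admits variables with less within-individual variation, which is the variation that identifies the coefficients once the model is estimated in first differences.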

The model is constructed in the same way as described in section 4.2. Initially, all39 variables are assumed endogenous and are added to the model including their lagged realizations40. The tests regarding model misspecification showed satisfactory results during all stages. After carrying out all steps described in section 4.2, the coefficients of the variables that were present in the initial model (see Table 2) remained more or less the same. The coefficients for the variable of major interest, SAHS, are 0.04560 (0.06734) and 0.05699 (0.06997) for people assessing good and excellent health respectively. Furthermore, the coefficient of the lagged dependent variable remained significant at the 1% level.

37 This is not a limit by nature, as is already explained in 4.1.2, but this limit has been set whilst taking into account the chance of type 1 and type 2 errors.

38 [number of changes]

The results for the new variables included in the model are interesting. The coefficient of the lagged variable highbloodpressure is -0.02423 and is even significant at the 10% level. Additionally, the coefficient corresponding to the lagged realization of breathproblems is -0.04072 (0.01498). Furthermore, the variable highcholesterol shows a coefficient of -0.05186 (0.04027). The other added variables are not even close to significance. It makes sense that the coefficients corresponding to these variables are negative, since bad health possibly affects income negatively. One could argue, however, that in a well-developed country such as the Netherlands these results are quite unfortunate. It remains questionable whether outcomes based on only a small part of the total data are credible.
