
The effect of insurance coverage on the number of physician visits : a study taking into account endogenous treatment effects and endogenous participation


Academic year: 2021


Master’s Thesis Econometrics

The effect of insurance coverage on the number of physician visits

A study taking into account endogenous treatment effects and endogenous participation

Petra van Meel

(10640916)

MSc in Econometrics

Master’s track: Econometrics
Date of final version: June 25, 2017
Supervisor: Dr. J.C.M. van Ophem
Second reader: Dr. K.J. van Garderen

SUMMARY

This study investigates the effect of health insurance coverage on the number of physician visits in the United States while taking into account both the possible endogeneity of private health insurance and endogenous participation. The econometric techniques in this study are predominantly based on a model proposed by Bratti and Miranda (2011). In contrast to these authors, this study avoids the use of maximum simulated likelihood and the associated simulation bias by numerically integrating out unobserved heterogeneity. The estimation results indicate that the initial physician visit and the number of physician visits conditional on at least one visit are governed by two different data generating processes. Moreover, once the two data generating processes are allowed to differ, there is no statistically significant evidence that the private insurance variable is endogenous. Furthermore, being insured positively affects both the probability of a first physician visit and the number of physician visits conditional on at least one visit.


This document is written by Petra van Meel who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document is original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

1 Introduction

2 Literature review
2.1 U.S. health insurance market
2.2 Endogeneity health insurance coverage
2.3 Endogenous participation

3 Data and variables
3.1 MEPS 2014 data
3.2 Variables and hypotheses
3.2.1 Demographic variables
3.2.2 Socio-economic variables
3.2.3 Health-related variables
3.3 Descriptive statistics

4 Model and methodology
4.1 Count data models
4.2 Empirical methodology
4.2.1 Endogenous treatment and endogenous participation equations
4.2.2 Count equation
4.2.3 Three-equation model
4.2.4 Estimation method

5 Results
5.1 Unobserved heterogeneity and correlation among error terms
5.2 Comparison of Poisson, ET-, EP- and EPET-Poisson
5.3 Estimation results EPET-Poisson

6 Conclusion

Bibliography

Appendix


Chapter 1

Introduction

Since 1960, United States health care spending has shown an upward trend. Health care expenditures per capita increased from $5,327 to $9,451 between 2002 and 2015 (OECD, 2017). Over the same period, health care expenditures rose from 14.0% to 16.9% of gross domestic product. The trend is partly due to inflated prices of drugs, medical devices and health care in general (Moses et al., 2013). However, other factors such as the ageing of the population might additionally contribute to the rise in health care spending, as health care demand is, on average, higher for older than for younger individuals (Riphahn et al., 2003). The upward trend is predicted to continue in the foreseeable future: health care expenditures per capita are expected to increase by 4.7% per year between 2016 and 2025 (Department of Health and Human Services, 2017). The expected growth in health care spending represents a considerable challenge for future health care policy.

In order to mitigate the rise in health care expenditures, it is important to improve knowledge about determinants of health care utilization. This can be accomplished by exploring health care demand and, among others, its interrelation with insurance coverage. The aim of this study is therefore to investigate the effects of demographic, health and socio-economic factors and in particular health insurance coverage on the number of physician visits. Several empirical challenges and estimation issues arise in this study.

Firstly, an endogeneity problem with respect to private health insurance coverage is most probably present. On the one hand, various studies find evidence for so-called moral hazard: being insured (irrespective of the type of health care insurance) affects the decision to seek health care and increases utilization of health care (Shen, 2013). On the other hand, some types of insurance coverage, such as private insurance coverage, are predominantly a choice made by the individual. As a consequence, individuals with a less favourable health status (high-risk types) might have a stronger incentive to obtain insurance coverage, which is known as adverse selection. In their study of the effect of add-on insurance on the number of hospital visits, Riphahn et al. (2003) state that their results suggest the presence of adverse selection. They acknowledge, however, that the results cannot be interpreted as irrefutable evidence for this. Even though the evidence is not unambiguous, reverse causality between individually chosen insurance coverage and the number of physician visits might be present. Since neglecting endogeneity leads to inconsistent estimates, the possible endogeneity of insurance coverage should be taken into account when estimating the econometric model in this study.

Secondly, an endogenous participation problem exists. Whereas patients themselves decide on the initial physician visit, the patient and physician jointly decide on the treatment, and hence on the number of subsequent physician visits. Consequently, the participation into a physician visit and the intensity of the number of physician visits may be two distinct processes. Different data generating processes should therefore be allowed for zero physician visits and strictly positive physician visits (Bratti and Miranda, 2011; Shen, 2013).

In order to obtain a first insight into the relation of interest in this study, a literature review on the U.S. health care system is conducted and several methods and results of previous studies accommodating for endogeneity of health insurance or endogenous participation are discussed. Thereafter, the data obtained from the Medical Expenditure Panel Survey (MEPS), which is an ongoing nationally representative survey of the civilian non-institutionalized United States (U.S.) population, is introduced. MEPS provides data on health services used by U.S. citizens, their health care expenditures, health insurance and health status. In addition, data on several demographic and socio-economic factors of the respondents are available. After data cleaning, the characteristics of the data are investigated and a model to investigate the effect of insurance coverage on the number of physician visits is constructed.

Several studies in health economics address either endogenous treatment or endogenous participation. However, the aforementioned findings motivate the construction of a model for the effect of health insurance coverage on the number of physician visits that takes into account both the possible endogeneity of health insurance and endogenous participation. This study therefore applies a method proposed by Bratti and Miranda (2011), who use maximum simulated likelihood to introduce an estimator for models where a count variable (the number of physician visits) is affected by an endogenous binary treatment (health insurance coverage) and which in addition account for either sample selection or endogenous participation. The authors estimate a three-equation model and introduce a random variable representing unobserved individual heterogeneity to allow for correlation among the treatment, participation and the count variable. In contrast to Bratti and Miranda (2011), this study numerically integrates out the unobserved heterogeneity and therefore avoids the use of maximum simulated likelihood. Although both methods impose the same distributional assumptions and are thus equally restrictive, integrating out is preferred as it avoids simulation bias.
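As a rough illustration of what integrating out unobserved heterogeneity numerically (rather than by simulation) can look like, the sketch below evaluates the marginal probability of a count under a Poisson model in which a standard normal heterogeneity term eta enters the log-mean as sigma times eta. The integration rule (composite Simpson on a truncated grid), the function names and the parameter values are assumptions chosen for illustration only; they are not the specification or the quadrature used in this study.

```python
import math


def poisson_pmf(y, mu):
    """Poisson probability of count y given mean mu (computed on the log scale)."""
    return math.exp(-mu + y * math.log(mu) - math.lgamma(y + 1))


def normal_pdf(eta):
    """Standard normal density."""
    return math.exp(-0.5 * eta * eta) / math.sqrt(2.0 * math.pi)


def marginal_pmf(y, xb, sigma, lower=-8.0, upper=8.0, n_intervals=400):
    """Approximate P(y) = integral of Poisson(y | exp(xb + sigma*eta)) * phi(eta) d eta
    with composite Simpson's rule; n_intervals must be even."""
    h = (upper - lower) / n_intervals
    total = 0.0
    for i in range(n_intervals + 1):
        eta = lower + i * h
        # Simpson weights: 1 at both ends, 4 on odd nodes, 2 on even interior nodes
        if i == 0 or i == n_intervals:
            weight = 1.0
        elif i % 2 == 1:
            weight = 4.0
        else:
            weight = 2.0
        total += weight * poisson_pmf(y, math.exp(xb + sigma * eta)) * normal_pdf(eta)
    return total * h / 3.0


# Hypothetical values: linear index 0.5 and heterogeneity scale 0.7
probs = [marginal_pmf(y, xb=0.5, sigma=0.7) for y in range(200)]
total_probability = sum(probs)
mean_count = sum(y * p for y, p in enumerate(probs))
```

Because the integral is evaluated on a fixed grid, repeated evaluations give identical values and no simulation draws are involved. In this toy setting the implied mean can be checked against the log-normal formula exp(xb + sigma^2 / 2).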

This study is organised as follows. Chapter 2 contains a literature review on the possible endogeneity of health insurance and discusses the issue of endogenous participation. Chapter 3 subsequently describes and analyses the MEPS 2014 dataset. Additionally, health utilization determinants are constructed and hypotheses are formulated. Thereafter, theory on count data models and the empirical methodology applied in this study are described in Chapter 4. The estimation results of this study are then presented in Chapter 5. Lastly, Chapter 6 draws a conclusion from the results.


Chapter 2

Literature review

In order to gain insights into possible endogeneity of insurance coverage and endogenous participation, this chapter discusses these issues and describes results from earlier research in respect of the effect of health insurance on health care utilization. First, a brief explanation of the U.S. health insurance system is presented in Section 2.1. Subsequently, Section 2.2 and Section 2.3 discuss possible methods to accommodate for endogeneity of health insurance coverage and endogenous participation, respectively.

2.1 U.S. health insurance market

This section briefly introduces the health insurance system in the United States by using information obtained from the U.S. Census Bureau (Smith and Medalia, 2015) and the U.S. federal government. Health insurance is an insurance against the risk of medical expenses. The two main forms of health insurance in the U.S. can be classified as public and private care coverage, with private coverage being the primary source of insurance in the United States.

Private health insurance consists of insurance through an employer or union, or insurance coverage that individuals purchase themselves from an insurance company. This type of insurance coverage is predominantly a choice of the individual (a job choice may be based on the type or amount of coverage provided through an employer) and consequently entails the risk of endogeneity. This is explained in more detail in Section 2.2.

Other than private coverage, individuals can be covered by insurance provided by the government. This so-called public insurance is offered to low-income families and elderly people and consists, among others, of Medicare, Medicaid, Children’s Health Insurance Program (CHIP) and TRICARE. Medicare is a federal social insurance plan intended for elderly individuals (65 years and over), disabled individuals younger than 65, or individuals with end-stage kidney diseases. Medicaid on the other hand offers free health care (or at a low cost) to low-income families and pregnant women. Furthermore, CHIP provides health care coverage for uninsured children in families where income exceeds the Medicaid thresholds, yet is too low to obtain private insurance. Finally, TRICARE provides health care benefits for (former) military personnel and their families.

Besides the insured, there is a relatively large group that is completely uncovered throughout the year (the uninsured). In order to, among other things, reduce the large number of uninsured individuals in the U.S., the Patient Protection and Affordable Care Act (also known as Obamacare) was signed into law in 2010. Between 2010 and 2020, provisions such as taxes, subsidies and regulations of the health care system are going into effect. These provisions are intended to enhance quality, affordability and hence availability of health insurance and to diminish health care expenditures. Under Obamacare, an individual who is able to afford health insurance but is nevertheless uninsured is obligated to pay a fine in the form of a tax penalty. Even though the uninsured rate has been declining in recent years, 10.4% of the United States population was still uninsured in 2014 (Smith and Medalia, 2015). As in the case of private insurance, being uninsured is a choice of the individual.

2.2 Endogeneity health insurance coverage

This section presents the issue of endogeneity of health insurance coverage, discusses for which types of insurance coverage endogeneity exists and describes methods employed on this subject in previous studies.

Because utilization of health care as well as health insurance coverage are affected by an individual's health status (Hidayat and Pokhrel, 2009), an endogeneity problem with respect to insurance coverage is most probably present. On the one hand, various studies find evidence that being insured, irrespective of the type of insurance, affects the decision to seek health care and increases utilization of health care (Arrow, 2001; Shen, 2013). This is known as moral hazard, a phenomenon that is observed in many different insurance markets and is a consequence of informational asymmetries (Chiappori and Salanie, 2000). On the other hand, when insurance coverage is (at least partly) a choice made by the individual, adverse selection arises: individuals with a less favourable health status might have a stronger incentive to obtain insurance coverage (Chiappori and Salanie, 2000). As public coverage is provided by the government, and is therefore not an individual choice, this issue does not play a role for the publicly insured. Adverse selection is, however, relevant for individuals who are uninsured or have private insurance coverage. For these groups, it is widely acknowledged that reverse causality between individually chosen insurance coverage and the number of physician visits is present. Private insurance coverage is therefore most probably endogenous (Cameron et al., 1988; Riphahn et al., 2003; Shen, 2013).

In case of endogeneity, regression estimates measure only the magnitude of the relation between the dependent variable and endogenous explanatory variable instead of both the magnitude and causal direction (Cameron and Trivedi, 2005). As a result, parameter estimates are inconsistent and do not reflect the true population parameter. It is therefore important to take into consideration the possible endogeneity of insurance coverage (Hidayat and Pokhrel, 2009). This section continues with discussing various methods proposed for estimating models in case of endogenous treatment.

A first possibility to address the problem of endogeneity of private insurance coverage is to use linear instrumental variables (IV) or generalized method of moments (GMM). Hidayat and Pokhrel (2009) employ this method to estimate the relationship between insurance status and the number of outpatient visits to health providers. The authors argue that IV and GMM both result in consistent parameter estimates when unobserved heterogeneity is correlated with regressors (although IV methods lead to incorrect standard errors in this case). Furthermore, consistent estimation requires relevant and exogenous instruments (Cameron and Trivedi, 2005, p. 197). However, as is noted by Shen (2013), instruments that are sufficiently correlated with health insurance coverage are often also correlated with health care usage. Dealing with endogeneity of health insurance by using instruments is therefore complex.

Another method to handle dummy endogenous variables in a continuous setting (where health care expenditures serve as a proxy for health care utilization) is to employ a Heckman two-step selection model (Cameron and Trivedi, 2005, p. 550). The first step is to estimate a participation equation (insurance choice) and to construct a Heckman correction term. The model for health expenditures is then augmented by the correction term to ensure that the conditional expectation of the error term is zero and as a result, the health expenditures equation in step 2 can be consistently estimated by OLS. However, a limitation of this method is that functional form and distributional assumptions need to be made. See Coslett (1991), Newey (2009) and Robinson (1988) for semi-parametric alternatives that relax the distributional assumptions of the parametric Heckman set-up.
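For reference, the correction term in the textbook sample-selection version of the two-step procedure can be written as follows. The notation ($d_i$, $z_i$, $\gamma$, $\beta$, $\rho$, $\sigma_\varepsilon$) is introduced here purely for illustration and assumes jointly normal errors with $u_i \sim N(0,1)$ and $\mathrm{corr}(u_i, \varepsilon_i) = \rho$; it is the standard form and not an equation taken from this study.

\[
d_i = \mathbf{1}[z_i'\gamma + u_i > 0], \qquad
y_i = x_i'\beta + \varepsilon_i \quad \text{(observed only if } d_i = 1\text{)},
\]
\[
E[y_i \mid x_i, z_i, d_i = 1] = x_i'\beta + \rho\,\sigma_\varepsilon\,\lambda(z_i'\gamma), \qquad
\lambda(c) = \frac{\phi(c)}{\Phi(c)}.
\]

Under these assumptions, step 1 estimates $\gamma$ by probit and forms $\hat{\lambda}_i = \lambda(z_i'\hat{\gamma})$, and step 2 regresses $y_i$ on $x_i$ and $\hat{\lambda}_i$ by OLS. The dependence on the normal density $\phi$ and distribution function $\Phi$ makes the distributional assumption explicit.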

A third alternative is to use data from randomized experiments to obviate the problem of endogeneity (Deb and Trivedi, 2002; Manning et al., 1987). As argued by Deb and Trivedi (2002), random assignment of insurance plans ensures that no choices of individuals are involved and as a result insurance coverage can be treated as exogenous. The authors acknowledge, however, that experimental data, if available, is often obsolete and no longer suits altered health and insurance markets.

2.3 Endogenous participation

In addition to the endogeneity problem outlined in the previous section, an endogenous participation problem with respect to health care utilization most probably exists. The contact decision, i.e. the first physician visit, is initiated by the patient and thus solely depends on the individual. However, the patient and physician jointly decide on the treatment and hence on the number of subsequent physician visits (the level decision of health care utilization). The decision-making process thus consists of two stages and the participation into a physician visit and the intensity of physician visits may be governed by two distinct processes.

As a consequence of the two-stage decision making, different data generating processes for both stages should be allowed for (Bratti and Miranda, 2011; Pohlmeier and Ulrich, 1995; Shen, 2013). Notwithstanding, several studies do not distinguish between the contact decision and the decision for health care utilization once initial contact is made. Riphahn et al. (2003) for example, focus solely on the patient as the decision-maker for health demand. A principal agent model, in which it is assumed that the demand for medical services is predominantly determined by the physician, is on the other hand used by Anderson et al. (1981). Pohlmeier and Ulrich (1995) stress that neglecting the two stages of the decision making process, and thus estimating the decision with only one regression equation, induces inconsistent parameter estimates and an incorrect economic interpretation. Accommodating for the two-part decision making process, and hence allowing parameters and even explanatory variables to differ among the two stages, is therefore of great importance. Extensive research on the decision-making process in health care utilization is conducted and the next paragraphs discuss some possible methods to deal with the contact and level decision separately.

From the literature it appears that the dual decision structure is frequently captured by the two-part model (TPM). This model is for example applied in the health demand-related studies of Deb and Trivedi (2002), Duan et al. (1983) and Pohlmeier and Ulrich (1995). According to Deb and Trivedi (2002), part of the popularity of the TPM is due to the fact that it takes into consideration the large fraction of zeros that is often present in health care utilization data. More importantly, however, the processes underlying the contact and level decision are not constrained to be the same. More specifically, the TPM consists of two parts and can be estimated by maximum likelihood. The first part relates to the participation decision: a binary (zero/positive) outcome model such as logit or probit is used to estimate the probability to seek health care (Cameron and Trivedi, 2005, p. 545). The second part models the level of utilization conditional on positive usage. The TPM can be formulated as:

\[
f(y \mid x) =
\begin{cases}
\Pr[d = 0 \mid x] & \text{if } y = 0, \\
\Pr[d = 1 \mid x] \, f(y \mid d = 1, x) & \text{if } y > 0,
\end{cases}
\]

where y is the dependent variable, x denotes a vector of explanatory variables and d is a binary indicator that equals 1 if usage is positive and zero otherwise. The TPM thus permits the parameters for Pr(y = 0) and E(y|y > 0) to be different (Deb and Trivedi, 2002). Notice that this model assumes the d and y processes to be uncorrelated.
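The two-part density can be sketched for count data as follows. This is a minimal illustration under assumed functional forms: a logit for the participation part and a zero-truncated Poisson for the positive part. The function names, the linear-index arguments and the numerical values are hypothetical and do not correspond to any specification estimated in the studies cited here.

```python
import math


def logistic(z):
    """Logit link: maps a linear index to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def truncated_poisson_pmf(y, mu):
    """Poisson probability of y conditional on y >= 1."""
    log_pmf = -mu + y * math.log(mu) - math.lgamma(y + 1)
    return math.exp(log_pmf) / (1.0 - math.exp(-mu))


def two_part_pmf(y, index_participation, index_count):
    """f(y|x): Pr[d=0|x] if y == 0, else Pr[d=1|x] * f(y | d=1, x)."""
    p_positive = logistic(index_participation)  # Pr[d = 1 | x]
    if y == 0:
        return 1.0 - p_positive
    return p_positive * truncated_poisson_pmf(y, math.exp(index_count))


def two_part_loglik(counts, indices_participation, indices_count):
    """Sum of log f(y_i | x_i) over observations."""
    return sum(
        math.log(two_part_pmf(y, a, b))
        for y, a, b in zip(counts, indices_participation, indices_count)
    )


# Hypothetical linear indices for a single individual
pmf_values = [two_part_pmf(y, 0.3, 1.1) for y in range(100)]
total_probability = sum(pmf_values)
loglik = two_part_loglik([0, 2, 5], [0.3, 0.3, 0.3], [1.1, 1.1, 1.1])
```

Since the participation part and the count part share no parameters in this formulation, the log-likelihood splits into a binary-outcome term and a truncated-count term, so each part can be maximised separately.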

The second part of the TPM can be modelled as either a continuous or discrete random variable. A continuous two-part model is used by Duan et al. (1983) to investigate the effect of insurance plans on health care expenditures, which is a proxy for health care utilization. They define a probit model for zero or positive health care expenditures and a log-normal model for expenditures conditional on y > 0. By comparing the forecast performance (measured by the mean squared forecast error) of a one-part and two-part model, they show that the predictive performance of the two-part model is better. This underlines the importance of allowing for two different DGPs for the contact and level decision.

In the specific case of count data, the TPM is generally referred to as the hurdle model (Cameron and Trivedi, 2005). Brown et al. (2005) and Pohlmeier and Ulrich (1995) estimate a Negative Binomial distributed hurdle model (detailed in Section 4.1) for the number of doctor visits/days hospitalized in Mexico and the number of visits to general practitioners and specialists in Germany, respectively. Both studies estimate the second-stage decision by a zero-truncated Negative Binomial model. Specification tests performed by Pohlmeier and Ulrich (1995) indicate that, in line with expectations, two different processes should be allowed for the two parts of the decision-making process. In accordance with this finding, Gerdtham (1997) shows that two-part models provide a better fit for health care utilization data than regular Poisson or Negative Binomial models.


As an alternative to differentiating between participants and non-participants as in the TPM, a distinction between frequent and infrequent participants can be made. For this purpose, Deb and Trivedi (2002) employ a finite mixture variant of the latent class model (LCM) in order to recover 'groups' from the data (Oberski, 2016). While the parameters within these groups are expected to be equal, they differ across groups. By a comparison of the TPM and LCM based on several model selection tests, Deb and Trivedi (2002) show that the LCM is superior to the TPM. However, Cameron and Trivedi (1998) argue that the TPM is more appealing than the LCM to describe the number of recreational trips. Which of the two models is preferable therefore depends on the application.
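In generic form, a finite mixture density with $C$ latent classes can be written as below. The symbols $\pi_j$, $\theta_j$ and $C$ are notation assumed here for illustration; with $C = 2$ the classes would correspond to infrequent and frequent users.

\[
f(y \mid x) = \sum_{j=1}^{C} \pi_j \, f_j(y \mid x, \theta_j), \qquad
\pi_j > 0, \qquad \sum_{j=1}^{C} \pi_j = 1.
\]

In contrast to the TPM, zeros and positive counts can arise from every class in this formulation; the classes differ only in their parameters $\theta_j$.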

In summary, various health-related studies find that health insurance coverage is endogenous and that endogenous participation is present. Some widely applied methods dealing with either of these issues have been discussed. However, with the exception of Shen (2013), few studies address both problems at the same time. Shen (2013) uses a semi-parametric approach to analyse a three-equation health decision system (insurance coverage, utilization and the level of health care expenditures). By doing so, she accommodates for both issues described in this chapter. This study also takes both issues into account, but considers the number of physician visits instead of health care expenditures as a proxy for health care utilization. Before outlining the empirical methodology, the data and variables used in this study are presented in Chapter 3.


Chapter 3

Data and variables

This chapter introduces the MEPS dataset and presents some descriptive statistics. Section 3.1 elaborates on the data source and defines the final sample. Section 3.2 thereafter operationalises the dependent and relevant explanatory variables. In addition, hypotheses on the effects of the explanatory variables based on economic theory and findings of earlier studies are formulated. Finally, the data is analysed in Section 3.3 in order to gain insights into its characteristics.

3.1 MEPS 2014 data

The data used in this study is obtained from the Medical Expenditure Panel Survey (MEPS) (Cohen et al., 2009), which is an ongoing nationally representative annual survey of the civilian non-institutionalized U.S. population. By combining information obtained by surveying households, employers and medical providers, MEPS provides data on health care used by U.S. citizens, their health care expenditures and health care insurance. In addition, data on health status, demographics and socio-economic factors of the respondents are available. The way in which the MEPS data is collected makes the provided information fairly reliable. For instance, recall errors are diminished because data collected from medical providers such as physicians, hospitals and pharmacies is used to edit and supplement data on health components provided by households.

More specifically, this study uses the 2014 MEPS data. The sample originally consists of 33,162 individuals in 13,421 families. However, the sample size is reduced as a consequence of the fact that incomplete observations with respect to e.g. mental health and industry occupation are eliminated. Furthermore, questions regarding several health issues were not incorporated in the surveys for respondents under eighteen. For this reason, only respondents of eighteen years and older are included in the subsample. Moreover, the maximum likelihood estimation method used in this study (discussed further in Section 4.2) requires independence of observations. MEPS however uses household-level data and as a consequence of unobserved family factors, observations from within a family (cluster) are possibly dependent. To mitigate this problem, a random sample is drawn in order to select only one individual per household. The final sample consists of 7,609 observations.
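The step of keeping one randomly drawn individual per household could be sketched as follows. The record layout, the field names (`household_id`, `person_id`, `visits`) and the seed are hypothetical and are not MEPS variable names; the sketch only illustrates the sampling idea, not the procedure actually run for this study.

```python
import random
from collections import defaultdict


def one_per_household(records, seed=0):
    """Return one randomly chosen record per household identifier."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    by_household = defaultdict(list)
    for record in records:
        by_household[record["household_id"]].append(record)
    # Draw a single member uniformly at random within each household
    return [rng.choice(members) for members in by_household.values()]


# Hypothetical toy data: three households with 2, 1 and 3 adult members
records = [
    {"household_id": 1, "person_id": "1a", "visits": 0},
    {"household_id": 1, "person_id": "1b", "visits": 4},
    {"household_id": 2, "person_id": "2a", "visits": 1},
    {"household_id": 3, "person_id": "3a", "visits": 7},
    {"household_id": 3, "person_id": "3b", "visits": 0},
    {"household_id": 3, "person_id": "3c", "visits": 2},
]
sample = one_per_household(records)
```

Under this scheme an individual's selection probability is the inverse of the number of eligible members in the household, so members of larger households are less likely to be selected.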

It should be emphasized that no values in respect of (endogenous) insurance coverage are missing and selectivity issues on account of this are therefore expected to be minor. However, whether the final sample is representative of the population cannot be evaluated. The next section continues with the operationalisation of the dependent variable of interest and possible determinants of health care utilization.


3.2 Variables and hypotheses

This section operationalises the variables used in this study and formulates hypotheses on the effects of the explanatory variables on the dependent variable. The dependent variable of interest is the number of visits to a physician, which is a discrete outcome variable that can take non-negative integer values without an upper limit. More specifically, the number of visits is a count variable and takes values between 0 and 75 in the final sample. Since outliers contaminate standard error estimation (Cameron and Trivedi, 2005, p. 361), one exceptionally large observation (186 physician visits) has been excluded from the analysis.

To control for insurance coverage, dummy variables indicating whether the respondent is privately insured or publicly insured are taken into account (reference category is being uninsured). The privately insured group consists of the respondents who purchased an insurance contract themselves or have insurance through the employer, whereas the publicly insured obtain coverage by Medicaid, Medicare, CHIP and other social security benefit programs. As a consequence of moral hazard (Section 2.2), being insured is expected to be positively related to the number of physician visits. Furthermore, as the health of a publicly insured individual is, on average, poorer, the publicly insured are expected to visit the physician more often than the privately insured. Recall that private insurance is treated as endogenous. Regarding this type of insurance coverage, a distinction could be made between respondents who purchased insurance themselves and those that are insured through the employer. This would result in two endogenous variables. As this would substantially complicate the method outlined in Section 4.2, private insurance is not divided in subcategories in this study.

The explanatory variables are incorporated in the analysis based on economic theory and empirical results from previous studies on health care demand. Shen (2013) uses predominantly the same explanatory variables as this study. Although the main dependent variable in Shen (2013) − the total expenditures for health care services − differs from the dependent variable in this study, the two dependent variables are clearly interrelated. Consequently, the findings of this study can, to some extent, be compared to the results of Shen (2013). However, as there may be a discrepancy in definitions of some variables, careful interpretation of differences and similarities in the results is required. The next subsections introduce the (presumed) exogenous explanatory variables, which are categorized in demographic, socio-economic and health-related factors.

3.2.1 Demographic variables

Firstly, the demographic factors included in the analysis are introduced. A dummy variable for gender serves as an explanatory variable. The dummy equals one if the respondent is a female and zero otherwise. Among others, Owens (2008) shows that women utilize health care services more frequently than men and argues that this might be a consequence of menopausal symptoms and effects such as an increased risk of breast cancer. In addition, several authors such as Verbrugge et al. (1987) argue that women are more cautious than men and are thus more likely to seek help in case of illness or as a means of prevention. Consequently, females are expected to visit the physician more often than males.

Furthermore, a dummy variable for marital status is incorporated in the analysis. The dummy variable is equal to one if the respondent is married and zero otherwise. Many studies find a link between marital status and health care utilization. Shen (2013) for example, argues that being married increases the likelihood of health care utilization by two to three percentage points. This finding is supported by Hidayat and Pokhrel (2009). By using GMM estimation, they show that being married increases the number of public outpatient visits by five percent on average. Furthermore, Pohlmeier and Ulrich (1995) provide evidence that non-married individuals are more reluctant to visit a physician for the first time. At the same time, they show that no significant differences between married and non-married individuals are present after the initial physician visit. A possible explanation for these findings could be that married couples might be concerned about their partner’s health and stimulate each other to seek medical care if deemed necessary. It is therefore hypothesized that being married positively affects the physician visit intensity.

In addition, a dummy variable indicating the race of an individual is created. More specifically, the dummy variable equals one if the respondent is white, whereas it is zero if the respondent is non-white. There is a rich literature on differences in health status and health use between whites and non-whites. Hummer (1996) for instance argues that non-whites are generally in poorer health than their white counterparts. This statement is confirmed by Cornelius (1993), who shows that black and Hispanic children are more likely to be in fair or poor health than white children. These results suggest that the necessity of health care for non-whites is larger than for whites. However, white individuals use more ambulatory care and mental health care than non-whites and Hispanics (Cornelius, 1993). This utilization difference might be a consequence of cultural and/or behavioural differences as well as a difference in access to medical care between whites and non-whites. Even though whites are generally healthier, it is expected that belonging to the white race positively affects the number of physician visits.

Besides the race dummy, another dummy variable is constructed for ethnicity. This variable equals one if an individual is of Hispanic ethnicity. It is included separately because Hispanics may be of any race, i.e., individuals can be Hispanic alongside any racial category. In line with the expectation for non-white respondents, it is expected that belonging to the Hispanic ethnicity negatively affects the number of physician visits.

Finally, to control for regional differences, dummy variables for residential regions are created. In particular, the regions are Northeast, Midwest, South and West, where the Northeast region serves as the reference category. The Northeast region generally has a higher wealth level than the other regions (Gottschalck et al., 2013) and as a result, individuals living in the Northeast have a larger budget for health care. Therefore, the hypothesis in respect of residential regions is that the effects of living in the Midwest, South and West on the number of physician visits are negative compared to living in the Northeast.


3.2.2 Socio-economic variables

Secondly, several socio-economic variables are considered. Educational attainment is considered first. Three dummy variables are created: less than or equal to 12th grade with no high school/GED diploma, high school/GED diploma or college without a four year degree and finally, Bachelor’s or Master’s degree or higher. The dummy for less than or equal to 12th grade serves as the reference category. Coburn and Pope (1974) state that well-educated individuals belong to social networks that encourage preventive medical care and are therefore more likely to utilize health services. Ross and Wu (1995) argue however that highly educated individuals are more healthy than less educated individuals due to their benefits in economic positions and healthier life styles. According to this argument, well-educated individuals require less medical attention and consequently have a lower number of physician visits. Therefore, the effect of education on health care utilization is ambiguous.

Subsequently, total family income divided by the number of household members is included in the analysis (divided by 1000 for scaling purposes). Defined in this way, the variable accounts for family size. Hidayat and Pokhrel (2009) demonstrate that both GMM on the number of public outpatient visits and binary logit on the decision to utilize health care yield significant positive estimated income coefficients. With respect to health care utilization, these findings are consistent with those of Gerdtham (1997), who shows that income has a significant positive effect on the decision regarding the first physician visit. In contrast, somewhat counter-intuitive results are found by Shen (2013) and Riphahn et al. (2003), who find no evidence for a significant effect of income on health care utilization and the number of doctor visits, respectively. However, as a higher income implies more purchasing power in general and for health care in particular, it is hypothesized that income positively affects the physician visit intensity.

The final socio-economic variable indicates whether an individual has a physically demanding job. More specifically, a dummy variable is constructed that equals one if the respondent is employed in farming, fishing, forestry, construction, extraction, maintenance, production, transportation, material moving or military specific occupations. Heavy physically demanding work increases, among other things, musculoskeletal complaints (de Zwart et al., 1997). Physically demanding jobs are therefore generally associated with more injuries and health problems, which consequently result in a higher number of physician visits.

3.2.3 Health-related variables

Lastly, variables related to health characteristics are taken into consideration. One of these variables is the age of the respondents. According to the National Bureau of Economic Research (2017), health conditions deteriorate with age. For this reason, it is hypothesized that older individuals visit physicians more often and that the magnitude of this relation increases with age. Riphahn et al. (2003) provide empirical evidence for the latter. By using a bivariate random effects estimator in a count data setting, they show that the quadratic effect of age significantly positively affects the number of physician visits. In accordance with this result, Pohlmeier and Ulrich (1995) find a convex relationship between age and the initial physician visit by using a Negative Binomial hurdle model. To allow for such a non-linear relationship, both age and its square are included in the analysis.

To control for the physical health status of respondents, several comorbidities are considered. A variable is constructed that counts how many of the following health problems the respondent suffers from: high blood pressure, stroke, emphysema, high cholesterol, cancer, arthritis, asthma, diabetes and heart diseases. Intuitively, the higher the number of comorbidities an individual suffers from, the larger the physician visit intensity is. This reasoning is supported by the findings of Shen (2013).

To take into account the mental health status of respondents, a dummy variable that serves as an indicator for the presence of psychological distress is incorporated in the analysis. The construction of this variable is based on the Kessler Index (K6) (Kessler et al., 2003) of non-specific psychological distress during the past 30 days. The Kessler Index can take values between 0 and 24, whereby a higher score indicates a greater tendency towards mental disability. According to Kessler et al. (2003, p. 188), the optimal cut-point between being mentally healthy and experiencing psychological distress is a score of 13. Therefore, the mental dummy equals one if the value is larger than 12 and zero otherwise. Similarly as for physical health, it is expected that experiencing psychological distress leads to an increase in the number of physician visits. Again, this hypothesis is in accordance with the empirical results of Shen (2013).

Moreover, the perceived health status of a respondent is incorporated in the analysis. The respondents were asked to rate the health status of each family member according to the following categories: excellent, very good, good, fair, and poor. Based on these categories, a discrete variable that takes values between 1 and 5 is constructed (where values 1 and 5 indicate excellent and poor perceived health, respectively). Miilunpalo et al. (1997) find an inverse relation between perceived health status and the number of physician contacts. Similar results are expected for this study.

Furthermore, body mass index (BMI) is taken into consideration. An individual is considered obese if his or her body mass index exceeds 30 (Centers for Disease Control and Prevention, 2007). Obesity causes several health issues such as cardiovascular diseases and diabetes (National Institutes of Health and others, 1998). Consequently, the necessity of health care for obese individuals is most probably larger than for non-obese individuals, resulting in a larger number of physician visits. A dummy variable that equals one if the BMI of the respondent exceeds 30 is therefore included in the analysis.

The final health-related dummy variable indicates whether a respondent is a current smoker: the variable equals one if the respondent is a smoker and is zero otherwise. As smoking can cause a broad range of diseases, including lung cancer (US Department of Health and Human Services and others, 2014), smokers are expected to suffer from a poorer health than non-smokers. Smokers therefore probably require more health care, resulting in an enlarged number of physician visits.

3.3 Descriptive statistics

The previous two sections elaborated on defining the sample and variables. Next, some descriptive statistics are discussed in order to gain insights into the characteristics of the data. The number of observations and corresponding shares in the dataset (in percentage of the total number of observations) for the dependent and explanatory variables are presented in Table 3.1. The variables other than the dummy variables are split into categories to enhance insight into their distribution. It is to be noted that this transformation is not applied when estimating the model: the variables remain as defined in Section 3.2. Some noteworthy results are discussed.

Table 3.1: Descriptive statistics of the data

| Variable | Category | N | % |
|---|---|---|---|
| All | | 7609 | 100% |
| Number of physician visits | 0 | 3144 | 41.3% |
| | 1–2 | 2237 | 29.4% |
| | 3–4 | 935 | 12.3% |
| | ≥ 5 | 1293 | 17.0% |
| Insurance coverage | Public insurance | 966 | 12.7% |
| | Private insurance | 5316 | 69.9% |
| | No insurance | 1327 | 17.4% |
| Age | 18–30 | 1543 | 20.3% |
| | 30–49 | 3490 | 45.9% |
| | ≥ 50 | 2576 | 33.8% |
| Gender | Female | 3790 | 49.8% |
| | Male | 3819 | 50.2% |
| Race | White | 5161 | 67.8% |
| | Non-white | 2448 | 32.2% |
| Ethnicity | Hispanic | 2081 | 27.3% |
| | Other | 5528 | 72.7% |
| Region | Northeast | 1163 | 15.3% |
| | Midwest | 1443 | 19.0% |
| | South | 2855 | 37.5% |
| | West | 2148 | 28.2% |
| Marital status | Married | 3553 | 46.7% |
| | Other | 4076 | 53.3% |
| Education | 0–12th grade | 1037 | 13.6% |
| | High school/GED degree | 4382 | 57.6% |
| | Bachelor’s/Master’s degree | 2190 | 28.8% |
| H.h income/h.h members | < $20.000 | 3546 | 46.6% |
| | $20.000–$30.000 | 1286 | 16.9% |
| | $30.000–$50.000 | 1488 | 19.6% |
| | ≥ $50.000 | 1289 | 16.9% |
| Work industry | Physically demanding job | 1784 | 23.4% |
| | Other | 5825 | 76.6% |
| Number of comorbidities | 0 | 3628 | 47.7% |
| | 1 | 1900 | 25.0% |
| | ≥ 2 | 2081 | 27.3% |
| Presence of mental illnesses | Yes | 165 | 2.2% |
| | No | 7444 | 97.8% |
| Perceived health status | Excellent | 1952 | 25.6% |
| | Very good | 2777 | 36.5% |
| | Good | 2182 | 28.7% |
| | Fair | 607 | 8.0% |
| | Poor | 91 | 1.2% |
| Obesity | Yes | 2421 | 31.8% |
| | No | 5188 | 68.2% |
| Current smoker | Yes | 1137 | 14.9% |
| | No | 6472 | 85.1% |

Number of observations (N) and shares (%) of the relevant variables.

A large number of respondents did not utilize health care in 2014: 3,144 (41.3%) of the 7,609 respondents did not visit the physician at all throughout this year. Although 83% of the respondents visited the physician at most 4 times, the distribution of physician visits has a long right tail with a maximum value of 75. A histogram of the number of physician visits is provided in Appendix II. Regarding insurance, the majority of the sample (69.9%) has private insurance coverage, while 12.7% is covered by public insurance. In addition, most respondents are white (67.8%) and non-Hispanic (72.7%). Around half of the sample is married and 86.4% of the respondents attained at least a high school or GED degree. Moreover, the income per household member was less than $20.000 for 3,546 respondents (46.6%) and more than $50.000 for 16.9% of the sample. When it comes to industry occupation, a little less than 25% of the respondents have a physically demanding job. Notably, around a quarter of the sample suffers from 2 or more of the listed comorbidities. In contrast, only small fractions of the respondents are psychologically distressed (2.2%) or have a poor perceived health status (1.2%). Meanwhile, slightly under two-thirds of the respondents reported an excellent or very good health status. Finally, 31.8% of the respondents suffer from obesity and 14.9% of the respondents are smokers.

Table 3.2 shows the relative share of health care insurance types in general and for no visits, one and more than one physician visits in particular. The share of individuals with private insurance increases from 59.00% for no physician visits to 78.87% for more than one physician visit, which is a relative increase of 33.68%. A similar pattern is visible for the publicly insured: the increase in their share from zero to more than one visit is 39.27%. In contrast, the share of individuals without insurance shows a relative decrease of 79.1% from no physician visits to more than one physician visit. Privately and publicly insured individuals thus visit the physician relatively more often than individuals without insurance. These descriptive statistics should, however, be interpreted with caution since they are not related to explanatory variables.

Table 3.2: Share of health care insurance types per number of physician visits

| | Average | No visits | One visit | More than one visit |
|---|---|---|---|---|
| Private insurance | 69.89% | 59.00% | 74.43% | 78.87% |
| Public insurance | 12.70% | 10.62% | 12.71% | 14.79% |
| No insurance | 17.44% | 30.38% | 12.85% | 6.35% |

Share of health care insurance types per number of physician visits (no visits, one visit or more than one visit) in 2014.
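The relative changes quoted in the discussion of Table 3.2 can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the function and variable names are hypothetical, and the shares are copied from the "No visits" and "More than one visit" columns of the table.

```python
# Shares from Table 3.2, in percent: (no visits, more than one visit).
shares = {
    "Private insurance": (59.00, 78.87),
    "Public insurance": (10.62, 14.79),
    "No insurance": (30.38, 6.35),
}


def relative_change(start, end):
    """Relative change from start to end, expressed in percent."""
    return 100.0 * (end - start) / start


changes = {name: relative_change(a, b) for name, (a, b) in shares.items()}

for name, change in changes.items():
    print(f"{name}: {change:+.2f}%")
```

The computation confirms that the shares of the privately and publicly insured rise with the number of visits, whereas the share of the uninsured falls sharply.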

This chapter introduced the MEPS 2014 data set, constructed explanatory variables and reviewed some descriptive statistics. Although the data set is fairly rich, a limitation is that no information on deductibles and co-payments is available. It is therefore not possible to distinguish between different insurance coverage contracts, which should be kept in mind when interpreting the results of this study. However, the inclusion of deductibles and co-payments would introduce additional endogeneity and as a result would substantially complicate the model discussed in Chapter 4. The next chapter discusses several count data models and presents the empirical methodology.


Chapter 4

Model and methodology

As outlined in Chapter 2, models confined to either endogenous treatment or endogenous participation have been a subject of extensive research in health economics. The two problems are, however, less often taken into consideration jointly. This study focuses on investigating the effect of health insurance coverage on the number of physician visits, while taking into account both the possible endogeneity of the treatment (the private health insurance coverage) and endogenous participation. For this purpose, the econometric techniques in this study are predominantly based on an estimation method proposed by Bratti and Miranda (2011). Prior to elaborating in Section 4.2 on the estimation method applied in this study, Section 4.1 discusses possible estimation techniques for count data.

4.1 Count data models

The dependent variable of interest is the number of physician visits, which is a discrete outcome variable that can take numerous non-negative values. As a consequence of the fact that no upper-limit on the number of physician visits is present, the dependent variable is regarded as a count variable. Cameron and Trivedi (2005, p. 665) claim that a count data model is a suitable type of model for investigating the relation between the number of physician visits, health insurance coverage and other relevant explanatory variables. Several count data models, assuming that the dependent variable is generated by a discrete probability function, exist. A few basic models are, in conjunction with their advantages and limitations, discussed in this section.

Firstly, a natural assumption is that the count, the number of physician visits $y_i$, follows a Poisson distribution (Cameron and Trivedi, 2005, p. 666). The Poisson probability mass function is defined as:

$$\Pr[Y = y] = \frac{e^{-\lambda}\lambda^{y}}{y!}, \qquad y = 0, 1, 2, \ldots, \tag{4.1}$$

where λ is the intensity parameter. A characteristic of the Poisson distribution is its assumption of equality of mean and variance of the distribution, which is called equidispersion. More specifically, the Poisson distribution implies that E[Y ] = V[Y ] = λ. To allow the dependent variable to be related to explanatory variables, and to ensure that expected counts are non-negative, an exponential mean parametrization is often utilized (Hidayat and Pokhrel, 2009):

$$\lambda_i = \exp(x_i'\beta), \qquad i = 1, \ldots, N, \tag{4.2}$$

where $x_i$ is a vector of explanatory variables, $\beta$ are parameters to be estimated and $N$ denotes the total number of observations. Under the assumption that the observations are independent, the parameters are usually estimated by maximum likelihood. The subsequent paragraphs discuss some advantages and limitations of the Poisson distribution.
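As an illustration of equations (4.1) and (4.2), the following minimal sketch evaluates the Poisson probability mass function for an intensity obtained from the exponential mean parametrization and checks equidispersion numerically. The regressor vector and coefficient values are hypothetical and serve only to make the example runnable.

```python
import math


def poisson_pmf(y, lam):
    """Pr[Y = y] = exp(-lam) * lam**y / y!, computed on the log scale."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


def intensity(x, beta):
    """Exponential mean lam_i = exp(x_i' beta), which is always positive."""
    return math.exp(sum(xj * bj for xj, bj in zip(x, beta)))


# Hypothetical regressor vector (constant and one dummy) and coefficients.
x_i = [1.0, 1.0]
beta = [0.5, 0.3]
lam_i = intensity(x_i, beta)

# Mean and variance computed from the pmf; the sum is cut off at 200 terms,
# which lies far in the right tail for an intensity of this size.
support = range(200)
mean = sum(y * poisson_pmf(y, lam_i) for y in support)
var = sum((y - mean) ** 2 * poisson_pmf(y, lam_i) for y in support)
print(lam_i, mean, var)
```

Both the computed mean and the computed variance coincide with the intensity, which is exactly the equidispersion property discussed above.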

The main advantage of using the Poisson distribution is its quasi-maximum likelihood (QML) property. This property implies that, regardless of misspecification of the distribution of the error term or unobserved heterogeneity, the expectation of the count variable can be consistently estimated. More specifically, the maximum likelihood estimator is consistent if the mean is correctly specified (i.e., $E[y_i|x_i] = \exp(x_i'\beta)$) and if the true distribution of the count belongs to the linear exponential family of distributions (Cameron and Trivedi, 2005, p. 147). However, because the information matrix equality no longer holds in case of distributional misspecification, sandwich standard errors are required if the QML property is employed. Furthermore, it is to be noted that the QML property does not imply that estimated probabilities of the number of events are consistent. In order for these probabilities to be consistent, the count truly has to be Poisson distributed (Cameron and Trivedi, 2005, p. 670). An additional advantage of the Poisson count model is the global concavity of its log-likelihood function. This property ensures that iteratively solving the first order conditions for maximum likelihood leads to unique parameter estimates (Cameron and Trivedi, 2005, p. 668).

Despite these advantages, the Poisson count model is criticized for a number of reasons. Firstly, Cameron et al. (1988), among others, claim that the Poisson model is restrictive due to the underlying assumption that events are independent over time. One health condition might, however, lead to multiple uses of health services, i.e., one doctor visit might increase the likelihood of subsequent doctor visits. This is perhaps less of an issue for the number of physician visits: physicians often refer individuals suffering from a serious health condition to specialists. Consequently, physician visits are more or less independent. Moreover, based on the study of Kemp (1967), Cameron et al. (1988) claim that the use of the negative binomial distribution (discussed below) is justified in their study of dependence between health insurance and utilization as it can provide a proper fit to spell-generated data.

Although the Poisson model captures the general feature of count data that the variance increases with the mean, another essential disadvantage of the Poisson model is that it requires equality of mean and variance. For count data, however, the variance often exceeds the mean (Cameron and Trivedi, 2005, p. 670) and the variation around the predicted values is larger than would be consistent with the Poisson distribution. This so-called overdispersion may be a consequence of unobserved heterogeneity or of different data generating processes for the first event and later events. Neglecting overdispersion results in inefficient estimates (Cameron et al., 1988); parameter estimates nevertheless remain consistent due to the QML property, provided that the conditional mean is correctly specified. The sample mean of the number of physician visits in this study is 2.38, whereas the sample variance equals 18.06. Since these simple descriptive statistics are not related to regressors (i.e., are not based on a model), they do not provide evidence for the presence of overdispersion. Yet, they suggest that imposing equidispersion might be too restrictive in this study.

A final limitation of the Poisson model is the excess zeros problem: Poisson regression often underpredicts the proportion of zeros in the data (Cameron and Trivedi, 2005, p. 670). The remainder of this section briefly discusses three alternative count data models that accommodate for overdispersion and excess zeros.
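A rough, purely descriptive check of both limitations can be carried out with the sample moments reported in this section and the share of zeros in Table 3.1. The sketch below assumes an unconditional Poisson distribution whose intensity equals the sample mean. It ignores all regressors, so it is only suggestive and not a formal test; the variable names are hypothetical.

```python
import math

sample_mean = 2.38       # sample mean of the number of physician visits
sample_variance = 18.06  # sample variance of the number of physician visits
observed_zero_share = 3144 / 7609  # respondents without a visit (Table 3.1)

# Under a Poisson distribution with intensity equal to the sample mean,
# the variance would equal the mean and Pr[Y = 0] = exp(-lambda).
dispersion_ratio = sample_variance / sample_mean
poisson_zero_share = math.exp(-sample_mean)

print(f"variance / mean:       {dispersion_ratio:.2f}")
print(f"observed share of 0s:  {observed_zero_share:.3f}")
print(f"Poisson-implied share: {poisson_zero_share:.3f}")
```

The variance is more than seven times the mean, and the observed share of zeros (about 41%) is several times larger than the share implied by such a Poisson distribution (about 9%).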

Firstly, the Negative Binomial model assumes that, conditional on the intensity parameter µ, counts are generated by a Poisson process. In contrast to the Poisson model where the parameter µ is assumed to be deterministic, the Negative Binomial model allows µ to be random (Hausman et al., 1984). More specifically, µ = λη, where λ is a deterministic function of the covariates and η > 0 represents unobserved heterogeneity and is considered to be a gamma distributed random error. The Negative Binomial model can thus be regarded as a Poisson count with unobserved heterogeneity. An advantage of the Negative Binomial model over the Poisson model is that the former exhibits overdispersion whereas the latter requires equidispersion. In this respect, Negative Binomial models are thus less restrictive than Poisson models. However, it should be remarked that there is no statistical justification for the gamma distribution of the unobserved heterogeneity η (Riphahn et al., 2003) and some arbitrariness is therefore involved.

A second alternative to the general Poisson model is the hurdle model that has been briefly discussed in Section 2.3. Recall that the hurdle model describes a two-stage decision making process: a binary logit or probit model is specified for the decision to utilize health care, whereas a truncated count data model (such as Poisson or Negative Binomial) is applied to the positive counts (Winkelmann, 2004). The hurdle model thus allows the zeros and positive values to be generated by different data processes (Cameron and Trivedi, 2005, pp. 680-681). Therefore, this model has the advantage over the Poisson count model that it is able to deal with the excess zeros problem mentioned previously and with endogenous participation. In addition, it relaxes equidispersion, and estimation is still straightforward as the log-likelihood consists of two separate parts. Moreover, it is to be noted that employing a regular Poisson model while the zero and positive values should be allowed to be generated by different data processes yields inconsistent estimates (Cameron and Trivedi, 2005). A drawback of the hurdle model is, however, that the model is highly parametrized: the number of parameters is generally doubled compared to the Poisson model. Furthermore, recall that the hurdle model assumes that the contact and level decision are uncorrelated. See Sections 16.4 and 20.4.5 of Cameron and Trivedi (2005) for a more detailed description of the hurdle model.
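To make explicit why the hurdle log-likelihood separates into two parts, it may help to write it out. The notation below is an assumption made for this illustration only: $f_1(0)$ denotes the probability of a zero count implied by the binary model and $f_2(\cdot)$ the parent (untruncated) count density, both evaluated at the covariates of individual $i$.

```latex
\ln L
  = \underbrace{\sum_{i=1}^{N} \Big[ \mathbf{1}(y_i = 0)\,\ln f_1(0)
      + \mathbf{1}(y_i > 0)\,\ln\big(1 - f_1(0)\big) \Big]}_{\text{binary (contact) part}}
  + \underbrace{\sum_{i:\, y_i > 0} \Big[ \ln f_2(y_i)
      - \ln\big(1 - f_2(0)\big) \Big]}_{\text{truncated count (level) part}}
```

Since the first sum depends only on the parameters of the binary model and the second only on those of the count model, the two parts can be maximized separately. This separation relies on the independence of the contact and level decisions mentioned above.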

Finally, another model that is able to handle an excessive number of zero counts is the zero-inflated model. This model augments a Poisson or Negative Binomial density $f_2(\cdot)$ that generates counts with a binary outcome model such as logit or probit with density $f_1(\cdot)$ (Cameron and Trivedi, 2005). The fundamental idea underlying this model is thus similar to that of the hurdle model. However, the models differ in the sense that the hurdle model only allows the binary process to yield zero outcomes, whereas both processes can generate zeros in the zero-inflated model. The density $g(y)$ of the zero-inflated model can be specified as:

$$g(y) = \begin{cases} f_1(0) + \big(1 - f_1(0)\big) f_2(0) & \text{if } y = 0, \\ \big(1 - f_1(0)\big) f_2(y) & \text{if } y > 0. \end{cases} \tag{4.3}$$

According to Rose et al. (2006), it depends on the application whether the hurdle or zero-inflated model is more convenient. The hurdle model is applied more frequently than the zero-inflated model in the econometric literature (Cameron and Trivedi, 2005).
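The difference between the two models can be illustrated numerically. The following sketch implements the zero-inflated density given above and a hurdle density with a Poisson count part. The parameter values are hypothetical, and the binary part is summarized by a single probability of a zero rather than by a full logit or probit model.

```python
import math


def poisson_pmf(y, lam):
    """Poisson probability mass function, computed on the log scale."""
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))


def zero_inflated_pmf(y, p_zero, lam):
    """g(y) with f1(0) = p_zero and f2 a Poisson(lam) density."""
    base = (1.0 - p_zero) * poisson_pmf(y, lam)
    return p_zero + base if y == 0 else base


def hurdle_pmf(y, p_zero, lam):
    """Hurdle density: zeros stem from the binary part only, positive
    counts from a zero-truncated Poisson(lam)."""
    if y == 0:
        return p_zero
    return (1.0 - p_zero) * poisson_pmf(y, lam) / (1.0 - poisson_pmf(0, lam))


p_zero, lam = 0.3, 2.0  # hypothetical values
zi_total = sum(zero_inflated_pmf(y, p_zero, lam) for y in range(100))
hu_total = sum(hurdle_pmf(y, p_zero, lam) for y in range(100))
print(zero_inflated_pmf(0, p_zero, lam), hurdle_pmf(0, p_zero, lam))
```

For the same binary probability, the zero-inflated model assigns more mass to zero than the hurdle model, because the count process contributes additional zeros.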

This section considered four count data models. The Negative Binomial, hurdle and zero-inflated model have appealing advantages over the Poisson model, yet are not optimal for the purpose of this study. The next section continues with the empirical method.

4.2 Empirical methodology

After discussing some possible models for count data, this section presents the empirical methodology. Recall that the method is predominantly based on a model proposed by Bratti and Miranda (2011), who use maximum simulated likelihood to introduce an estimator for models where a count variable is affected by an endogenous binary treatment and in addition account for either sample selection or endogenous participation. The treatment of interest in this study is whether a respondent is privately insured and the endogenous participation relates to the participation into a physician visit and the intensity of physician visits. A three-equation model is defined, consisting of equations for the treatment, the endogenous participation and the number of physician visits conditional on utilization. In contrast to Bratti and Miranda (2011), this study avoids the use of maximum simulated likelihood and the associated simulation bias by numerically integrating out unobserved heterogeneity.

4.2.1 Endogenous treatment and endogenous participation equations

Firstly, binary outcome models are specified for the endogenous treatment and endogenous participation. Regarding endogenous treatment, a distinction is made between two types of individuals: those who have private insurance (the treated), and those who have not. Let $I_i$ be a dummy variable that equals one if individual $i$ is privately insured and is zero otherwise. Notice that this dummy variable is endogenous if both $I_i$ and the number of physician visits $y_i$ are driven by some unobservable individual characteristics or if the two variables are co-determined.

To allow the participation into a physician visit and the intensity of physician visits to be generated by two different data processes (two-part decision making), a hurdle set-up as discussed in Section 2.3 is employed. This subsection considers the initial (participation) decision and the next subsection discusses the level decision. A participation dummy variable $P_i$ is created for the decision to visit a physician for the first time. In particular, $P_i$ has value one if individual $i$ visited the physician at least once in 2014, and is zero if the individual did not visit the physician at all in 2014. Assuming that individuals are privately insured with probability $p_{1i}$ and visit the physician at least once with probability $p_{2i}$, the dummy variables $I_i$ and $P_i$ are defined as:

$$I_i = \begin{cases} 1 & \text{with probability } p_{1i}, \\ 0 & \text{with probability } 1 - p_{1i}, \end{cases} \qquad P_i = \begin{cases} 1 & \text{with probability } p_{2i}, \\ 0 & \text{with probability } 1 - p_{2i}. \end{cases}$$


In order to form a regression model, the probabilities $p_{1i}$ and $p_{2i}$ are parametrized by using latent (unobserved) variables and as a result vary across individuals (Cameron and Trivedi, 2005, p. 466). More specifically, the binary variables $I_i$ and $P_i$ are observed, while the underlying continuous random variables $I_i^*$ and $P_i^*$ are not. The binary variables are assumed to be driven by the latent variables: they take values 0 and 1 depending on whether the latent variables cross a zero-threshold. The continuous latent variable models and the relation between the latent and binary variables are summarized as follows:

$$I^* = \alpha_1 + z'\gamma + \nu, \qquad I = \mathbf{1}(I^* > 0), \tag{4.4}$$

$$P^* = \alpha_2 + r'\theta + \varphi I + q, \qquad P = \mathbf{1}(P^* > 0), \tag{4.5}$$

where $\alpha_1$ and $\alpha_2$ represent constant terms, $z$ and $r$ denote vectors of explanatory variables with $\gamma$ and $\theta$ the corresponding coefficient vectors, $\varphi$ is the coefficient of the private insurance coverage variable $I$ in the participation equation, $\nu$ and $q$ are error terms and $\mathbf{1}(\cdot)$ is an indicator function.

4.2.2 Count equation

Secondly, a model for the count variable is constructed. Conditional on $P = 1$, the number of physician visits $y$ is assumed to be generated by a zero-truncated Poisson distribution and its conditional probability mass function is specified as

$$G(y|\eta) \equiv \Pr(y|\eta) = \begin{cases} \text{not defined} & \text{if } P = 0, \\ \dfrac{\mu^{y} e^{-\mu}}{y!\,[1 - e^{-\mu}]} & \text{if } P = 1, \end{cases} \tag{4.6}$$

with

$$y = \begin{cases} 0 & \text{if } P = 0, \\ 1, 2, \ldots & \text{if } P = 1. \end{cases} \tag{4.7}$$

Here $\mu$ is the conditional mean of $y$, i.e., $\mu \equiv E[y|x, I, \eta]$, with $\eta$ a random variable representing unobserved individual heterogeneity. The unobserved heterogeneity $\eta$ embodies individual specific differences that are not fully accounted for by the observed explanatory variables (Cameron and Trivedi, 2005, p. 611). As illustrated in the next subsection, the unobserved heterogeneity is used to address the endogeneity of private insurance and endogenous participation. To ensure that expectations of the count are non-negative, the conditional mean of $y$ is defined as

$$\mu = e^{x'\beta + \delta I + \eta}, \tag{4.8}$$

where $x$ denotes a vector of explanatory variables with corresponding coefficient vector $\beta$ and $\delta$ is the coefficient of private insurance in the count equation.
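A minimal sketch of the count equation is given below. It evaluates the zero-truncated Poisson probabilities of equation (4.6) for a given value of the parameter µ from equation (4.8). All covariate and parameter values are hypothetical, and the heterogeneity term is simply fixed at an arbitrary value.

```python
import math


def conditional_mean(x, beta, delta, insured, eta):
    """mu = exp(x'beta + delta * I + eta), as in equation (4.8)."""
    index = sum(a * b for a, b in zip(x, beta)) + delta * insured + eta
    return math.exp(index)


def truncated_poisson_pmf(y, mu):
    """Pr(y | eta) for y = 1, 2, ..., as in equation (4.6)."""
    if y < 1:
        raise ValueError("the zero-truncated Poisson is defined for y >= 1 only")
    log_p = (y * math.log(mu) - mu - math.lgamma(y + 1)
             - math.log1p(-math.exp(-mu)))
    return math.exp(log_p)


# Hypothetical covariates, coefficients and heterogeneity draw.
mu = conditional_mean(x=[1.0, 0.45], beta=[0.2, 0.5], delta=0.3,
                      insured=1, eta=0.1)

total = sum(truncated_poisson_pmf(y, mu) for y in range(1, 200))
mean_visits = sum(y * truncated_poisson_pmf(y, mu) for y in range(1, 200))

# The probabilities sum to one over y >= 1. The mean of the truncated
# distribution equals mu / (1 - exp(-mu)), which exceeds mu, the mean of
# the untruncated parent Poisson distribution.
print(total, mean_visits, mu / (1.0 - math.exp(-mu)))
```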


4.2.3 Three-equation model

Combining the above equations leads to a three-equation model:

$$\begin{cases} I_i^* = \alpha_1 + z_i'\gamma + \nu_i, \\ P_i^* = \alpha_2 + r_i'\theta + \varphi I_i + q_i, \\ y_i \sim \text{Poisson}_{\text{(truncated)}}(\mu_i). \end{cases} \tag{4.9}$$

Again, it is emphasized that the two stages of the decision making process in respect of physician visits are allowed to differ: the contact decision is modelled by a binary outcome model, whereas the level decision conditional on positive usage is modelled by a zero-truncated Poisson model. To take into consideration the endogeneity of private insurance and endogenous participation, the three equations are connected via a common element: the unobserved heterogeneity. In particular, to allow for correlation between the health insurance dummy, the participation dummy and the number of physician visits, ν and q are assumed to be defined as:

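$$\nu = \lambda_1\eta + \zeta, \tag{4.10}$$

$$q = \lambda_2\eta + \xi, \tag{4.11}$$

where $\zeta$ and $\xi$ denote idiosyncratic error terms and $\lambda_1$ and $\lambda_2$ are parameters to be estimated. The resulting correlations are as follows:

$$\rho_{\eta,\nu} = \frac{\mathrm{cov}(\eta, \lambda_1\eta + \zeta)}{\sqrt{\mathrm{Var}(\eta)\,\mathrm{Var}(\lambda_1\eta + \zeta)}} = \frac{\lambda_1\sigma_\eta^2}{\sqrt{\sigma_\eta^2(\lambda_1^2\sigma_\eta^2 + 1)}}, \tag{4.12}$$

$$\rho_{\eta,q} = \frac{\mathrm{cov}(\eta, \lambda_2\eta + \xi)}{\sqrt{\mathrm{Var}(\eta)\,\mathrm{Var}(\lambda_2\eta + \xi)}} = \frac{\lambda_2\sigma_\eta^2}{\sqrt{\sigma_\eta^2(\lambda_2^2\sigma_\eta^2 + 1)}}, \tag{4.13}$$

$$\rho_{\nu,q} = \frac{\mathrm{cov}(\lambda_1\eta + \zeta, \lambda_2\eta + \xi)}{\sqrt{\mathrm{Var}(\lambda_1\eta + \zeta)\,\mathrm{Var}(\lambda_2\eta + \xi)}} = \frac{\lambda_1\lambda_2\sigma_\eta^2}{\sqrt{(\lambda_1^2\sigma_\eta^2 + 1)(\lambda_2^2\sigma_\eta^2 + 1)}}. \tag{4.14}$$

It is to be noted that private insurance coverage and participation are exogenous with respect to the number of physician visits if $\rho_{\eta,\nu} = 0$ ($\lambda_1 = 0$ and/or $\sigma_\eta^2 = 0$) and $\rho_{\eta,q} = 0$ ($\lambda_2 = 0$ and/or $\sigma_\eta^2 = 0$), respectively.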
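The correlations in equations (4.12) to (4.14) are simple functions of the factor loadings and the variance of the unobserved heterogeneity. The sketch below evaluates them for hypothetical parameter values, with the idiosyncratic errors having unit variance as the "+ 1" terms in the formulas imply. The function name is an assumption made for this illustration.

```python
import math


def implied_correlations(lambda1, lambda2, sigma2_eta):
    """Correlations (4.12)-(4.14) for unit-variance idiosyncratic errors."""
    var_nu = lambda1 ** 2 * sigma2_eta + 1.0
    var_q = lambda2 ** 2 * sigma2_eta + 1.0
    sigma_eta = math.sqrt(sigma2_eta)
    # lambda * sigma^2 / sqrt(sigma^2 * var) simplifies to
    # lambda * sigma / sqrt(var), which is also defined for sigma^2 = 0.
    rho_eta_nu = lambda1 * sigma_eta / math.sqrt(var_nu)
    rho_eta_q = lambda2 * sigma_eta / math.sqrt(var_q)
    rho_nu_q = lambda1 * lambda2 * sigma2_eta / math.sqrt(var_nu * var_q)
    return rho_eta_nu, rho_eta_q, rho_nu_q


print(implied_correlations(1.0, 1.0, 1.0))  # both loadings non-zero
print(implied_correlations(0.0, 1.0, 1.0))  # exogenous treatment
print(implied_correlations(1.0, 1.0, 0.0))  # no unobserved heterogeneity
```

Setting a loading or the heterogeneity variance to zero removes the corresponding correlation, which mirrors the exogeneity conditions stated above.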

As discussed in Section 4.1, the main limitations of the Poisson model are its rather restrictive requirement of equal mean and variance and its underprediction of the fraction of zeros in the data. These limitations are overcome by the set-up applied in this study, which justifies the use of the Poisson model (Bratti and Miranda, 2011). Firstly, in the presence of unobserved heterogeneity, i.e. if $\sigma_\eta^2 \neq 0$, the number of physician visits is allowed to display overdispersion. Secondly, apart from allowing the contact and level decision to be generated by different DGPs, the hurdle set-up has the additional advantage that it makes it possible to address the excess zero problem. It should be emphasized that the hurdle set-up applied in this study differs slightly from the usual hurdle model: the usual hurdle model assumes independence between the contact and level processes, whereas the method applied in this study allows these processes to be related.

For estimation purposes, some distributional conditions are required. We refer to Bratti and Miranda (2011, pp. 1093-1094) for these conditions. Furthermore, identification of the parameters is assured by the covariance matrix restrictions and the functional (non-linear) form of the model. The vectors z, r and x are therefore allowed to consist of exactly the same explanatory variables. Nonetheless, some exclusion restrictions are imposed in order to enhance identification and let it not be solely dependent on non-linearities. To determine the restrictions, probit models for private insurance and participation and a general Poisson model for y that include all explanatory variables discussed in Section 3.2 are estimated first. The non-significant explanatory variables are excluded from the relevant equations and the resulting equations are then used to estimate the model outlined in this subsection.

4.2.4 Estimation method

The model is estimated by maximum likelihood and for inference on the parameter estimates, the marginal distribution of y (pertaining to µ) is required (Cameron and Trivedi, 2005). An estimable distribution is therefore acquired by assuming that the unobserved heterogeneity η has a known distribution and integrating η out. Two frequently applied distributions for η are the Gamma and the normal distribution. Let h(y|η) be the density of y given η and f (η) represent the known distribution function of η. Equation (4.15) then illustrates how to obtain the marginal distribution of y in case η is assumed to be normally distributed.

\[
h(y) = \int_{-\infty}^{\infty} h(y \mid \eta)\, f(\eta)\, d\eta. \tag{4.15}
\]

The assumption that η is Gamma distributed leads to the Negative Binomial model. According to Winkelmann (2004), the appeal of this assumption is that it yields an explicit solution for the probability function. In contrast, the assumption of the normal distribution does not lead to a closed-form probability function and therefore complicates estimation. However, several studies find that this assumption yields an improved fit compared to the Gamma distribution (Winkelmann, 2004). Based on this argument, Winkelmann (2004) estimates a hurdle model with normally distributed unobserved heterogeneity in his study of the impact of the 1997 German health care reform on the number of doctor visits. Bratti and Miranda (2011) also assume η to be normally distributed and, following them, this study assumes that $\eta \sim N(0, \sigma_\eta^2)$. The density function φ(η) is then:

\[
\phi(\eta) = \left(2\pi\sigma_\eta^2\right)^{-1/2} \exp\!\left(-\frac{\eta^2}{2\sigma_\eta^2}\right). \tag{4.16}
\]
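The integral in equation (4.15) with the normal density of equation (4.16) can be approximated on a grid. The following is a minimal sketch, assuming an untruncated Poisson density for h(y|η) with mean exp(xb + η) and a composite Simpson rule on a truncated interval; the names `xb`, `n_steps` and `width` are hypothetical, and the integration rule is an illustrative choice, not necessarily the one used in this study.

```python
import math


def normal_pdf(eta, sigma):
    """Density of N(0, sigma^2), as in equation (4.16)."""
    return math.exp(-eta * eta / (2.0 * sigma * sigma)) / math.sqrt(2.0 * math.pi * sigma * sigma)


def poisson_pmf(y, mu):
    """Poisson probability of count y with mean mu, computed on the log scale."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))


def marginal_pmf(y, xb, sigma, n_steps=400, width=8.0):
    """Approximate h(y) = int h(y|eta) phi(eta) d eta by Simpson's rule.

    Assumptions (not from the source): h(y|eta) is Poisson with mean
    exp(xb + eta); the integral is truncated to [-width*sigma, width*sigma];
    n_steps must be even.
    """
    lower = -width * sigma
    step = 2.0 * width * sigma / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        eta = lower + k * step
        # Simpson weights: 1 at the end points, 4 at odd nodes, 2 at even nodes.
        if k == 0 or k == n_steps:
            weight = 1.0
        elif k % 2 == 1:
            weight = 4.0
        else:
            weight = 2.0
        total += weight * poisson_pmf(y, math.exp(xb + eta)) * normal_pdf(eta, sigma)
    return total * step / 3.0
```

Under these assumptions the approximated probabilities should sum to one over y, and their mean should be close to exp(xb + σ²/2), which offers a simple check on the grid settings.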

Moreover, ζ|η and ξ|η are assumed to be independently distributed normal errors with zero mean and unit variance, i.e. equations (4.3) and (4.4) represent probit models with unobserved heterogeneity. Let Pr[I = τ|η] and Pr[P = τ|η] be the conditional probabilities of I = τ and P = τ, τ ∈ {0, 1}, given η. The probabilities of being privately insured and of making at least one physician visit can then be specified as:

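\[
\Pr[I = 1 \mid z, \eta] = \Pr[I^* > 0 \mid z, \eta] = \Pr[\alpha_1 + z'\gamma + \nu > 0 \mid z, \eta] = \Phi(\alpha_1 + z'\gamma + \lambda_1 \eta), \tag{4.17}
\]

\[
\Pr[P = 1 \mid r, \eta] = \Pr[P^* > 0 \mid r, \eta] = \Pr[\alpha_2 + r'\theta + q > 0 \mid r, \eta] = \Phi(\alpha_2 + r'\theta + \lambda_2 \eta), \tag{4.18}
\]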

where Φ(·) denotes the cumulative distribution function of the standard normal distribution. Four situations are possible: (non-treated, non-participant), (treated, non-participant), (non-treated, participant) and (treated, participant). The log-likelihood thus consists of four parts. As the number of physician visits is only observed for the latter two cases, G(y|η) enters the likelihood only in these cases. The log-likelihood can be specified as:

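\[
\begin{aligned}
\ell = \sum_{i=1}^{N} \Bigg\{ & (1 - I_i)(1 - P_i) \log\left[ \int \Pr[I = 0 \mid \eta] \cdot \Pr[P = 0 \mid \eta] \cdot \phi(\eta)\, d\eta \right] \\
& + I_i (1 - P_i) \log\left[ \int \Pr[I = 1 \mid \eta] \cdot \Pr[P = 0 \mid \eta] \cdot \phi(\eta)\, d\eta \right] \\
& + (1 - I_i) P_i \log\left[ \int \Pr[I = 0 \mid \eta] \cdot \Pr[P = 1 \mid \eta] \cdot G(y \mid \eta) \cdot \phi(\eta)\, d\eta \right] \\
& + I_i P_i \log\left[ \int \Pr[I = 1 \mid \eta] \cdot \Pr[P = 1 \mid \eta] \cdot G(y \mid \eta) \cdot \phi(\eta)\, d\eta \right] \Bigg\}.
\end{aligned}
\tag{4.19}
\]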
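The contribution of a single observation to equation (4.19) can be sketched as follows. This is an illustration under stated assumptions, not the estimation code of this study: G(y|η) is taken to be a zero-truncated Poisson density with mean exp(idx_count + η), the integral is approximated by a composite Simpson rule on a truncated interval, and the arguments `idx_ins`, `idx_part` and `idx_count` are hypothetical names for the linear indices of the insurance, participation and count equations.

```python
import math


def std_normal_cdf(v):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))


def normal_pdf(eta, sigma):
    """Density of N(0, sigma^2), as in equation (4.16)."""
    return math.exp(-eta * eta / (2.0 * sigma * sigma)) / math.sqrt(2.0 * math.pi * sigma * sigma)


def truncated_poisson_pmf(y, mu):
    """Zero-truncated Poisson probability for y >= 1 (assumed form of G)."""
    log_p = y * math.log(mu) - mu - math.lgamma(y + 1) - math.log(-math.expm1(-mu))
    return math.exp(log_p)


def log_lik_contribution(insured, participant, visits, idx_ins, idx_part,
                         idx_count, lam1, lam2, sigma, n_steps=400, width=8.0):
    """Log of one observation's integral in equation (4.19).

    insured and participant are 0/1 indicators (I and P); visits is y and is
    only used when participant == 1. n_steps must be even.
    """
    lower = -width * sigma
    step = 2.0 * width * sigma / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        eta = lower + k * step
        if k == 0 or k == n_steps:
            weight = 1.0
        elif k % 2 == 1:
            weight = 4.0
        else:
            weight = 2.0
        # Probit probabilities conditional on eta, equations (4.17) and (4.18).
        p_ins = std_normal_cdf(idx_ins + lam1 * eta)
        p_part = std_normal_cdf(idx_part + lam2 * eta)
        term = p_ins if insured == 1 else 1.0 - p_ins
        term *= p_part if participant == 1 else 1.0 - p_part
        if participant == 1:
            # The count density enters only when visits are observed.
            term *= truncated_poisson_pmf(visits, math.exp(idx_count + eta))
        total += weight * term * normal_pdf(eta, sigma)
    return math.log(total * step / 3.0)
```

Summing such contributions over all observations gives the sample log-likelihood, which could then be passed to a numerical optimizer. Because the four cases together with a proper density for y exhaust all outcomes, the exponentiated contributions should sum to one across all (I, P, y) combinations, which is a useful check on the approximation.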

Because analytical integration provides no closed-form solutions for the integrals in the log-likelihood function above, Bratti and Miranda (2011) argue that these integrals must be evaluated by maximum simulated likelihood in order to estimate the parameters. See Section 12.4 of Cameron and Trivedi (2005) for an explanation of this method. An alternative way to obtain parameter estimates is to integrate η out by numerical integration, which is the approach applied in this study. Recall that numerical integration has the advantage that, as opposed to maximum simulated likelihood, it avoids simulation bias. The log-likelihood function is then maximized with respect to α1, α2, γ, θ, ϕ, β, δ, λ1, λ2 and ση and the estimation


Chapter 5

Results

This chapter presents the estimation results of several types of models. The first model is a simple Poisson model without unobserved heterogeneity, which neglects both the possible endogeneity of private insurance and two-part decision making. The second model (ET-Poisson) addresses the possible endogeneity of private insurance, but does not allow for two different DGPs for the contact and level decisions. In contrast, the third model (EP-Poisson) does take endogenous participation into account, but ignores the possible endogeneity of private insurance. The fourth model (EPET-Poisson) corresponds to the 'full' model described in Section 4.2 and takes into consideration both possible endogeneity of private insurance and endogenous participation. The aim of estimating these various models is to investigate whether endogeneity of private insurance and endogenous participation significantly influence the estimation results, and hence whether addressing these issues is necessary. Notice that the EP-Poisson is nested within the full model, whereas the regular Poisson and ET-Poisson models are not. The log-likelihoods of the first three models are provided in Appendix III.

The estimated coefficients and standard errors of the regular Poisson, ET-, EP-, and EPET-Poisson models are presented in Table 5.1. The columns y and y > 0 represent the estimation results of the (truncated-)Poisson models, whereas the columns Pr(y > 0) represent the estimation results of the probit models for participation. Robust sandwich standard errors are reported for the (truncated-)Poisson models (Cameron and Trivedi, 2005, p. 669). Using robust sandwich standard errors only makes sense if the coefficient estimators remain consistent when sandwich errors are required (Cameron and Trivedi, 2005, p. 274) and yields no benefits for binary outcome models provided that observations are independent (Cameron and Trivedi, 2005, p. 469). Therefore, the usual 'non-robust' standard errors are reported for the probit equations. As the estimated coefficients are not directly interpretable, marginal effects are computed. These marginal effects depend on both the regression coefficients and the regressors (Cameron and Trivedi, 2005, p. 122). As a consequence, the marginal effects reported in Table 5.1 are averaged over all individuals.1
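To illustrate why marginal effects vary across individuals and are therefore averaged, consider the simplest case of a continuous regressor in a model with exponential conditional mean E[y|x] = exp(x'β), for which the marginal effect of regressor j equals β_j exp(x'β). The sketch below is a simplified assumption: it ignores unobserved heterogeneity, truncation and discrete regressors, so it does not reproduce the marginal effects in Table 5.1, and the function and argument names are hypothetical.

```python
import math


def average_marginal_effect(beta, regressors, j):
    """Average over individuals of beta_j * exp(x_i' beta).

    beta: list of coefficients; regressors: list of rows x_i (same length as
    beta); j: index of the continuous regressor of interest.
    """
    effects = []
    for x in regressors:
        conditional_mean = math.exp(sum(b * v for b, v in zip(beta, x)))
        # The effect differs per individual because it scales with the mean.
        effects.append(beta[j] * conditional_mean)
    return sum(effects) / len(effects)
```

For example, `average_marginal_effect([0.1, 0.5], [[1.0, 0.0], [1.0, 2.0]], 1)` averages the individual effects 0.5·exp(0.1) and 0.5·exp(1.1).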

5.1 Unobserved heterogeneity and correlation among error terms

This section discusses the results regarding the unobserved heterogeneity η and the correlation between the error terms of the different equations. $\hat{\sigma}_\eta$ is significantly different from zero at the 1% level for all models that account for unobserved heterogeneity (i.e., for all models except for

1 Bootstrap standard errors for the estimated coefficients as well as the marginal effects and correlations are not employed due to time constraints: bootstrapping a model that is as computer-intensive as the model applied in this study would take too much time.
