Confidence intervals for dummy
percentage effects in loglinear regression
models
Tim Gunneweg
Supervisor: Dr. K. J. (Kees Jan) van Garderen
ABSTRACT
This paper considers confidence intervals of the percentage effect of
dummy variables in semilogarithmic regression models. First, a method
to construct exact confidence intervals in such models is introduced.
Next, this method is tested by comparing it with normal approximated
confidence intervals using Monte Carlo simulation. From the simulation
experiment it is concluded that the normal approximated confidence
intervals are misleading for small samples due to the non-normality of
the finite sample distribution of the percentage estimator. Furthermore,
two adjustments to the new technique that can be used in the presence
of heteroskedasticity are discussed and tested: a method based on FGLS
results and a method based on heteroskedasticity-consistent estimators.
The latter method performs better in terms of coverage probability for
different model parameters, but the former might be improved using a
correction for small samples.
Table of Contents
1. INTRODUCTION
2. THEORETICAL FRAMEWORK
THE MODEL
DIFFERENT ESTIMATORS OF p_j
NORMAL APPROXIMATED CONFIDENCE INTERVALS
3. CONSTRUCTING CONFIDENCE INTERVALS
4. MONTE CARLO SIMULATION
RESULTS
5. HETEROSKEDASTIC ERROR TERMS
MONTE CARLO SIMULATION
RESULTS
6. DISCUSSION
7. REFERENCES
APPENDIX A HYPERGEOMETRIC FUNCTIONS
APPENDIX B MATLAB COMPUTATIONS
MATLAB CODE MONTE CARLO SIMULATION HOMOSCEDASTIC ERROR TERMS
MATLAB CODE MONTE CARLO SIMULATION HETEROSKEDASTIC ERROR TERMS USING GLS
MATLAB CODE MONTE CARLO SIMULATION HETEROSKEDASTIC ERROR TERMS USING HC ESTIMATORS
1. Introduction
In economic literature, log transformations are often used to test the percentage impact of dummy regressors on a dependent variable. For example, Immergluck (2008) studies the effect of different financial regulators on investments of American banks in housing and community development. In the United States, banks are encouraged to invest in housing and community development by the Community Reinvestment Act (CRA). This is a piece of legislation that allows financial regulators to base their approval of certain banking activities partly on the value of a bank's investment in housing and community development. In the United States, there are four different regulators for different types of banks. Immergluck (2008) estimates a linear regression model with the log of a bank's CRA-qualified investments as dependent variable. As independent variables he uses various control variables together with dummies for the regulators. Immergluck (2008) concludes that the effect of two regulators is significant, based on the significance of the OLS estimates of the dummy coefficients. Furthermore, he finds the magnitudes of the three dummies to be 183%, 112% and 82% respectively, with the last one not being significant.
While concluding that the percentage effect is present and making a claim about its size, Immergluck (2008) does not provide confidence intervals of the percentage effect to indicate the precision of the estimated magnitude. Although not necessary for conclusions about the presence of a percentage effect, confidence intervals can be very useful to gain more insight into the spread of an effect. However, in the special case of the percentage effect of a dummy variable, the construction of confidence intervals is not as straightforward as it is for continuous variables, because of the binary nature of a dummy variable.
The characteristics of the percentage effect of dummy variables have been widely studied. Contrary to a continuous variable, the estimated coefficient of a dummy variable in a loglinear model cannot be interpreted as the percentage effect of that variable on the dependent variable. Instead, as van Garderen and Shah (2002) show, the percentage effect of a dummy variable should be estimated using the approximately unbiased Kennedy estimator, which is a function of the OLS estimate of the dummy coefficient and the OLS estimate of its variance. Furthermore, they argue that this Kennedy estimator should be used together with an approximately unbiased estimator of its variance to measure its spread. Nevertheless, they do not elaborate on how this spread should be interpreted exactly. These statistics could be used to construct normal approximations of the confidence intervals of the percentage effect when the distribution of the Kennedy estimator is close to normal. To examine the reliability of these approximations, Giles (2011) formulates an expression for the finite sample density function of the Kennedy estimator and concludes that it is far from normal and that bootstrap methods should therefore be used to estimate confidence intervals. However, the provided density function is incorrect, as van Garderen points out (personal communication, May 9, 2014). Furthermore, it is not clear whether bootstrap methods are optimal under these conditions.
The main goal of this study is to develop and test a method to construct exact confidence intervals for the percentage effect of dummy variables in loglinear models. The resulting intervals are compared with confidence intervals based on the normal approximation of the percentage estimator.
The method to construct confidence intervals is examined in three ways. First, a technique to construct confidence intervals under perfect model
assumptions is derived theoretically. Next, Monte Carlo simulation is used to compare confidence intervals resulting from this technique with confidence intervals based on the normal approximation of the Kennedy estimator. Furthermore, it is demonstrated that these techniques and the normal
approximation technique can lead to different results. Finally, the implications of heteroskedasticity are discussed and two methods to solve this issue are
compared using Monte Carlo simulation.
This paper is organized as follows. In the second section, the theoretical framework is developed by formulating the considered model, explaining different estimators of the percentage effect, and investigating the normal approximation based on the Kennedy estimator. In the third section, a method for constructing confidence intervals is derived and explained. In the fourth section, the first simulation experiment is explained and its most important results are discussed. In the fifth section, the implications of heteroskedasticity are discussed and two methods to solve this issue are compared using Monte Carlo simulation. Finally, in the sixth section, some concluding remarks are made.
2. Theoretical Framework
In the following section, the necessary theoretical framework is established. First, the considered model is specified. Thereafter, two different estimators of the percentage effect are discussed: the minimum variance unbiased estimator and the Kennedy estimator. Finally, the normal approximation of confidence intervals of the Kennedy estimator is provided.
The model
The considered model can be specified as follows:

$$Y = \exp\left\{a + \sum_{i=1}^{q} b_i X_i + \sum_{j=1}^{r} c_j D_j + \varepsilon\right\}$$

where $\exp\{\cdot\}$ is defined as the element wise exponential function, the $X_i$'s are continuous variables, the $D_j$'s are dummy variables and $\varepsilon \sim N(0, \sigma^2 I_n)$. After taking the element wise log at both sides of the model equation, the model becomes linear:

$$\mathrm{Log}\, Y = a + \sum_{i=1}^{q} b_i X_i + \sum_{j=1}^{r} c_j D_j + \varepsilon \qquad (2.1)$$
Since the model is linear after the transformation, the optimal estimators of the coefficients can be obtained using OLS. In the continuous case, the coefficients of the resulting linear model, multiplied by 100, can be interpreted as the percentage effect of the independent variable on the dependent variable. To see why, differentiate both sides of (2.1) with respect to $X_i$ to obtain

$$p_i = 100 \cdot \frac{1}{Y}\frac{\partial Y}{\partial X_i} = 100\,\frac{\partial \ln Y}{\partial X_i} = 100\, b_i$$

In order to simplify notation, from now on $p_i$ is defined as the relative change, without the factor 100. So in this case

$$p_i = \frac{1}{Y}\frac{\partial Y}{\partial X_i} = \frac{\partial \ln Y}{\partial X_i} = b_i.$$
For dummy variables, this does not hold, since a dummy variable $D_j$ is binary and hence no continuous derivative of $Y$ with respect to $D_j$ exists. The percentage change $p_j$ of $Y$, from $Y_0$ to $Y_1$, resulting from the change of $D_j$ from 0 to 1, should be calculated directly using $p_j = (Y_1 - Y_0)/Y_0$ (van Garderen & Shah, 2002). Using (2.1), and holding all other regressors fixed, this leads to

$$p_j = \frac{Y_1 - Y_0}{Y_0} = \frac{Y_1}{Y_0} - 1 = \frac{\exp\{a + \sum_i b_i X_i + c_j \cdot 1 + \varepsilon\}}{\exp\{a + \sum_i b_i X_i + c_j \cdot 0 + \varepsilon\}} - 1 = \exp\{c_j\} - 1$$

so $p_j = \exp\{c_j\} - 1$. (2.2) An estimator of $p_j$ is therefore unbiased only if its expectation equals $\exp\{c_j\} - 1$.
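For example, the value $c_j = 0.2231$ used in the simulations below gives $p_j = e^{0.2231} - 1 \approx 0.25$, a 25% increase in $Y$ when $D_j$ switches from 0 to 1.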
Different estimators of p_j
Since in general $c$ (from now on, the subscript $j$ is dropped for clarity) is unknown, (2.2) cannot be used directly to calculate $p$; instead, $p$ has to be estimated using the OLS estimate $\hat{c}$ of $c$. A simple, but wrong, solution to this problem, often used in the literature (see the references in van Garderen and Shah, 2002), would be to replace $c$ in (2.2) with its OLS estimate $\hat{c}$. However, it is easy to see that this results in a biased estimator of $p$, because

$$E[\exp\{\hat{c}\} - 1 \mid X] > \exp\{E[\hat{c} \mid X]\} - 1 = \exp\{c\} - 1$$

where the inequality sign follows directly from Jensen's inequality and the strict convexity of the exponential mapping.
Goldberger (1968) shows that the expected value of $\exp\{\hat{c}\}$ in fact equals $\exp\{c + \frac{1}{2}V(\hat{c})\}$, with $V(\hat{c})$ the variance of $\hat{c}$. Therefore, Kennedy (1981) argued to use the following estimator of $p$:

$$\hat{p} = \exp\left\{\hat{c} - \tfrac{1}{2}\hat{V}(\hat{c})\right\} - 1 \qquad (2.3)$$

where $\hat{c}$ is the OLS estimate of $c$ and $\hat{V}(\hat{c})$ is the OLS estimate of its variance. In their study, van Garderen and Shah (2002) show that the Kennedy estimator is biased and that the minimum variance unbiased estimator of $p$ equals

$$\tilde{p} = \exp\{\hat{c}\}\; {}_0F_1\!\left(m;\, -\tfrac{1}{2}\, m\, \hat{V}(\hat{c})\right) - 1 \qquad (2.4)$$

where $\hat{c}$ and $\hat{V}(\hat{c})$ are OLS estimates, $m = \frac{n-k}{2}$ with $n$ the number of observations and $k$ the number of regressors, and ${}_0F_1$ is the confluent hypergeometric limit function (for an explanation of the hypergeometric functions used, see Appendix A). They also show that the variance of (2.4) equals

$$V(\tilde{p}) = \exp\{2c\}\left[\exp\{V(\hat{c})\}\; {}_0F_1\!\left(m;\, \tfrac{1}{4}V(\hat{c})^2\right) - 1\right]$$

Furthermore, they prove that the minimum variance unbiased estimator of $V(\tilde{p})$ is

$$\hat{V}(\tilde{p}) = \exp\{2\hat{c}\}\left\{\left[{}_0F_1\!\left(m;\, -\tfrac{1}{2}\, m\, \hat{V}(\hat{c})\right)\right]^2 - {}_0F_1\!\left(m;\, -2\, m\, \hat{V}(\hat{c})\right)\right\}$$

In addition to the minimum variance unbiased estimator of the variance of $\tilde{p}$, van Garderen and Shah (2002) derive the following approximately unbiased estimator of its variance:

$$\hat{V}(\hat{p}) = \exp\{2\hat{c}\}\left[\exp\{-\hat{V}(\hat{c})\} - \exp\{-2\hat{V}(\hat{c})\}\right] \qquad (2.5)$$

In their research, they further show that the unbiased estimates of $p$ are very close to those calculated by the much more convenient Kennedy estimator. This leads them to suggest that, in most applications, Kennedy's estimator should be used together with their approximately unbiased estimator of the variance (2.5) when estimating the percentage impact of a dummy variable.
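As an illustration, the following Matlab sketch computes the Kennedy estimator (2.3) and the approximately unbiased variance (2.5); the values of chat and vhat are hypothetical OLS output, not results from this paper.

chat = 0.2231; % assumed OLS estimate of the dummy coefficient
vhat = 0.01; % assumed OLS estimate of its variance
pken = exp(chat - 1/2*vhat) - 1; % Kennedy estimator (2.3)
vken = exp(2*chat)*(exp(-vhat) - exp(-2*vhat)); % approximately unbiased variance (2.5)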
Normal Approximated Confidence Intervals
Although arguing that (2.3) should be used to measure the size of a percentage effect and that (2.5) should be used to measure its variance, van Garderen and Shah do not explain how these two statistics should be used to construct precise confidence intervals of p. Using only a point estimator and an estimator of its variance, confidence intervals are usually approximated using the normal
distribution. So in this case, approximated equal tailed $1-\alpha$ confidence intervals could be constructed using

$$c_l(x,y) = \hat{p} - \sqrt{\hat{V}(\hat{p})}\; z_{\frac{\alpha}{2}} \qquad \text{and} \qquad c_u(x,y) = \hat{p} + \sqrt{\hat{V}(\hat{p})}\; z_{\frac{\alpha}{2}} \qquad (2.6)$$

where $z_{\frac{\alpha}{2}}$ is the standard normal value with probability $\alpha/2$ to its right. Although asymptotically accurate, (2.6) could be misleading when the distribution of $\hat{p}$ is very different from normal. In order to find a basis for inference about $p$, Giles (2011) derives the finite sample distribution of $\hat{p}$, but his expression apparently contains an error. Van Garderen (personal communication, May 9, 2014) shows that the pdf equals

$$f(\hat{p}) = \frac{v^{v/2}}{2^{\frac{v+2}{4}}\sqrt{\pi}\,(d\sigma^2)^{\frac{v+2}{4}}\,(1+\hat{p})}\, \exp\left\{-\frac{\left(\mathrm{Log}(1+\hat{p}) - c\right)^2}{2d\sigma^2}\right\}\, U\!\left(\frac{v}{4};\, \frac{1}{2};\, \frac{\left(\mathrm{Log}(1+\hat{p}) - c + v\right)^2}{2d\sigma^2}\right) \qquad (2.7)$$

where $U$ is Tricomi's hypergeometric function, $v = n-k$ the degrees of freedom, and $d$ the diagonal element of $(X'X)^{-1}$ corresponding to the dummy variable. Figure 1 shows a plot of this density function with parameter values $\sigma^2 = 0.5$, $v = 10$, $d = 0.5$, and $c = 0.25$. From Figure 1 it is clear that the finite sample distribution of $\hat{p}$ is far from normal, with the density function being positively skewed. Therefore, normal approximated confidence intervals could be misleading, since they are shifted to the left compared to exact confidence intervals. To assess the magnitude of this error, an exact method to find confidence intervals of $p$ is developed in the next section.
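Before turning to the exact method, note that (2.6) takes only a few lines in Matlab, continuing the earlier sketch with its hypothetical inputs pken and vken.

alpha = 0.05; % assumed significance level
zval = norminv(1 - alpha/2); % upper standard normal quantile
ncl = pken - zval*sqrt(vken); % lower limit in (2.6)
ncu = pken + zval*sqrt(vken); % upper limit in (2.6)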
Figure 1: pdf of $\hat{p}$ with $\sigma^2 = 0.5$, $v = 10$, $d = 0.5$, and $c = 0.25$

3. Constructing Confidence Intervals
In this section, a method for the construction of exact confidence intervals is developed. The technique is based on confidence intervals for the coefficients in the transformed model using OLS estimates.
Consider the model as stated in (2.1). To distinguish between the stochastic random sample and an observed sample, first some notation is introduced. $X = (X_1\, X_2 \dots X_q\, D_1\, D_2 \dots D_r)$ and $Y$ represent the stochastic random variables before the sample is observed, and $x$ and $y$ represent observed outcomes of the random sample $X$ and $Y$.
The goal is to find a two-sided $1-\alpha$ confidence interval $I$ of $p_j = \exp\{c_j\} - 1$ based on a random sample with explanatory variables $x$ and dependent variable $y$. In this context, a two-sided $1-\alpha$ confidence interval is defined by Bain and Engelhardt (1992, p. 360) as follows:

Definition 1. An interval $(c_l(x,y),\, c_u(x,y))$ is called a $100(1-\alpha)\%$ confidence interval for $p$ if $P(C_l(X,Y) < p < C_u(X,Y)) = 1 - \alpha$, where $\alpha \in (0,1)$, $x$ and $y$ are observed values of the random sample $X$ and $Y$, and $c_l(x,y)$ and $c_u(x,y)$ are functions of $x$ and $y$.

Note that in Definition 1, the statistics $C_l(X,Y)$ and $C_u(X,Y)$ are stochastic because they are functions of the random variables $X$ and $Y$. On the other hand, $c_l(x,y)$ and $c_u(x,y)$ are observed values of these statistics in the case that the observed value of $X$ is $x$ and the observed value of $Y$ is $y$.
In this context, confidence intervals can be constructed using Theorem 1.
Theorem 1.
Let $\hat{c}$ be the OLS estimate of the coefficient of a dummy variable $D$ in a loglinear regression model as specified in (2.1), and $\hat{V}(\hat{c})$ the OLS estimate of its variance. An equal tailed, two-sided $1-\alpha$ confidence interval of the relative change $p$ in $Y$ due to $D$ changing from 0 to 1 is given by $(c_l(x,y),\, c_u(x,y))$ with

$$c_l(x,y) = \exp\left\{\hat{c} - \sqrt{\hat{V}(\hat{c})}\; t_{\frac{\alpha}{2}, n-k}\right\} - 1, \qquad c_u(x,y) = \exp\left\{\hat{c} + \sqrt{\hat{V}(\hat{c})}\; t_{\frac{\alpha}{2}, n-k}\right\} - 1 \qquad (3.1)$$

where $n$ is the sample size, $k$ is the number of regressors, and $t_{\frac{\alpha}{2}, n-k}$ is the value of the $t$ distribution with $n-k$ degrees of freedom such that the probability to the right of it is $\alpha/2$.
Proof.
Using (3.1) in Definition 1 gives

$$P(C_l(X,Y) < p < C_u(X,Y))$$
$$= P\left(\exp\left\{\hat{c} - \sqrt{\hat{V}(\hat{c})}\, t_{\frac{\alpha}{2}, n-k}\right\} - 1 < p < \exp\left\{\hat{c} + \sqrt{\hat{V}(\hat{c})}\, t_{\frac{\alpha}{2}, n-k}\right\} - 1\right)$$
$$= P\left(\hat{c} - \sqrt{\hat{V}(\hat{c})}\, t_{\frac{\alpha}{2}, n-k} < \mathrm{Log}(p+1) < \hat{c} + \sqrt{\hat{V}(\hat{c})}\, t_{\frac{\alpha}{2}, n-k}\right)$$
$$= P\left(\hat{c} - \sqrt{\hat{V}(\hat{c})}\, t_{\frac{\alpha}{2}, n-k} < c < \hat{c} + \sqrt{\hat{V}(\hat{c})}\, t_{\frac{\alpha}{2}, n-k}\right)$$
$$= P\left(-t_{\frac{\alpha}{2}, n-k} < \frac{\hat{c} - c}{\sqrt{\hat{V}(\hat{c})}} < t_{\frac{\alpha}{2}, n-k}\right) = 1 - \alpha$$

where the substitution of $c$ for $\mathrm{Log}(p+1)$ follows from (2.2) and the last equality follows from the fact that $(\hat{c} - c)/\sqrt{\hat{V}(\hat{c})} \sim t(n-k)$. The equality of the tails follows from the symmetry of the $t$ distribution. This completes the proof.
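A minimal Matlab sketch of Theorem 1, using the same hypothetical estimates chat and vhat as in the earlier sketch and assumed dimensions n = 100 and k = 4:

n = 100; k = 4; alpha = 0.05; % assumed sample dimensions and level
tval = tinv(1 - alpha/2, n - k); % upper t quantile with n-k degrees of freedom
cl = exp(chat - tval*sqrt(vhat)) - 1; % lower limit in (3.1)
cu = exp(chat + tval*sqrt(vhat)) - 1; % upper limit in (3.1)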
4. Monte Carlo Simulation
Monte Carlo simulation is used to examine small sample properties of the exact confidence intervals (3.1) and their normal approximation (2.6). All simulations are done using Matlab; the code can be found in Appendix B. The simulations are based on the model

$$\mathrm{Log}\, Y_i = a + b_1 X_{1i} + b_2 X_{2i} + c D_i + \varepsilon_i, \qquad \varepsilon_i \sim \text{i.i.d. } N(0, \sigma^2).$$

To cancel out the effect of specific values of the regressors, $X_1$ and $X_2$ are regenerated each replication as standard normal variables. The first $\frac{1}{2}n$ values of the variable $D$ are set equal to 1 and the second $\frac{1}{2}n$ values are set equal to zero. The initial values of the model parameters are set to $a = 1$, $b_1 = b_2 = 0.2$, $c = 0.2231$ and $\sigma^2 = 0.25$, so that the percentage effect of $X_1$ and $X_2$ equals 20% and the percentage effect of $D_i$ equals $p = \exp\{c\} - 1 = 25\%$. Using (2.6) and (3.1), equal tailed 95% confidence intervals are calculated for different sample sizes, using 10000 replications for each sample size.
Results
In Table 1, for both methods the average values of the confidence limits are reported together with their coverage probabilities. From Table 1, it is clear that the normal approximated confidence intervals are much narrower than the exact confidence intervals, especially for small sample sizes. Furthermore, the normal approximated intervals are shifted downwards compared to the exact intervals. As a result of these two effects, the coverage probability of the normal approximated intervals falls below the nominal level of 95%. This is especially the case for small sample sizes. As the sample size increases, the normal approximation approaches the exact intervals, and thereby the coverage probability increases. This result is illustrated by Figures 2 and 3, where the confidence limits of the different methods and the coverage probabilities are plotted for different sample sizes. From these results, it is clear that normal approximated confidence intervals can be very misleading when sample sizes are small.
Table 1. Average confidence intervals and coverage probabilities for different sample sizes

         Exact confidence intervals     Normal approximation
n        cl       cu       CP           cl       cu       CP
10       -0.416   2.380    0.950        -0.574   1.069    0.858
20       -0.214   1.118    0.952        -0.314   0.812    0.912
30       -0.134   0.877    0.951        -0.206   0.706    0.926
40       -0.087   0.758    0.949        -0.143   0.642    0.930
50       -0.054   0.687    0.950        -0.100   0.600    0.934
60       -0.029   0.640    0.951        -0.069   0.570    0.939
70       -0.010   0.604    0.950        -0.045   0.545    0.940
80        0.005   0.575    0.949        -0.026   0.525    0.940
90        0.018   0.553    0.952        -0.010   0.510    0.944
100       0.029   0.535    0.950         0.004   0.496    0.943
150       0.067   0.475    0.951         0.050   0.451    0.946
250       0.106   0.420    0.949         0.095   0.406    0.947
500       0.146   0.366    0.949         0.140   0.359    0.948
1000      0.175   0.331    0.950         0.173   0.328    0.949
5000      0.216   0.285    0.950         0.215   0.285    0.950
10000     0.226   0.275    0.949         0.226   0.275    0.949
Figure 2: Confidence intervals of $p$ for different sample sizes (exact limits cl, cu and normal approximated limits Ncl, Ncu)
Figure 3: Coverage probabilities of both confidence intervals for different sample sizes
5. Heteroskedastic Error Terms
Although Theorem 1 gives an exact method to construct confidence intervals, it rests on quite strong assumptions about the data generating process. In particular, the assumption of homoscedastic error terms is often violated in practice. In this section, the implications of heteroskedastic error terms are explored.
First, consider the model:
$$y = \mathrm{Log}\, Y = Xb + \varepsilon,$$

where $X$ contains both dummy and continuous variables, $b$ is the corresponding coefficient vector, and $\varepsilon \sim N(0, \Omega)$ with $\Omega$ a known positive definite diagonal matrix. The model can now be estimated using GLS. The model is transformed by premultiplying both sides of the model equation by $\Omega^{-1/2}$. Consequently:

$$\Omega^{-1/2} y = \Omega^{-1/2} X b + \Omega^{-1/2} \varepsilon,$$

where $\Omega^{-1/2}\varepsilon \sim N(0, I_n)$. In this case, the estimated coefficients are $\hat{b}_{GLS} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$ with variance $\mathrm{Var}(\hat{b}_{GLS}) = (X'\Omega^{-1}X)^{-1}$. Since these estimates result from a linear regression equation, the t-statistic follows a t distribution with $n-k$ degrees of freedom. Therefore, Theorem 1 can be applied using the GLS estimates. This results in the following expressions for the confidence limits:

$$c_l(x,y) = \exp\left\{\hat{c}_{GLS} - \sqrt{\hat{V}(\hat{c}_{GLS})}\; t_{\frac{\alpha}{2}, n-k}\right\} - 1, \qquad c_u(x,y) = \exp\left\{\hat{c}_{GLS} + \sqrt{\hat{V}(\hat{c}_{GLS})}\; t_{\frac{\alpha}{2}, n-k}\right\} - 1 \qquad (5.1)$$
However, in practice the structure of $\Omega$ is unknown, and consequently (5.1) cannot be used directly to construct confidence intervals. In the following section, two methods to solve this issue are discussed. The first method uses an estimate of the matrix $\Omega$ in combination with the results derived above. The second method is based on a heteroskedasticity-consistent estimator of the variance of the dummy coefficient.
The first method uses two-step FGLS results to estimate the covariance matrix $\Omega$, by specifying the following model of $\sigma_i^2$:

$$\sigma_i^2 = \exp\{z_i'\gamma\},$$

where $z_i = (1, z_{2i}, \dots, z_{pi})'$ is a vector of explanatory variables (Heij et al., 2004, p. 337). The exponential transformation is used to guarantee that estimated values of $\sigma_i^2$ are positive. In the first step, OLS is applied in the model $\mathrm{Log}\, Y = Xb + \varepsilon$. If $\hat{b}$ is consistent, the squared residuals $e_i^2$ of this regression are asymptotically unbiased estimates of $\sigma_i^2$. Therefore, in the second step, the following model is estimated to find the values of $\gamma$:

$$\mathrm{Log}\, e_i^2 = z_i'\gamma + \eta_i$$

The coefficients $\gamma_i$ are estimated consistently for $i = 2, \dots, p$, but the constant should be corrected using $\hat{\gamma}_1 - E[\log \chi^2(1)] = \hat{\gamma}_1 + 1.27$, as Heij et al. (2004, p. 337) point out. Finally, $\Omega$ is estimated by $\hat{\sigma}_i^2 = \exp\{z_i'\hat{\gamma}\}$. Using this estimate of $\Omega$, (5.1) can be used to construct confidence intervals, as sketched below. The normal approximation (2.6) can be calculated using the FGLS estimate of the coefficient and the estimated covariance matrix in (2.3) and (2.5).
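A condensed sketch of this two-step procedure, assuming a design matrix x, a log-transformed dependent variable logy, and a variance-model regressor matrix z are already in memory (the full simulation code is given in Appendix B):

bols = x\logy; % step 1: OLS coefficients
e = logy - x*bols; % OLS residuals
gam = z\log(e.^2); % step 2: regress log squared residuals on z
gam(1) = gam(1) + 1.27; % correct the constant: -E[log chi2(1)] = 1.27
w = exp(z*gam); % estimated variances, the diagonal of Omega-hat
bfgls = (x'*(x./w))\(x'*(logy./w)); % FGLS coefficients (X'inv(Omega)X)^(-1)X'inv(Omega)y
Vfgls = inv(x'*(x./w)); % estimated covariance matrix of bfgls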
The second method uses White standard errors with a correction for small samples to estimate the variance of the coefficient estimator. This is a method to obtain heteroskedasticity-consistent estimates of the variance of $\hat{b}$. The biggest advantage of this method is that it can be used when no model of the variance is known. Instead of GLS, OLS is used to estimate the model parameters. Furthermore, the variance of the coefficients is estimated using

$$\widehat{\mathrm{var}}(\hat{b}) = (X'X)^{-1} X'\, \mathrm{diag}\!\left(\frac{e_i^2}{1 - h_{ii}}\right) X\, (X'X)^{-1} \qquad (5.2)$$

where $h_{ii}$ is the $ii$-th element of $H = X(X'X)^{-1}X'$. This estimator of the variance can be used in (5.1) together with the OLS estimate of the dummy coefficient to construct confidence intervals. The normal approximation (2.6) can be calculated using the OLS coefficient together with (5.2) in (2.3) and (2.5).
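A sketch of (5.2) under the same assumptions about x and logy as before; the resulting vardum can be plugged into (5.1) together with the OLS estimate of the dummy coefficient.

bols = x\logy; % OLS coefficients
e = logy - x*bols; % OLS residuals
h = diag(x*((x'*x)\x')); % leverages h_ii of H = X(X'X)^(-1)X'
meat = x'*diag(e.^2./(1 - h))*x; % X' diag(e_i^2/(1-h_ii)) X
Vb = (x'*x)\meat/(x'*x); % White variance estimate (5.2)
vardum = Vb(end,end); % variance of the dummy coefficient (dummy in the last column)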
Monte Carlo simulation
Monte Carlo simulation is used to compare the techniques developed in the previous section. Simulation is based on the same model as before, but now with $\varepsilon_i \sim N(0,\, \sigma_1^2 D_i + \sigma_2^2(1 - D_i))$, so that the variance of $\varepsilon_i$ depends on the value of $D_i$. This specification of the heteroskedasticity is used because it is natural to assume that the variance differs between groups within the sample. Again, to cancel out the effect of specific values of the regressors, $X_1$ and $X_2$ are regenerated each replication as standard normal variables. The first $\frac{1}{2}n$ values of the variable $D$ are set equal to 1 and the second $\frac{1}{2}n$ values are set equal to zero. The initial values of the model parameters are set to $a = 1$, $b_1 = b_2 = 0.2$, $c = 0.2231$, $\sigma_1^2 = 0.25$ and $\sigma_2^2 = 0.64$, so that the percentage effect of $X_1$ and $X_2$ equals 20% and the percentage effect of $D_i$ equals $p = \exp\{c\} - 1 = 25\%$. Using both techniques, equal tailed 95% confidence intervals are calculated for different sample sizes, with 10000 replications for each sample size. In the GLS method, the covariance matrix is estimated using the dummy variable and a constant as regressors, thus $z_i = (1, D_i)'$.
Results
In Table 2, the average confidence intervals and coverage probabilities for both methods and their normal approximations are reported. It is clear that both methods outperform their normal approximations significantly. Again, this effect vanishes as the sample size increases. Furthermore, the coverage probability of the White method is larger than that of the GLS method for all sample sizes, with the difference being smaller for larger samples. This is illustrated by Figure 4, where the coverage probability of both methods is plotted for different sample sizes.
Table 2: Confidence intervals and coverage probabilities for $\sigma_1^2 = 0.25$ and $\sigma_2^2 = 0.64$

      White                     Normal approx. White      GLS                       Normal approx. GLS
n     cl      cu      CP        cl      cu      CP        cl      cu      CP        cl      cu      CP
10    -0.502  4.154   0.935     -0.769  1.288   0.814     -0.459  3.632   0.911     -0.695  1.256   0.795
20    -0.316  1.526   0.944     -0.486  0.961   0.886     -0.298  1.544   0.927     -0.467  0.974   0.874
30    -0.235  1.149   0.948     -0.360  0.834   0.902     -0.222  1.147   0.940     -0.346  0.835   0.903
50    -0.137  0.869   0.946     -0.218  0.709   0.917     -0.132  0.868   0.943     -0.213  0.708   0.921
100   -0.033  0.648   0.947     -0.077  0.579   0.935     -0.030  0.651   0.947     -0.074  0.581   0.936
250    0.062  0.482   0.952      0.044  0.457   0.946      0.062  0.481   0.950      0.043  0.456   0.947
500    0.113  0.408   0.949      0.103  0.396   0.947      0.116  0.412   0.949      0.106  0.400   0.948
Figure 4: Coverage probabilities for $\sigma_1^2 = 0.25$ and $\sigma_2^2 = 0.64$ (White CP and GLS CP for different sample sizes)

Figure 5: Coverage probabilities for $\sigma_1^2 = 0.1$ and $\sigma_2^2 = 0.9$ (White CP and GLS CP for different sample sizes)
To test the effect of the size of the heteroskedasticity, the simulation experiment is repeated with $\sigma_1^2 = 0.1$ and $\sigma_2^2 = 0.9$. The results are summarized in Table 3. Again, the coverage probability of the White confidence intervals exceeds that of the GLS confidence intervals. Furthermore, the normal approximations are unreliable in terms of coverage probabilities. These results indicate that the White confidence intervals perform better for different magnitudes of heteroskedasticity.
Table 3: Coverage probabilities for $\sigma_1^2 = 0.1$ and $\sigma_2^2 = 0.9$

      Exact White               Normal approx. White      Exact GLS                 Normal approx. GLS
n     cl      cu      CP        cl      cu      CP        cl      cu      CP        cl      cu      CP
10    -0.515  4.995   0.928     -0.803  1.309   0.795     -0.468  3.774   0.902     -0.713  1.259   0.776
20    -0.327  1.687   0.937     -0.519  1.020   0.879     -0.308  1.623   0.922     -0.488  0.993   0.871
30    -0.244  1.232   0.943     -0.383  0.875   0.906     -0.230  1.216   0.932     -0.365  0.866   0.899
50    -0.150  0.922   0.947     -0.241  0.740   0.925     -0.141  0.920   0.937     -0.230  0.739   0.917
100   -0.047  0.677   0.947     -0.096  0.598   0.937     -0.045  0.673   0.943     -0.094  0.595   0.930
250    0.053  0.498   0.949      0.031  0.469   0.942      0.052  0.494   0.946      0.031  0.466   0.941
500    0.106  0.418   0.950      0.095  0.404   0.948      0.107  0.419   0.949      0.096  0.405   0.948

6. Discussion
Loglinear regression models are often used to model percentage effects in economic relations. For continuous variables, interpretation of the estimated coefficients follows from differentiation of the model with respect to the corresponding variable. Due to their binary character, the interpretation of dummy variables is not as straightforward, since no continuous derivative with respect to a dummy exists. In recent studies, unbiased and approximately unbiased estimators of the percentage effect of dummy variables have been developed and tested, together with unbiased and approximately unbiased estimators of their variance. In their research, van Garderen and Shah (2002) argue that the estimator provided by Kennedy (1981) can be used safely to estimate the size of a percentage effect. Furthermore, they derive a convenient approximately unbiased estimator of its variance. Although providing point estimates and measures of spread, none of the recent studies gives an exact method for the construction of confidence intervals of the percentage effect of dummy variables.
In this paper, an exact method to construct confidence intervals under perfect model assumptions is developed. Furthermore, two possible adjustments
that can be made in the case of heteroskedastic error terms are discussed: a method based on heteroskedasticity-consistent (HC) estimates of the variance of the coefficient and a method based on two-step FGLS. Using Monte Carlo simulation, all methods are tested together with normal approximations based on Kennedy's (1981) estimator of the percentage effect and van Garderen and Shah's (2002) approximately unbiased estimator of its variance.
From the simulation experiment, it is clear that small sample confidence intervals based on the normal approximation can be misleading for two reasons: they are shifted to the left and they are much narrower than the exact intervals. Therefore, under classic model assumptions, the exact method should be preferred over the normal approximation when sample sizes are small.
In the case of heteroskedastic error terms, the method based on heteroskedasticity-consistent estimates of the variance of the dummy coefficient outperforms the method based on FGLS results in terms of coverage probability for all sample sizes and different magnitudes of heteroskedasticity. These results are counterintuitive, since the FGLS method uses more information about the data generating process. However, one should be cautious drawing conclusions from these seemingly strong results for the following reason. In the first step of the FGLS estimation, the model of the variance is estimated by replacing $\sigma_i^2$ by $e_i^2$, because $e_i^2$ is an asymptotically unbiased estimate of $\sigma_i^2$. Nevertheless, it is known that under classic model assumptions $E[e_i^2] = \sigma^2(1 - h_{ii})$, where $h_{ii}$ is the $i$-th diagonal element of $H = X(X'X)^{-1}X'$. As a result, a small sample correction factor might be needed when estimating the value of $\sigma_i^2$. Therefore, more research is needed before conclusions about the best response to heteroskedastic error terms can be drawn.
7. References
Bain, L. J., & Engelhardt, M. (1992). Introduction to probability and mathematical statistics (2nd ed.). Belmont, CA: Duxbury Press.
van Garderen, K. J., & Shah, C. (2002). Exact interpretation of dummy variables in semilogarithmic equations. The Econometrics Journal, 5(1), 149-159.
Giles, D. E. (2011). Interpreting dummy variables in semi-logarithmic regression models: Exact distributional results (Working Paper No. 1101). Department of Economics, University of Victoria.
Heij, C., de Boer, P., Franses, P. H., Kloek, T., & van Dijk, H. K. (2004). Econometric methods with applications in business and economics. Oxford: Oxford University Press.
Immergluck, D. (2008). Out of the goodness of their hearts? Regulatory and regional impacts on bank investment in housing and community development in the United States. Journal of Urban Affairs, 30(1), 1-20.
Kennedy, P. E. (1981). Estimation with correctly interpreted dummy variables in semilogarithmic equations. American Economic Review, 71(4), 801.
Appendix A Hypergeometric functions
The hypergeometric series ${}_pF_q(a_1, \dots, a_p;\, b_1, \dots, b_q;\, z)$ is defined as

$${}_pF_q(a_1, \dots, a_p;\, b_1, \dots, b_q;\, z) = \sum_{n=0}^{\infty} \frac{(a_1)_n \cdots (a_p)_n}{(b_1)_n \cdots (b_q)_n}\, \frac{z^n}{n!},$$

where $(x)_i$ represents the Pochhammer symbol, defined as

$$(x)_0 = 1 \quad \text{and} \quad (x)_i = x(x+1)\cdots(x+i-1).$$

Tricomi's hypergeometric function $U$, used in (2.7), can be defined in terms of hypergeometric series by

$$U(a, b, z) = \frac{\pi}{\sin(\pi b)} \left[ \frac{{}_1F_1(a;\, b;\, z)}{\Gamma(a - b + 1)\, \Gamma(b)} - z^{1-b}\, \frac{{}_1F_1(a - b + 1;\, 2 - b;\, z)}{\Gamma(a)\, \Gamma(2 - b)} \right]$$
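The series above can be evaluated numerically by direct summation. The following Matlab sketch truncates the sum after nmax terms (an assumption; this suffices for the moderate arguments encountered in this paper). For example, pFq([], m, -1/2*m*vhat, 100) evaluates the function ${}_0F_1(m;\, -\frac{1}{2}m\hat{V}(\hat{c}))$ used in (2.4).

function f = pFq(a, b, z, nmax)
% Evaluate the hypergeometric series pFq(a1,...,ap; b1,...,bq; z) by direct
% summation, truncated after nmax terms. a and b are (possibly empty)
% vectors of upper and lower parameters.
term = 1; f = 1; % the n = 0 term of the series
for n = 0:nmax-1
    term = term*prod(a + n)/prod(b + n)*z/(n + 1); % ratio of consecutive terms
    f = f + term;
end
end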
Appendix B Matlab computations
Matlab code Monte Carlo simulation homoscedastic error terms

tic % start time measure
% Set initial values for main parameters:
nobs = 10; nvar = 4; nreps = 10000; alpha = 0.05; p = 0.25;
sigma2 = 1/2; % error standard deviation, so that sigma^2 = 0.25
% Set initial values for model elements:
b = 2/10*ones(nvar,1); % first true beta's = 0.2
b(nvar) = log(1+p); % dummy coefficient equals log(1+p)
b(1) = 1; % first beta equals 1
normv = norminv(1-1/2*alpha,0,1); % upper normal value for confidence intervals
tv = tinv(1-1/2*alpha,nobs-nvar); % calculate the upper t-value s.t. p(t>tv) = 1/2*alpha
% Create storage for the used arrays:
kenp = ones(nreps,1); % space for Kennedy estimator
bout = zeros(nvar,nreps); % storage for coefficient estimates
y = zeros(nobs,1); % storage for y values
cl = zeros(nreps,1); cu = zeros(nreps,1); ncu = zeros(nreps,1);
ncl = zeros(nreps,1); % storage for confidence intervals
sum1 = 0; % initialize coverage counters
sum2 = 0;
% Loop in which one random sample is created and confidence intervals are constructed
for i = 1:nreps
x = [ones(nobs,1) randn(nobs,nvar-1)]; % random x for every replication
x(:,nvar) = vertcat(ones(nobs/2,1),zeros(nobs/2,1)); % replace last column of x by the dummy variable
evec = sigma2*randn(nobs,1); % random sample of errors
y = exp(x*b + evec); % generate true values of y
logy = log(y); % transform y
bout(:,i) = lscov(x,logy); % save ols estimators of beta's
e = logy - x*bout(:,i); % calculate residuals
s2 = (transpose(e)*e)/(nobs-nvar); % calculate s^2
varbout = inv(transpose(x)*x)*s2; % calculate estimated variance of bout
vardum = varbout(nvar,nvar); % variance of the estimated dummy coefficient
c = bout(nvar,i); % dummy coefficient
% Exact confidence intervals
cl(i) = exp(c-tv*sqrt(vardum))-1; cu(i) = exp(c+tv*sqrt(vardum))-1;
% count number of times the confidence intervals contain the true value of p
if cl(i)<p && cu(i)>p sum1 = sum1+1; end
% Normal approximation of the confidence intervals
kenp(i) = exp(bout(nvar,i)-1/2*vardum)-1; % calculate the Kennedy estimator of p
varapp = exp(2*c)*(exp(-vardum)-exp(-2*vardum)); % calculate the approximately unbiased estimator of the variance of kenp
ncl(i) = kenp(i) - normv*sqrt(varapp); ncu(i) = kenp(i) + normv*sqrt(varapp);
% count number of times the approximated confidence intervals contain the true value of p
if ncl(i)<p && ncu(i)>p sum2 = sum2+1; end
end
conf = [mean(cl) mean(cu)]; % average confidence intervals
nconf = [mean(ncl) mean(ncu)]; % average approximated confidence intervals
perc = sum1/nreps; % probability confidence interval contains p
nperc = sum2/nreps; % probability approximated confidence interval contains p
Eind = [conf perc nconf nperc]; disp(Eind)
toc % end timing
Matlab code Monte Carlo simulation heteroskedastic error terms using GLS

tic; % start time measure
% Set initial values for main parameters:
nobs = 500; nvar = 4; nreps = 10000; alpha = 0.05; p = 0.25;
sigma1 = 0.3162; sigma2 = 0.9487; % group standard deviations (sigma1^2 = 0.1, sigma2^2 = 0.9)
b = 2/10*ones(nvar,1); % first true beta's = 0.2
b(nvar) = log(1+p); % dummy coefficient equals log(1+p)
b(1) = 1; % first beta equals 1
normv = norminv(1-1/2*alpha,0,1); % upper normal value
tv = tinv(1-1/2*alpha,nobs-nvar); % calculate the upper t-value s.t. p(t>tv) = 1/2*alpha
% Create storage for the used arrays:
kenp = ones(nreps,1); %space for kennedy estimator
bout = zeros(nvar,nreps); % storage for estimates
y = zeros(nobs,1); % Storage for y values
cl=zeros(nreps,1); cu=zeros(nreps,1); ncu=zeros(nreps,1);
ncl=zeros(nreps,1); %storage for confidence intervals
boutgls = zeros(nvar,nreps); %storage for the gls estimators
sum1 = 0; %initialize sum variable
sum2 = 0 ;
for i = 1:nreps
x = [ones(nobs,1) 1/2*randn(nobs,nvar-1)]; % random x for every replication
x(:,nvar) = vertcat(ones(nobs/2,1),zeros(nobs/2,1)); % replace last column of x by the dummy variable
evec = (sigma1*x(:,nvar)+sigma2*(1-x(:,nvar))).*randn(nobs,1); % heteroskedastic random sample of errors
y = exp(x*b + evec); % generate true values of y
logy = log(y); % transform y
% GLS
bout(:,i) = lscov(x,logy); % save ols estimators of beta's
e1 = logy - x*bout(:,i); % calculate ols residuals
z = [ones(nobs,1) x(:,4)]; % regress residuals on a constant and the dummy
gam = lscov(z,log(e1.*e1)); % estimate variance model
gam(1,1) = gam(1,1)+1.27; % correction of the constant in the exponential model
omega = diag(exp(z*gam)); % estimate omega matrix
boutgls(:,i) = (transpose(x)*inv(omega)*x)\(transpose(x)*inv(omega)*logy); % estimate gls coefficients
egls = logy - x*boutgls(:,i); % gls residuals
s2 = (transpose(egls)*egls)/(nobs-nvar); % residual variance of the transformed model (not used below)
varboutgls = inv(transpose(x)*inv(omega)*x); % calculate estimated variance of boutgls
vardum = varboutgls(nvar,nvar); % variance of the estimated dummy coefficient
% calculate the confidence intervals
c = boutgls(nvar,i); % gls coefficient estimate
cl(i) = exp(c-tv*sqrt(vardum))-1; cu(i) = exp(c+tv*sqrt(vardum))-1;
% count number of times the confidence intervals contain the true value of p
if cl(i)<p && cu(i)>p sum1 = sum1+1; end
% normal approximation of the confidence intervals
kenp(i) = exp(boutgls(nvar,i)-1/2*vardum)-1; % calculate the Kennedy estimator of p
varapp = exp(2*c)*(exp(-vardum)-exp(-2*vardum)); % calculate the approximately unbiased estimator of the variance of kenp
ncl(i) = kenp(i) - normv*sqrt(varapp); ncu(i) = kenp(i) + normv*sqrt(varapp); % normal approximated confidence limits
% count number of times the approximated confidence intervals contain the true value of p
if ncl(i)<p && ncu(i)>p sum2 = sum2+1; end
end
conf = [mean(cl) mean(cu)]; % average confidence intervals
nconf = [mean(ncl) mean(ncu)]; % average approximated confidence intervals
perc = sum1/nreps; % probability confidence interval contains p
nperc = sum2/nreps; % probability approximated confidence interval contains p
Eind = [nobs conf perc nconf nperc]; disp(Eind)
toc; % end timing
Matlab code Monte Carlo simulation heteroskedastic error terms using HC estimators

tic; % start timing
% Set initial values for main parameters:
nobs = 500; nvar = 4; nreps = 10000; alpha = 0.05; p = 0.25;
sigma1 = 0.3162; sigma2 = 0.9487; % group standard deviations (sigma1^2 = 0.1, sigma2^2 = 0.9)
hac = 2; % HC variant: hac = 2 gives the small sample correction e^2/(1-h)
% Set initial values for model elements:
b = 2/10*ones(nvar,1); % first true beta's = 0.2
b(nvar) = log(1+p); % dummy coefficient equals log(1+p)
b(1) = 1; % first beta equals 1
% Create storage for the used arrays:
kenp = ones(nreps,1); % space for Kennedy estimator
bout = zeros(nvar,nreps); % storage for estimates
y = zeros(nobs,1); % storage for y values
cl = zeros(nreps,1); cu = zeros(nreps,1); ncu = zeros(nreps,1);
ncl = zeros(nreps,1); % storage for confidence intervals
hc = zeros(nobs,1);
normv = norminv(1-1/2*alpha,0,1); % upper normal value
tv = tinv(1-1/2*alpha,nobs-nvar); % calculate the upper t-value s.t. p(t>tv) = 1/2*alpha
sum1 = 0; % initialize coverage counters
sum2 = 0;
for i = 1:nreps
x = [ones(nobs,1) 1/2*randn(nobs,nvar-1)]; % random x for every replication
x(:,nvar) = vertcat(ones(nobs/2,1),zeros(nobs/2,1)); % replace last column of x by the dummy variable
evec = (sigma1*x(:,nvar)+sigma2*(1-x(:,nvar))).*randn(nobs,1); % heteroskedastic random sample of errors
y = exp(x*b + evec); % generate true values of y
logy = log(y); % transform y
% OLS
bout(:,i) = lscov(x,logy); % save ols estimators of beta's
e = logy - x*bout(:,i); % calculate residuals
H = x*inv(transpose(x)*x)*transpose(x); % hat matrix H = X(X'X)^(-1)X'
% estimate variance of bout
e2 = e.*e./((1-diag(H)).^(hac-1)); % squared residuals with small sample correction
varbout = inv(transpose(x)*x)*transpose(x)*diag(e2)*x*inv(transpose(x)*x); % calculate white estimated variance of bout
vardum = varbout(nvar,nvar); % variance of the estimated dummy coefficient
% calculate the confidence intervals
c = bout(nvar,i); % ols coefficient estimate
cl(i) = exp(c-tv*sqrt(vardum))-1; cu(i) = exp(c+tv*sqrt(vardum))-1;
% count number of times the confidence intervals contain the true value of p
if cl(i)<p && cu(i)>p sum1 = sum1+1; end
kenp(i) = exp(bout(nvar,i)-1/2*vardum)-1; % calculate the Kennedy estimator of p
varapp = exp(2*c)*(exp(-vardum)-exp(-2*vardum)); % calculate the approximately unbiased estimator of the variance of kenp
% calculate the normal approximation of the confidence interval of p
ncl(i) = kenp(i) - normv*sqrt(varapp); ncu(i) = kenp(i) + normv*sqrt(varapp);
% count number of times the approximated confidence intervals contain the true value of p
if ncl(i)<p && ncu(i)>p sum2 = sum2+1; end
end
conf = [mean(cl) mean(cu)]; % average confidence intervals
nconf = [mean(ncl) mean(ncu)]; % average approximated confidence intervals
perc = sum1/nreps; % probability confidence interval contains p
nperc = sum2/nreps; % probability approximated confidence interval contains p
Eind = [nobs conf perc nconf nperc]; disp(Eind)
toc; % end timing