Confidence intervals for dummy
percentage effects in loglinear regression
models
Tim Gunneweg
Supervisor: Dr. K. J. (Kees Jan) van Garderen
ABSTRACT
This paper considers confidence intervals of the percentage effect of
dummy variables in semilogarithmic regression models. First, a method
to construct exact confidence intervals in such models is introduced.
Next, this method is tested by comparing it with normal approximated
confidence intervals using Monte Carlo simulation. From the simulation
experiment it is concluded that the normal approximated confidence
intervals are misleading for small samples due to the non-normality of
the finite sample distribution of the percentage estimator. Furthermore,
two adjustments to the new technique that can be used in the presence
of heteroskedasticity are discussed and tested: a method based on FGLS
results and a method based on heteroskedasticity-consistent estimators.
The latter method performs better in terms of coverage probability for
different model parameters, but the former might be improved using a
correction for small samples.
Table of Contents
1. INTRODUCTION
2. THEORETICAL FRAMEWORK
THE MODEL
DIFFERENT ESTIMATORS OF p_j
NORMAL APPROXIMATED CONFIDENCE INTERVALS
3. CONSTRUCTING CONFIDENCE INTERVALS
4. MONTE CARLO SIMULATION
RESULTS
5. HETEROSKEDASTIC ERROR TERMS
MONTE CARLO SIMULATION
RESULTS
6. DISCUSSION
7. REFERENCES
APPENDIX A HYPERGEOMETRIC FUNCTIONS
APPENDIX B MATLAB COMPUTATIONS
MATLAB CODE MONTE CARLO SIMULATION HOMOSCEDASTIC ERROR TERMS
MATLAB CODE MONTE CARLO SIMULATION HETEROSKEDASTIC ERROR TERMS USING GLS
MATLAB CODE MONTE CARLO SIMULATION HETEROSKEDASTIC ERROR TERMS USING HC ESTIMATORS
1. Introduction
In economic literature, log transformations are often used to test the percentage impact of dummy regressors on a dependent variable. For example, Immergluck (2008) studies the effect of different financial regulators on investments of American banks in housing and community development. In the United States, banks are encouraged to invest in housing and community development by the Community Reinvestment Act (CRA). This is a piece of legislation that allows financial regulators to base their approval of certain banking activities partly on the value of a bank's investment in housing and community development. In the United States, there are four different regulators for different types of banks. Immergluck (2008) estimates a linear regression model with the log of a bank's CRA-qualified investments as dependent variable. As independent variables he uses various control variables together with dummies for the regulators. Immergluck (2008) concludes that the effect of two regulators is significant, based on the significance of the OLS estimates of the dummy coefficients. Furthermore, he finds the magnitudes of the three dummies to be 183%, 112% and 82% respectively, with the last one not being significant.
While concluding that the percentage effect is present and making a claim about its size, Immergluck (2008) does not provide confidence intervals of the percentage effect to indicate the precision of the estimated magnitude. Although not necessary for conclusions about the presence of a percentage effect, confidence intervals can be very useful to gain more insight into the spread of an effect. However, in the special case of the percentage effect of a dummy variable, the construction of confidence intervals is not as straightforward as it is for continuous variables, because of the binary nature of a dummy variable.
The characteristics of the percentage effect of dummy variables have been widely studied. Contrary to a continuous variable, the estimated coefficient of a dummy variable in a loglinear model cannot be interpreted as the percentage effect of that variable on the dependent variable. Instead, as van Garderen and Shah (2002) show, the percentage effect of a dummy variable should be estimated using the approximately unbiased Kennedy estimator, which is a function of the OLS estimate of the dummy coefficient and the OLS estimate of its variance. Furthermore, they argue that this Kennedy estimator should be used together with an approximately unbiased estimator of its variance to measure its spread. Nevertheless, they do not elaborate on how this spread should be interpreted exactly. These statistics could be used to construct normal approximations of the confidence intervals of the percentage effect when the distribution of the Kennedy estimator is close to normal. To examine the reliability of these approximations, Giles (2011) formulates an expression for the finite sample density function of the Kennedy estimator and concludes that it is far from normal and that bootstrap methods should therefore be used to estimate confidence intervals. However, the provided density function is incorrect, as van Garderen points out (personal communication, May 9, 2014). Furthermore, it is not clear whether bootstrap methods are optimal under these conditions.
The main goal of this study is to develop and test a method to construct exact confidence intervals for the percentage effect of dummy variables in loglinear models. The resulting intervals are compared with confidence intervals based on the normal approximation of the percentage estimator.
The method to construct confidence intervals is examined in three ways. First, a technique to construct confidence intervals under perfect model
assumptions is derived theoretically. Next, Monte Carlo simulation is used to compare confidence intervals resulting from this technique with confidence intervals based on the normal approximation of the Kennedy estimator. Furthermore, it is demonstrated that these techniques and the normal
approximation technique can lead to different results. Finally, the implications of heteroskedasticity are discussed and two methods to solve this issue are
compared using Monte Carlo simulation.
This paper is organized as follows. In the second section, the theoretical framework is developed by formulating the considered model, explaining different estimators of the percentage effect, and investigating the normal approximation based on the Kennedy estimator. In the third section, a method for constructing confidence intervals is derived and explained. In the fourth section, the first simulation experiment is explained and its most important results are discussed. In the fifth section, the implications of heteroskedasticity are discussed and two methods to solve this issue are compared using Monte Carlo simulation. Finally, in the sixth section, some concluding remarks are made.
2. Theoretical Framework
In the following section, the necessary theoretical framework is established. First, the considered model is specified. Thereafter, two different estimators of the percentage effect are discussed: the minimum variance unbiased estimator and the Kennedy estimator. Finally, the normal approximation of confidence intervals of the Kennedy estimator is provided.
The model
The considered model can be specified as follows:

$$Y = \exp\left\{a + \sum_{i=1}^{q} b_i X_i + \sum_{j=1}^{r} c_j D_j + \varepsilon\right\}$$

where $\exp\{\cdot\}$ is defined as the element wise exponential function, the $X_i$'s are continuous variables, the $D_j$'s are dummy variables and $\varepsilon \sim N(0, \sigma^2 I_n)$. After taking the element wise log at both sides of the model equation, the model becomes linear:

$$\mathrm{Log}\, Y = a + \sum_{i=1}^{q} b_i X_i + \sum_{j=1}^{r} c_j D_j + \varepsilon \qquad (2.1)$$
Since the model is linear after the transformation, the optimal estimators of the coefficients can be obtained using OLS. In the continuous case, the coefficients of the resulting linear model, multiplied by 100, can be interpreted as the percentage effect of the independent variable on the dependent variable. To see why, differentiate both sides of (2.1) with respect to $X_i$ to obtain

$$p_i = 100 \cdot \frac{1}{Y}\frac{\partial Y}{\partial X_i} = 100\,\frac{\partial \ln Y}{\partial X_i} = 100\, b_i$$

In order to simplify notation, from now on $p_i$ is defined as the relative change, without the factor 100. So in this case

$$p_i = \frac{1}{Y}\frac{\partial Y}{\partial X_i} = \frac{\partial \ln Y}{\partial X_i} = b_i.$$
For dummy variables, this does not hold, since a dummy variable $D_j$ is binary and hence no continuous derivative of $Y$ with respect to $D_j$ exists. The percentage change $p_j$ of $Y$, from $Y_0$ to $Y_1$, resulting from the change of $D_j$ from 0 to 1, should be calculated directly using $p_j = (Y_1 - Y_0)/Y_0$ (van Garderen & Shah, 2002). Using (2.1), and holding all other regressors fixed, this leads to

$$p_j = \frac{Y_1 - Y_0}{Y_0} = \frac{Y_1}{Y_0} - 1 = \frac{\exp\{a + \sum_i b_i X_i + c_j \cdot 1 + \varepsilon\}}{\exp\{a + \sum_i b_i X_i + c_j \cdot 0 + \varepsilon\}} - 1 = \exp\{c_j\} - 1$$

so $p_j = \exp\{c_j\} - 1$. (2.2) An estimator of $p_j$ is therefore unbiased only if its expectation equals $\exp\{c_j\} - 1$.
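For example, the value $c_j = 0.2231$ used in the simulations below gives $p_j = e^{0.2231} - 1 \approx 0.25$, a 25% increase in $Y$ when $D_j$ switches from 0 to 1.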
Different estimators of p_j
Since in general $c$ (from now on, the subscript $j$ is dropped for clarity) is unknown, (2.2) cannot be used directly to calculate $p$; instead, $p$ has to be estimated using the OLS estimate $\hat{c}$ of $c$. A simple, but wrong, solution to this problem, often used in the literature (see the references in van Garderen and Shah, 2002), would be to replace $c$ in (2.2) with its OLS estimate $\hat{c}$. However, it is easy to see that this results in a biased estimator of $p$, because

$$E[\exp\{\hat{c}\} - 1 \mid X] > \exp\{E[\hat{c} \mid X]\} - 1 = \exp\{c\} - 1$$

where the inequality sign follows directly from Jensen's inequality and the strict convexity of the exponential mapping.
Goldberger (1968) shows that the expected value of $\exp\{\hat{c}\}$ in fact equals $\exp\{c + \frac{1}{2}V(\hat{c})\}$, with $V(\hat{c})$ the variance of $\hat{c}$. Therefore, Kennedy (1981) argued to use the following estimator of $p$:

$$\hat{p} = \exp\left\{\hat{c} - \tfrac{1}{2}\hat{V}(\hat{c})\right\} - 1 \qquad (2.3)$$

where $\hat{c}$ is the OLS estimate of $c$ and $\hat{V}(\hat{c})$ is the OLS estimate of its variance. In their study, van Garderen and Shah (2002) show that the Kennedy estimator is biased and that the minimum variance unbiased estimator of $p$ equals

$$\tilde{p} = \exp\{\hat{c}\}\; {}_0F_1\!\left(m;\, -\tfrac{1}{2}\, m\, \hat{V}(\hat{c})\right) - 1 \qquad (2.4)$$

where $\hat{c}$ and $\hat{V}(\hat{c})$ are OLS estimates, $m = \frac{n-k}{2}$ with $n$ the number of observations and $k$ the number of regressors, and ${}_0F_1$ is the confluent hypergeometric limit function (for an explanation of the hypergeometric functions used, see Appendix A). They also show that the variance of (2.4) equals

$$V(\tilde{p}) = \exp\{2c\}\left[\exp\{V(\hat{c})\}\; {}_0F_1\!\left(m;\, \tfrac{1}{4}V(\hat{c})^2\right) - 1\right]$$

Furthermore, they prove that the minimum variance unbiased estimator of $V(\tilde{p})$ is

$$\hat{V}(\tilde{p}) = \exp\{2\hat{c}\}\left\{\left[{}_0F_1\!\left(m;\, -\tfrac{1}{2}\, m\, \hat{V}(\hat{c})\right)\right]^2 - {}_0F_1\!\left(m;\, -2\, m\, \hat{V}(\hat{c})\right)\right\}$$

In addition to the minimum variance unbiased estimator of the variance of $\tilde{p}$, van Garderen and Shah (2002) derive the following approximately unbiased estimator of its variance:

$$\hat{V}(\hat{p}) = \exp\{2\hat{c}\}\left[\exp\{-\hat{V}(\hat{c})\} - \exp\{-2\hat{V}(\hat{c})\}\right] \qquad (2.5)$$

In their research, they further show that the unbiased estimates of $p$ are very close to those calculated by the much more convenient Kennedy estimator. This leads them to suggest that, in most applications, Kennedy's estimator should be used together with their approximately unbiased estimator of the variance (2.5) when estimating the percentage impact of a dummy variable.
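As an illustration, the following Matlab sketch computes the Kennedy estimator (2.3) and the approximately unbiased variance (2.5); the values of chat and vhat are hypothetical OLS output, not results from this paper.

chat = 0.2231; % assumed OLS estimate of the dummy coefficient
vhat = 0.01; % assumed OLS estimate of its variance
pken = exp(chat - 1/2*vhat) - 1; % Kennedy estimator (2.3)
vken = exp(2*chat)*(exp(-vhat) - exp(-2*vhat)); % approximately unbiased variance (2.5)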
Normal Approximated Confidence Intervals
Although arguing that (2.3) should be used to measure the size of a percentage effect and that (2.5) should be used to measure its variance, van Garderen and Shah do not explain how these two statistics should be used to construct precise confidence intervals of p. Using only a point estimator and an estimator of its variance, confidence intervals are usually approximated using the normal
distribution. So in this case, approximated equal tailed $1-\alpha$ confidence intervals could be constructed using

$$c_l(x,y) = \hat{p} - \sqrt{\hat{V}(\hat{p})}\; z_{\frac{\alpha}{2}} \qquad \text{and} \qquad c_u(x,y) = \hat{p} + \sqrt{\hat{V}(\hat{p})}\; z_{\frac{\alpha}{2}} \qquad (2.6)$$

where $z_{\frac{\alpha}{2}}$ is the standard normal value with probability $\alpha/2$ to its right. Although asymptotically accurate, (2.6) could be misleading when the distribution of $\hat{p}$ is very different from normal. In order to find a basis for inference about $p$, Giles (2011) derives the finite sample distribution of $\hat{p}$, but his expression apparently contains an error. Van Garderen (personal communication, May 9, 2014) shows that the pdf equals

$$f(\hat{p}) = \frac{v^{v/2}}{2^{\frac{v+2}{4}}\sqrt{\pi}\,(d\sigma^2)^{\frac{v+2}{4}}\,(1+\hat{p})}\, \exp\left\{-\frac{\left(\mathrm{Log}(1+\hat{p}) - c\right)^2}{2d\sigma^2}\right\}\, U\!\left(\frac{v}{4};\, \frac{1}{2};\, \frac{\left(\mathrm{Log}(1+\hat{p}) - c + v\right)^2}{2d\sigma^2}\right) \qquad (2.7)$$

where $U$ is Tricomi's hypergeometric function, $v = n-k$ the degrees of freedom, and $d$ the diagonal element of $(X'X)^{-1}$ corresponding to the dummy variable. Figure 1 shows a plot of this density function with parameter values $\sigma^2 = 0.5$, $v = 10$, $d = 0.5$, and $c = 0.25$. From Figure 1 it is clear that the finite sample distribution of $\hat{p}$ is far from normal, with the density function being positively skewed. Therefore, normal approximated confidence intervals could be misleading, since they are shifted to the left compared to exact confidence intervals. To assess the magnitude of this error, an exact method to find confidence intervals of $p$ is developed in the next section.
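Before turning to the exact method, note that (2.6) takes only a few lines in Matlab, continuing the earlier sketch with its hypothetical inputs pken and vken.

alpha = 0.05; % assumed significance level
zval = norminv(1 - alpha/2); % upper standard normal quantile
ncl = pken - zval*sqrt(vken); % lower limit in (2.6)
ncu = pken + zval*sqrt(vken); % upper limit in (2.6)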
Figure 1: pdf of $\hat{p}$ with $\sigma^2 = 0.5$, $v = 10$, $d = 0.5$, and $c = 0.25$

3. Constructing Confidence Intervals
In this section, a method for the construction of exact confidence intervals is developed. The technique is based on confidence intervals for the coefficients in the transformed model using OLS estimates.
Consider the model as stated in (2.1). To distinguish between the stochastic random sample and an observed sample, first some notation is introduced. $X = (X_1\, X_2 \dots X_q\, D_1\, D_2 \dots D_r)$ and $Y$ represent the stochastic random variables before the sample is observed, and $x$ and $y$ represent observed outcomes of the random sample $X$ and $Y$.
The goal is to find a two-sided $1-\alpha$ confidence interval $I$ of $p_j = \exp\{c_j\} - 1$ based on a random sample with explanatory variables $x$ and dependent variable $y$. In this context, a two-sided $1-\alpha$ confidence interval is defined by Bain and Engelhardt (1992, p. 360) as follows:

Definition 1. An interval $(c_l(x,y),\, c_u(x,y))$ is called a $100(1-\alpha)\%$ confidence interval for $p$ if $P(C_l(X,Y) < p < C_u(X,Y)) = 1 - \alpha$, where $\alpha \in (0,1)$, $x$ and $y$ are observed values of the random sample $X$ and $Y$, and $c_l(x,y)$ and $c_u(x,y)$ are functions of $x$ and $y$.

Note that in Definition 1, the statistics $C_l(X,Y)$ and $C_u(X,Y)$ are stochastic because they are functions of the random variables $X$ and $Y$. On the other hand, $c_l(x,y)$ and $c_u(x,y)$ are observed values of these statistics in the case that the observed value of $X$ is $x$ and the observed value of $Y$ is $y$.
In this context, confidence intervals can be constructed using Theorem 1.
Theorem 1.
Let $\hat{c}$ be the OLS estimate of the coefficient of a dummy variable $D$ in a loglinear regression model as specified in (2.1), and $\hat{V}(\hat{c})$ the OLS estimate of its variance. An equal tailed, two-sided $1-\alpha$ confidence interval of the relative change $p$ in $Y$ due to $D$ changing from 0 to 1 is given by $(c_l(x,y),\, c_u(x,y))$ with

$$c_l(x,y) = \exp\left\{\hat{c} - \sqrt{\hat{V}(\hat{c})}\; t_{\frac{\alpha}{2}, n-k}\right\} - 1, \qquad c_u(x,y) = \exp\left\{\hat{c} + \sqrt{\hat{V}(\hat{c})}\; t_{\frac{\alpha}{2}, n-k}\right\} - 1 \qquad (3.1)$$

where $n$ is the sample size, $k$ is the number of regressors, and $t_{\frac{\alpha}{2}, n-k}$ is the value of the $t$ distribution with $n-k$ degrees of freedom such that the probability to the right of it is $\alpha/2$.
Proof.
Using (3.1) in Definition 1 gives

$$P(C_l(X,Y) < p < C_u(X,Y))$$
$$= P\left(\exp\left\{\hat{c} - \sqrt{\hat{V}(\hat{c})}\, t_{\frac{\alpha}{2}, n-k}\right\} - 1 < p < \exp\left\{\hat{c} + \sqrt{\hat{V}(\hat{c})}\, t_{\frac{\alpha}{2}, n-k}\right\} - 1\right)$$
$$= P\left(\hat{c} - \sqrt{\hat{V}(\hat{c})}\, t_{\frac{\alpha}{2}, n-k} < \mathrm{Log}(p+1) < \hat{c} + \sqrt{\hat{V}(\hat{c})}\, t_{\frac{\alpha}{2}, n-k}\right)$$
$$= P\left(\hat{c} - \sqrt{\hat{V}(\hat{c})}\, t_{\frac{\alpha}{2}, n-k} < c < \hat{c} + \sqrt{\hat{V}(\hat{c})}\, t_{\frac{\alpha}{2}, n-k}\right)$$
$$= P\left(-t_{\frac{\alpha}{2}, n-k} < \frac{\hat{c} - c}{\sqrt{\hat{V}(\hat{c})}} < t_{\frac{\alpha}{2}, n-k}\right) = 1 - \alpha$$

where the substitution of $c$ for $\mathrm{Log}(p+1)$ follows from (2.2) and the last equality follows from the fact that $(\hat{c} - c)/\sqrt{\hat{V}(\hat{c})} \sim t(n-k)$. The equality of the tails follows from the symmetry of the $t$ distribution. This completes the proof.
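A minimal Matlab sketch of Theorem 1, using the same hypothetical estimates chat and vhat as in the earlier sketch and assumed dimensions n = 100 and k = 4:

n = 100; k = 4; alpha = 0.05; % assumed sample dimensions and level
tval = tinv(1 - alpha/2, n - k); % upper t quantile with n-k degrees of freedom
cl = exp(chat - tval*sqrt(vhat)) - 1; % lower limit in (3.1)
cu = exp(chat + tval*sqrt(vhat)) - 1; % upper limit in (3.1)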
4. Monte Carlo Simulation
Monte Carlo simulation is used to examine small sample properties of the exact confidence intervals (3.1) and their normal approximation (2.6). All simulations are done using Matlab; the code can be found in Appendix B. The simulations are based on the model

$$\mathrm{Log}\, Y_i = a + b_1 X_{1i} + b_2 X_{2i} + c D_i + \varepsilon_i, \qquad \varepsilon_i \sim \text{i.i.d. } N(0, \sigma^2).$$

To cancel out the effect of specific values of the regressors, $X_1$ and $X_2$ are regenerated each replication as standard normal variables. The first $\frac{1}{2}n$ values of the variable $D$ are set equal to 1 and the second $\frac{1}{2}n$ values are set equal to zero. The initial values of the model parameters are set to $a = 1$, $b_1 = b_2 = 0.2$, $c = 0.2231$ and $\sigma^2 = 0.25$, so that the percentage effect of $X_1$ and $X_2$ equals 20% and the percentage effect of $D_i$ equals $p = \exp\{c\} - 1 = 25\%$. Using (2.6) and (3.1), equal tailed 95% confidence intervals are calculated for different sample sizes, using 10000 replications for each sample size.
Results
In Table 1, for both methods the average values of the confidence limits are reported together with their coverage probabilities. From Table 1, it is clear that the normal approximated confidence intervals are much narrower than the exact confidence intervals, especially for small sample sizes. Furthermore, the normal approximated intervals are shifted downwards compared to the exact intervals. As a result of these two effects, the coverage probability of the normal approximated intervals falls below the nominal level of 95%. This is especially the case for small sample sizes. As the sample size increases, the normal approximation approaches the exact intervals, and thereby the coverage probability increases. This result is illustrated by Figures 2 and 3, where the confidence limits of the different methods and the coverage probabilities are plotted for different sample sizes. From these results, it is clear that normal approximated confidence intervals can be very misleading when sample sizes are small.
Table 1. Average confidence intervals and coverage probabilities for different sample sizes

         Exact confidence intervals     Normal approximation
n        cl       cu       CP           cl       cu       CP
10       -0.416   2.380    0.950        -0.574   1.069    0.858
20       -0.214   1.118    0.952        -0.314   0.812    0.912
30       -0.134   0.877    0.951        -0.206   0.706    0.926
40       -0.087   0.758    0.949        -0.143   0.642    0.930
50       -0.054   0.687    0.950        -0.100   0.600    0.934
60       -0.029   0.640    0.951        -0.069   0.570    0.939
70       -0.010   0.604    0.950        -0.045   0.545    0.940
80        0.005   0.575    0.949        -0.026   0.525    0.940
90        0.018   0.553    0.952        -0.010   0.510    0.944
100       0.029   0.535    0.950         0.004   0.496    0.943
150       0.067   0.475    0.951         0.050   0.451    0.946
250       0.106   0.420    0.949         0.095   0.406    0.947
500       0.146   0.366    0.949         0.140   0.359    0.948
1000      0.175   0.331    0.950         0.173   0.328    0.949
5000      0.216   0.285    0.950         0.215   0.285    0.950
10000     0.226   0.275    0.949         0.226   0.275    0.949
Figure 2: Confidence intervals of $p$ for different sample sizes (exact limits cl, cu and normal approximated limits Ncl, Ncu)
Figure 3: Coverage probabilities of both confidence intervals for different sample sizes
5. Heteroskedastic Error Terms
Although Theorem 1 gives an exact method to construct confidence intervals, it rests on quite strong assumptions about the data generating process. In particular, the assumption of homoscedastic error terms is often violated in practice. In this section, the implications of heteroskedastic error terms are explored.
First, consider the model:
$$y = \mathrm{Log}\, Y = Xb + \varepsilon,$$

where $X$ contains both dummy and continuous variables, $b$ is the corresponding coefficient vector, and $\varepsilon \sim N(0, \Omega)$ with $\Omega$ a known positive definite diagonal matrix. The model can now be estimated using GLS. The model is transformed by premultiplying both sides of the model equation by $\Omega^{-1/2}$. Consequently:

$$\Omega^{-1/2} y = \Omega^{-1/2} X b + \Omega^{-1/2} \varepsilon,$$

where $\Omega^{-1/2}\varepsilon \sim N(0, I_n)$. In this case, the estimated coefficients are $\hat{b}_{GLS} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$ with variance $\mathrm{Var}(\hat{b}_{GLS}) = (X'\Omega^{-1}X)^{-1}$. Since these estimates result from a linear regression equation, the t-statistic follows a t distribution with $n-k$ degrees of freedom. Therefore, Theorem 1 can be applied using the GLS estimates. This results in the following expressions for the confidence limits:

$$c_l(x,y) = \exp\left\{\hat{c}_{GLS} - \sqrt{\hat{V}(\hat{c}_{GLS})}\; t_{\frac{\alpha}{2}, n-k}\right\} - 1, \qquad c_u(x,y) = \exp\left\{\hat{c}_{GLS} + \sqrt{\hat{V}(\hat{c}_{GLS})}\; t_{\frac{\alpha}{2}, n-k}\right\} - 1 \qquad (5.1)$$
However, in practice the structure of $\Omega$ is unknown, and consequently (5.1) cannot be used directly to construct confidence intervals. In the following section, two methods to solve this issue are discussed. The first method uses an estimate of the matrix $\Omega$ in combination with the results derived above. The second method is based on a heteroskedasticity-consistent estimator of the variance of the dummy coefficient.
The first method uses two-step FGLS results to estimate the covariance matrix $\Omega$, by specifying the following model of $\sigma_i^2$:

$$\sigma_i^2 = \exp\{z_i'\gamma\},$$

where $z_i = (1, z_{2i}, \dots, z_{pi})'$ is a vector of explanatory variables (Heij et al., 2004, p. 337). The exponential transformation is used to guarantee that estimated values of $\sigma_i^2$ are positive. In the first step, OLS is applied in the model $\mathrm{Log}\, Y = Xb + \varepsilon$. If $\hat{b}$ is consistent, the squared residuals $e_i^2$ of this regression are asymptotically unbiased estimates of $\sigma_i^2$. Therefore, in the second step, the following model is estimated to find the values of $\gamma$:

$$\mathrm{Log}\, e_i^2 = z_i'\gamma + \eta_i$$

The coefficients $\gamma_i$ are estimated consistently for $i = 2, \dots, p$, but the constant should be corrected using $\hat{\gamma}_1 - E[\log \chi^2(1)] = \hat{\gamma}_1 + 1.27$, as Heij et al. (2004, p. 337) point out. Finally, $\Omega$ is estimated by $\hat{\sigma}_i^2 = \exp\{z_i'\hat{\gamma}\}$. Using this estimate of $\Omega$, (5.1) can be used to construct confidence intervals, as sketched below. The normal approximation (2.6) can be calculated using the FGLS estimate of the coefficient and the estimated covariance matrix in (2.3) and (2.5).
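A condensed sketch of this two-step procedure, assuming a design matrix x, a log-transformed dependent variable logy, and a variance-model regressor matrix z are already in memory (the full simulation code is given in Appendix B):

bols = x\logy; % step 1: OLS coefficients
e = logy - x*bols; % OLS residuals
gam = z\log(e.^2); % step 2: regress log squared residuals on z
gam(1) = gam(1) + 1.27; % correct the constant: -E[log chi2(1)] = 1.27
w = exp(z*gam); % estimated variances, the diagonal of Omega-hat
bfgls = (x'*(x./w))\(x'*(logy./w)); % FGLS coefficients (X'inv(Omega)X)^(-1)X'inv(Omega)y
Vfgls = inv(x'*(x./w)); % estimated covariance matrix of bfgls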
The second method uses White standard errors with a correction for small samples to estimate the variance of the coefficient estimator. This is a method to obtain heteroskedasticity-consistent estimates of the variance of $\hat{b}$. The biggest advantage of this method is that it can be used when no model of the variance is known. Instead of GLS, OLS is used to estimate the model parameters. Furthermore, the variance of the coefficients is estimated using

$$\widehat{\mathrm{var}}(\hat{b}) = (X'X)^{-1} X'\, \mathrm{diag}\!\left(\frac{e_i^2}{1 - h_{ii}}\right) X\, (X'X)^{-1} \qquad (5.2)$$

where $h_{ii}$ is the $ii$-th element of $H = X(X'X)^{-1}X'$. This estimator of the variance can be used in (5.1) together with the OLS estimate of the dummy coefficient to construct confidence intervals. The normal approximation (2.6) can be calculated using the OLS coefficient together with (5.2) in (2.3) and (2.5).
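A sketch of (5.2) under the same assumptions about x and logy as before; the resulting vardum can be plugged into (5.1) together with the OLS estimate of the dummy coefficient.

bols = x\logy; % OLS coefficients
e = logy - x*bols; % OLS residuals
h = diag(x*((x'*x)\x')); % leverages h_ii of H = X(X'X)^(-1)X'
meat = x'*diag(e.^2./(1 - h))*x; % X' diag(e_i^2/(1-h_ii)) X
Vb = (x'*x)\meat/(x'*x); % White variance estimate (5.2)
vardum = Vb(end,end); % variance of the dummy coefficient (dummy in the last column)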
Monte Carlo simulation
Monte Carlo simulation is used to compare the techniques developed in the previous section. Simulation is based on the same model as before, but now with $\varepsilon_i \sim N(0,\, \sigma_1^2 D_i + \sigma_2^2(1 - D_i))$, so that the variance of $\varepsilon_i$ depends on the value of $D_i$. This specification of the heteroskedasticity is used because it is natural to assume that the variance differs between groups within the sample. Again, to cancel out the effect of specific values of the regressors, $X_1$ and $X_2$ are regenerated each replication as standard normal variables. The first $\frac{1}{2}n$ values of the variable $D$ are set equal to 1 and the second $\frac{1}{2}n$ values are set equal to zero. The initial values of the model parameters are set to $a = 1$, $b_1 = b_2 = 0.2$, $c = 0.2231$, $\sigma_1^2 = 0.25$ and $\sigma_2^2 = 0.64$, so that the percentage effect of $X_1$ and $X_2$ equals 20% and the percentage effect of $D_i$ equals $p = \exp\{c\} - 1 = 25\%$. Using both techniques, equal tailed 95% confidence intervals are calculated for different sample sizes, with 10000 replications for each sample size. In the GLS method, the covariance matrix is estimated using the dummy variable and a constant as regressors, thus $z_i = (1, D_i)'$.
Results
In Table 2, the average confidence intervals and coverage probabilities for both methods and their normal approximations are reported. It is clear that both methods outperform their normal approximations significantly. Again, this effect vanishes as the sample size increases. Furthermore, the coverage probability of the White method is larger than that of the GLS method for all sample sizes, with the difference being smaller for larger samples. This is illustrated by Figure 4, where the coverage probability of both methods is plotted for different sample sizes.
Table 2: Confidence intervals and coverage probabilities for $\sigma_1^2 = 0.25$ and $\sigma_2^2 = 0.64$

      White                     Normal approx. White      GLS                       Normal approx. GLS
n     cl      cu      CP        cl      cu      CP        cl      cu      CP        cl      cu      CP
10    -0.502  4.154   0.935     -0.769  1.288   0.814     -0.459  3.632   0.911     -0.695  1.256   0.795
20    -0.316  1.526   0.944     -0.486  0.961   0.886     -0.298  1.544   0.927     -0.467  0.974   0.874
30    -0.235  1.149   0.948     -0.360  0.834   0.902     -0.222  1.147   0.940     -0.346  0.835   0.903
50    -0.137  0.869   0.946     -0.218  0.709   0.917     -0.132  0.868   0.943     -0.213  0.708   0.921
100   -0.033  0.648   0.947     -0.077  0.579   0.935     -0.030  0.651   0.947     -0.074  0.581   0.936
250    0.062  0.482   0.952      0.044  0.457   0.946      0.062  0.481   0.950      0.043  0.456   0.947
500    0.113  0.408   0.949      0.103  0.396   0.947      0.116  0.412   0.949      0.106  0.400   0.948
Figure 4: Coverage probabilities for $\sigma_1^2 = 0.25$ and $\sigma_2^2 = 0.64$ (White CP and GLS CP for different sample sizes)

Figure 5: Coverage probabilities for $\sigma_1^2 = 0.1$ and $\sigma_2^2 = 0.9$ (White CP and GLS CP for different sample sizes)
To test the effect of the size of the heteroskedasticity, the simulation experiment is repeated with $\sigma_1^2 = 0.1$ and $\sigma_2^2 = 0.9$. The results are summarized in Table 3. Again, the coverage probability of the White confidence intervals exceeds that of the GLS confidence intervals. Furthermore, the normal approximations are unreliable in terms of coverage probabilities. These results indicate that the White confidence intervals perform better for different magnitudes of heteroskedasticity.
Table 3: Coverage probabilities for $\sigma_1^2 = 0.1$ and $\sigma_2^2 = 0.9$

      Exact White               Normal approx. White      Exact GLS                 Normal approx. GLS
n     cl      cu      CP        cl      cu      CP        cl      cu      CP        cl      cu      CP
10    -0.515  4.995   0.928     -0.803  1.309   0.795     -0.468  3.774   0.902     -0.713  1.259   0.776
20    -0.327  1.687   0.937     -0.519  1.020   0.879     -0.308  1.623   0.922     -0.488  0.993   0.871
30    -0.244  1.232   0.943     -0.383  0.875   0.906     -0.230  1.216   0.932     -0.365  0.866   0.899
50    -0.150  0.922   0.947     -0.241  0.740   0.925     -0.141  0.920   0.937     -0.230  0.739   0.917
100   -0.047  0.677   0.947     -0.096  0.598   0.937     -0.045  0.673   0.943     -0.094  0.595   0.930
250    0.053  0.498   0.949      0.031  0.469   0.942      0.052  0.494   0.946      0.031  0.466   0.941
500    0.106  0.418   0.950      0.095  0.404   0.948      0.107  0.419   0.949      0.096  0.405   0.948

6. Discussion
Loglinear regression models are often used to model percentage effects in economic relations. For continuous variables, interpretation of the estimated coefficients follows from differentiation of the model with respect to the corresponding variable. Due to their binary character, the interpretation of dummy variables is not as straightforward, since no continuous derivative with respect to a dummy exists. In recent studies, unbiased and approximately unbiased estimators of the percentage effect of dummy variables have been developed and tested, together with unbiased and approximately unbiased estimators of their variance. In their research, van Garderen and Shah (2002) argue that the estimator provided by Kennedy (1981) can be used safely to estimate the size of a percentage effect. Furthermore, they derive a convenient approximately unbiased estimator of its variance. Although providing point estimates and measures of spread, none of the recent studies gives an exact method for the construction of confidence intervals of the percentage effect of dummy variables.
In this paper, an exact method to construct confidence intervals under perfect model assumptions is developed. Furthermore, two possible adjustments
that can be made in the case of heteroskedastic error terms are discussed: a method based on heteroskedasticity-consistent (HC) estimates of the variance of the coefficient and a method based on two-step FGLS. Using Monte Carlo simulation, all methods are tested together with normal approximations based on Kennedy's (1981) estimator of the percentage effect and van Garderen and Shah's (2002) approximately unbiased estimator of its variance.
From the simulation experiment, it is clear that small sample confidence intervals based on the normal approximation can be misleading for two reasons: they are shifted to the left and they are much narrower than the exact intervals. Therefore, under classic model assumptions, the exact method should be preferred over the normal approximation when sample sizes are small.
In the case of heteroskedastic error terms, the method based on heteroskedasticity-consistent estimates of the variance of the dummy coefficient outperforms the method based on FGLS results in terms of coverage probability for all sample sizes and different magnitudes of heteroskedasticity. These results are counterintuitive, since the FGLS method uses more information about the data generating process. However, one should be cautious drawing conclusions from these seemingly strong results for the following reason. In the first step of the FGLS estimation, the model of the variance is estimated by replacing $\sigma_i^2$ by $e_i^2$, because $e_i^2$ is an asymptotically unbiased estimate of $\sigma_i^2$. Nevertheless, it is known that under classic model assumptions $E[e_i^2] = \sigma^2(1 - h_{ii})$, where $h_{ii}$ is the $i$-th diagonal element of $H = X(X'X)^{-1}X'$. As a result, a small sample correction factor might be needed when estimating the value of $\sigma_i^2$. Therefore, more research is needed before conclusions about the best response to heteroskedastic error terms can be drawn.
7. References
Bain, L. J., & Engelhardt, M. (1992). Introduction to probability and mathematical statistics (2nd ed.). Belmont, CA: Duxbury Press.
van Garderen, K. J., & Shah, C. (2002). Exact interpretation of dummy variables in semilogarithmic equations. The Econometrics Journal, 5(1), 149-159.
Giles, D. E. (2011). Interpreting dummy variables in semi-logarithmic regression models: Exact distributional results (Working Paper No. 1101). Department of Economics, University of Victoria.
Heij, C., de Boer, P., Franses, P. H., Kloek, T., & van Dijk, H. K. (2004). Econometric methods with applications in business and economics. Oxford: Oxford University Press.
Immergluck, D. (2008). Out of the goodness of their hearts? Regulatory and regional impacts on bank investment in housing and community development in the United States. Journal of Urban Affairs, 30(1), 1-20.
Kennedy, P. E. (1981). Estimation with correctly interpreted dummy variables in semilogarithmic equations. American Economic Review, 71(4), 801.
Appendix A Hypergeometric functions
The hypergeometric series ${}_pF_q(a_1, \dots, a_p;\, b_1, \dots, b_q;\, z)$ is defined as

$${}_pF_q(a_1, \dots, a_p;\, b_1, \dots, b_q;\, z) = \sum_{n=0}^{\infty} \frac{(a_1)_n \cdots (a_p)_n}{(b_1)_n \cdots (b_q)_n}\, \frac{z^n}{n!},$$

where $(x)_i$ represents the Pochhammer symbol, defined as

$$(x)_0 = 1 \quad \text{and} \quad (x)_i = x(x+1)\cdots(x+i-1).$$

Tricomi's hypergeometric function $U$, used in (2.7), can be defined in terms of hypergeometric series by

$$U(a, b, z) = \frac{\pi}{\sin(\pi b)} \left[ \frac{{}_1F_1(a;\, b;\, z)}{\Gamma(a - b + 1)\, \Gamma(b)} - z^{1-b}\, \frac{{}_1F_1(a - b + 1;\, 2 - b;\, z)}{\Gamma(a)\, \Gamma(2 - b)} \right]$$
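The series above can be evaluated numerically by direct summation. The following Matlab sketch truncates the sum after nmax terms (an assumption; this suffices for the moderate arguments encountered in this paper). For example, pFq([], m, -1/2*m*vhat, 100) evaluates the function ${}_0F_1(m;\, -\frac{1}{2}m\hat{V}(\hat{c}))$ used in (2.4).

function f = pFq(a, b, z, nmax)
% Evaluate the hypergeometric series pFq(a1,...,ap; b1,...,bq; z) by direct
% summation, truncated after nmax terms. a and b are (possibly empty)
% vectors of upper and lower parameters.
term = 1; f = 1; % the n = 0 term of the series
for n = 0:nmax-1
    term = term*prod(a + n)/prod(b + n)*z/(n + 1); % ratio of consecutive terms
    f = f + term;
end
end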
Appendix B Matlab computations
Matlab code Monte Carlo simulation homoscedastic error terms

tic % start time measure
% Set initial values for main parameters:
nobs = 10; nvar = 4; nreps = 10000; alpha = 0.05; p = 0.25;
sigma2 = 1/2; % error standard deviation, so that sigma^2 = 0.25
% Set initial values for model elements:
b = 2/10*ones(nvar,1); % first true beta's = 0.2
b(nvar) = log(1+p); % dummy coefficient equals log(1+p)
b(1) = 1; % first beta equals 1
normv = norminv(1-1/2*alpha,0,1); % upper normal value for confidence intervals
tv = tinv(1-1/2*alpha,nobs-nvar); % calculate the upper t-value s.t. p(t>tv) = 1/2*alpha
% Create storage for the used arrays:
kenp = ones(nreps,1); % space for Kennedy estimator
bout = zeros(nvar,nreps); % storage for coefficient estimates
y = zeros(nobs,1); % storage for y values
cl = zeros(nreps,1); cu = zeros(nreps,1); ncu = zeros(nreps,1);
ncl = zeros(nreps,1); % storage for confidence intervals
sum1 = 0; % initialize coverage counters
sum2 = 0;
% Loop in which one random sample is created and confidence intervals are constructed
for i = 1:nreps
x = [ones(nobs,1) randn(nobs,nvar-1)]; % random x for every replication
x(:,nvar) = vertcat(ones(nobs/2,1),zeros(nobs/2,1)); % replace last column of x by the dummy variable
evec = sigma2*randn(nobs,1); % random sample of errors
y = exp(x*b + evec); % generate true values of y
logy = log(y); % transform y
bout(:,i) = lscov(x,logy); % save ols estimators of beta's
e = logy - x*bout(:,i); % calculate residuals
s2 = (transpose(e)*e)/(nobs-nvar); % calculate s^2
varbout = inv(transpose(x)*x)*s2; % calculate estimated variance of bout
vardum = varbout(nvar,nvar); % variance of the estimated dummy coefficient
c = bout(nvar,i); % dummy coefficient
% Exact confidence intervals
cl(i) = exp(c-tv*sqrt(vardum))-1; cu(i) = exp(c+tv*sqrt(vardum))-1;
% count number of times the confidence intervals contain the true value of p
if cl(i)<p && cu(i)>p sum1 = sum1+1; end
% Normal approximation of the confidence intervals
kenp(i) = exp(bout(nvar,i)-1/2*vardum)-1; % calculate the Kennedy estimator of p
varapp = exp(2*c)*(exp(-vardum)-exp(-2*vardum)); % calculate the approximately unbiased estimator of the variance of kenp
ncl(i) = kenp(i) - normv*sqrt(varapp); ncu(i) = kenp(i) + normv*sqrt(varapp);
% count number of times the approximated confidence intervals contain the true value of p
if ncl(i)<p && ncu(i)>p sum2 = sum2+1; end
end
conf = [mean(cl) mean(cu)]; % average confidence intervals
nconf = [mean(ncl) mean(ncu)]; % average approximated confidence intervals
perc = sum1/nreps; % probability confidence interval contains p
nperc = sum2/nreps; % probability approximated confidence interval contains p
Eind = [conf perc nconf nperc]; disp(Eind)
toc % end timing
Matlab code Monte Carlo simulation heteroskedastic error terms using GLS

tic; % start time measure
% Set initial values for main parameters:
nobs = 500; nvar = 4; nreps = 10000; alpha = 0.05; p = 0.25;
sigma1 = 0.3162; sigma2 = 0.9487; % group standard deviations (sigma1^2 = 0.1, sigma2^2 = 0.9)
b = 2/10*ones(nvar,1); % first true beta's = 0.2
b(nvar) = log(1+p); % dummy coefficient equals log(1+p)
b(1) = 1; % first beta equals 1
normv = norminv(1-1/2*alpha,0,1); % upper normal value
tv = tinv(1-1/2*alpha,nobs-nvar); % calculate the upper t-value s.t. p(t>tv) = 1/2*alpha
% Create storage for the used arrays:
kenp = ones(nreps,1); %space for kennedy estimator
bout = zeros(nvar,nreps); % storage for estimates
y = zeros(nobs,1); % Storage for y values
cl=zeros(nreps,1); cu=zeros(nreps,1); ncu=zeros(nreps,1);
ncl=zeros(nreps,1); %storage for confidence intervals
boutgls = zeros(nvar,nreps); %storage for the gls estimators
sum1 = 0; %initialize sum variable
sum2 = 0 ;
for i = 1:nreps
x = [ones(nobs,1) 1/2*randn(nobs,nvar-1)]; % random x for every replication
x(:,nvar) = vertcat(ones(nobs/2,1),zeros(nobs/2,1)); % replace last column of x by the dummy variable
evec = (sigma1*x(:,nvar)+sigma2*(1-x(:,nvar))).*randn(nobs,1); % heteroskedastic random sample of errors
y = exp(x*b + evec); % generate true values of y
logy = log(y); % transform y
% GLS
bout(:,i) = lscov(x,logy); % save ols estimators of beta's
e1 = logy - x*bout(:,i); % calculate ols residuals
z = [ones(nobs,1) x(:,4)]; % regress residuals on a constant and the dummy
gam = lscov(z,log(e1.*e1)); % estimate variance model
gam(1,1) = gam(1,1)+1.27; % correction of the constant in the exponential model
omega = diag(exp(z*gam)); % estimate omega matrix
boutgls(:,i) = (transpose(x)*inv(omega)*x)\(transpose(x)*inv(omega)*logy); % estimate gls coefficients
egls = logy - x*boutgls(:,i); % gls residuals
s2 = (transpose(egls)*egls)/(nobs-nvar); % residual variance of the transformed model (not used below)
varboutgls = inv(transpose(x)*inv(omega)*x); % calculate estimated variance of boutgls
vardum = varboutgls(nvar,nvar); % variance of the estimated dummy coefficient
% calculate the confidence intervals
c = boutgls(nvar,i); % gls coefficient estimate
cl(i) = exp(c-tv*sqrt(vardum))-1; cu(i) = exp(c+tv*sqrt(vardum))-1;
% count number of times the confidence intervals contain the true value of p
if cl(i)<p && cu(i)>p sum1 = sum1+1; end
% normal approximation of the confidence intervals
kenp(i) = exp(boutgls(nvar,i)-1/2*vardum)-1; % calculate the Kennedy estimator of p
varapp = exp(2*c)*(exp(-vardum)-exp(-2*vardum)); % calculate the approximately unbiased estimator of the variance of kenp
ncl(i) = kenp(i) - normv*sqrt(varapp); ncu(i) = kenp(i) + normv*sqrt(varapp); % normal approximated confidence limits
% count number of times the approximated confidence intervals contain the true value of p
if ncl(i)<p && ncu(i)>p sum2 = sum2+1; end
end
conf = [mean(cl) mean(cu)]; % average confidence intervals
nconf = [mean(ncl) mean(ncu)]; % average approximated confidence intervals
perc = sum1/nreps; % probability confidence interval contains p
nperc = sum2/nreps; % probability approximated confidence interval contains p
Eind = [nobs conf perc nconf nperc]; disp(Eind)
toc; % end timing
Matlab code Monte Carlo simulation heteroskedastic error terms using HC estimators

tic; % start timing
% Set initial values for main parameters:
nobs = 500; nvar = 4; nreps = 10000; alpha = 0.05; p = 0.25;
sigma1 = 0.3162; sigma2 = 0.9487; % group standard deviations (sigma1^2 = 0.1, sigma2^2 = 0.9)
hac = 2; % HC variant: hac = 2 gives the small sample correction e^2/(1-h)
% Set initial values for model elements:
b = 2/10*ones(nvar,1); % first true beta's = 0.2
b(nvar) = log(1+p); % dummy coefficient equals log(1+p)
b(1) = 1; % first beta equals 1
% Create storage for the used arrays:
kenp = ones(nreps,1); % space for Kennedy estimator
bout = zeros(nvar,nreps); % storage for estimates
y = zeros(nobs,1); % storage for y values
cl = zeros(nreps,1); cu = zeros(nreps,1); ncu = zeros(nreps,1);
ncl = zeros(nreps,1); % storage for confidence intervals
hc = zeros(nobs,1);
normv = norminv(1-1/2*alpha,0,1); % upper normal value
tv = tinv(1-1/2*alpha,nobs-nvar); % calculate the upper t-value s.t. p(t>tv) = 1/2*alpha
sum1 = 0; % initialize coverage counters
sum2 = 0;
for i = 1:nreps
x = [ones(nobs,1) 1/2*randn(nobs,nvar-1)]; % random x for every replication
x(:,nvar) = vertcat(ones(nobs/2,1),zeros(nobs/2,1)); % replace last column of x by the dummy variable
evec = (sigma1*x(:,nvar)+sigma2*(1-x(:,nvar))).*randn(nobs,1); % heteroskedastic random sample of errors
y = exp(x*b + evec); % generate true values of y
logy = log(y); % transform y
% OLS
bout(:,i) = lscov(x,logy); % save ols estimators of beta's
e = logy - x*bout(:,i); % calculate residuals
H = x*inv(transpose(x)*x)*transpose(x); % hat matrix H = X(X'X)^(-1)X'
% estimate variance of bout
e2 = e.*e./((1-diag(H)).^(hac-1)); % squared residuals with small sample correction
varbout = inv(transpose(x)*x)*transpose(x)*diag(e2)*x*inv(transpose(x)*x); % calculate white estimated variance of bout
vardum = varbout(nvar,nvar); % variance of the estimated dummy coefficient
% calculate the confidence intervals
c = bout(nvar,i); % ols coefficient estimate
cl(i) = exp(c-tv*sqrt(vardum))-1; cu(i) = exp(c+tv*sqrt(vardum))-1;
% count number of times the confidence intervals contain the true value of p
if cl(i)<p && cu(i)>p sum1 = sum1+1; end
kenp(i) = exp(bout(nvar,i)-1/2*vardum)-1; % calculate the Kennedy estimator of p
varapp = exp(2*c)*(exp(-vardum)-exp(-2*vardum)); % calculate the approximately unbiased estimator of the variance of kenp
% calculate the normal approximation of the confidence interval of p
ncl(i) = kenp(i) - normv*sqrt(varapp); ncu(i) = kenp(i) + normv*sqrt(varapp);
% count number of times the approximated confidence intervals contain the true value of p
if ncl(i)<p && ncu(i)>p sum2 = sum2+1; end
end
conf = [mean(cl) mean(cu)]; % average confidence intervals
nconf = [mean(ncl) mean(ncu)]; % average approximated confidence intervals
perc = sum1/nreps; % probability confidence interval contains p
nperc = sum2/nreps; % probability approximated confidence interval contains p
Eind = [nobs conf perc nconf nperc]; disp(Eind)
toc; % end timing