Optimal prediction in loglinear models in the presence of endogenous variables

Ruben Visser

Student number: 10063560

Econometrics & Operations Research

Mentor: K.J. van Garderen

August 2014

Amsterdam

Index

1. Introduction
2. Theoretical framework
3. Research design
4. Analysis
5. Conclusion
6. Discussion
7. References

1. Introduction

Percentage changes have always played an important role in economics. When it comes to GDP, inflation or a country’s debt, most people have no idea what the actual figures are, since they are primarily interested in the percentage change relative to the previous year.

Many economic articles have been written about percentage changes, and they often make use of a logarithmic model. When the influence of a single variable on the total percentage change is studied, a loglinear regression model is most often used. This logarithmic transformation of the standard linear model has the advantage that it can be estimated by means of ordinary least squares (OLS). But as van Garderen (2001) stated: ‘However, one is generally interested in predicting the original variables, not the variables in logs’.

Taking the inverse of this logarithmic transformation results in a biased predictor. In his article, van Garderen derived an exact solution to this problem. He proved that, by means of an inverse Laplace transformation, an exact predictor can be derived for the original variable. Furthermore, he compared this predictor with previously proposed predictors for the original variables.

In the exponential model used in his paper, van Garderen made several strict assumptions about the variance and about the nature of the variables used in the model. This paper extends that research on predicting original variables in loglinear models. By loosening some of the assumptions made by van Garderen, the relevance of the derived predictor is tested in less favourable situations.

In this paper the effect of endogenous variables will be investigated. Of particular interest is the question of how relevant the exact predictor still is, when loosening the assumption of strictly exogenous variables. Since endogeneity is a very common problem in econometric research and can be of great influence on the final results, this matter should not be considered negligible.

Endogeneity can be present in various degrees of severity, and attention is paid to the differences by comparing the mean squared prediction error (MSPE) for various degrees of endogeneity. Since the exact predictor is derived under the assumption that the variables are all exogenous, it is expected that the MSPE will be higher when the variables X are endogenous variables, especially when the variables are strongly endogenous.

First the model as used by Van Garderen (2001) will be reconstructed and the results will be replicated. In this model all standard assumptions of the linear regression model apply, which makes it a suitable benchmark for research on loglinear models. Second, this model will be adapted for endogenous variables. Econometric theory states that the standard Gauss-Markov assumptions, which will be considered later on, no longer apply. However, the purpose of this paper is to investigate the influence of loosening these assumptions. Therefore, finally, the MSPEs of various predictors will be compared for various levels of endogeneity.

2. Theoretical framework

If the assumption of exogeneity is violated, $x_i$ and $\varepsilon_i$ are correlated (e.g., Heij et al., 2004). Since the standard regression model is

$$y_i = x_i'\beta + \varepsilon_i$$

it is intuitively clear that it is hard to estimate the individual contributions to the dependent variable $y_i$ when $x_i$ and $\varepsilon_i$ fail to be independent.

Following basic econometric theory, the OLS estimator b satisfies

$$b = \beta + \left(\tfrac{1}{n} X'X\right)^{-1} \tfrac{1}{n} X'\varepsilon$$

which makes it clear that b is consistent if

$$\operatorname{plim}\left(\tfrac{1}{n} X'\varepsilon\right) = 0$$

In that case the variables X are said to be (weakly) exogenous. If, on the other hand,

$$\operatorname{plim}\left(\tfrac{1}{n} X'\varepsilon\right) \neq 0$$

then b is inconsistent and the explanatory variables X are said to be endogenous.
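The consistency condition can be illustrated with a small simulation. The sketch below is not part of the original study: it assumes a single regressor without intercept, unit-variance normal draws, and a hypothetical construction x = w + rho * eps to make the regressor correlated with the error. Under these assumptions the OLS slope converges to beta + rho / (1 + rho^2) rather than to beta.

```python
import random

def ols_slope(x, y):
    """OLS slope without intercept: b = (x'x)^{-1} x'y."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def simulate_ols(n, rho, beta=1.0, seed=0):
    """Draw y = beta*x + eps, where x = w + rho*eps shares a component with eps."""
    rng = random.Random(seed)
    eps = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x = [rng.gauss(0.0, 1.0) + rho * e for e in eps]   # rho = 0 gives an exogenous x
    y = [beta * xi + e for xi, e in zip(x, eps)]
    return ols_slope(x, y)

b_exog = simulate_ols(n=100_000, rho=0.0)    # plim (1/n) x'eps = 0
b_endog = simulate_ols(n=100_000, rho=0.8)   # plim (1/n) x'eps = 0.8
print(f"exogenous x:  b = {b_exog:.3f}  (true beta = 1)")
print(f"endogenous x: b = {b_endog:.3f}  (probability limit = {1 + 0.8 / 1.64:.3f})")
```

Increasing n does not remove the gap in the endogenous case, which is what inconsistency means here.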

This paper considers loglinear models of the form

$$Y_t = \exp\{x_t'\beta\}\exp\{u_t\}, \quad u_t \sim IIN(0, \sigma^2), \quad t = 1, \dots, T$$

Hence, after a logarithmic transformation,

$$\log(Y_t) = x_t'\beta + u_t$$

which can be estimated by ordinary least squares (OLS). Under the standard Gauss-Markov assumptions,

$$\widehat{\log(Y_t)} = x_t'\hat{\beta}$$

is the best linear unbiased estimator (BLUE).

However,

$$\hat{Y}_t = \exp\{x_t'\hat{\beta}\} \neq E[Y_t]$$

due to both the distribution of the estimator and the random nature of the disturbance term (Van Garderen, 2001). A solution to this problem is introduced in that paper: by using moment generating functions and an inverse Laplace technique, Van Garderen derived an exact unbiased predictor in the exponential model.

Theorem 1. The optimal unbiased predictor of $Y_p$ in the loglinear model is

$$\hat{Y}_p^* = \exp\{x_p'\hat{\beta}\}\;{}_0F_1\!\left(m;\ m z_p \hat{\sigma}^2\right),$$

where $m = (T-K)/2$, $z_p = \tfrac{1}{2}(1 - a_p)$, $a_p = x_p'(X'X)^{-1}x_p$, and ${}_0F_1$ is a hypergeometric function.
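The predictor of Theorem 1 is straightforward to evaluate numerically. The sketch below is an illustration, not the code used in the study: the function names are hypothetical, the hypergeometric function is evaluated with its defining power series, and the inputs (the fitted index x_p'b, the leverage term a_p and the variance estimate) are assumed to be already available as scalars.

```python
import math

def hyp0f1(b, z, tol=1e-15, max_terms=500):
    """Power series 0F1(; b; z) = sum_j z^j / ((b)_j j!), (b)_j the rising factorial."""
    term, total = 1.0, 1.0
    for j in range(max_terms):
        term *= z / ((b + j) * (j + 1))   # ratio of consecutive series terms
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def exact_predictor(xb_hat, a_p, sigma2_hat, T, K):
    """exp(x_p'b) * 0F1(m; m * z_p * sigma2_hat), m = (T-K)/2, z_p = (1 - a_p)/2."""
    m = (T - K) / 2.0
    z_p = 0.5 * (1.0 - a_p)
    return math.exp(xb_hat) * hyp0f1(m, m * z_p * sigma2_hat)

# Hypothetical inputs: fitted index 1.0, leverage 0.05, variance estimate 0.5.
y_det = math.exp(1.0)
y_star = exact_predictor(1.0, a_p=0.05, sigma2_hat=0.5, T=100, K=5)
print(f"deterministic: {y_det:.4f}   corrected: {y_star:.4f}")
```

For large m the correction factor approaches exp(z_p * sigma2_hat), so the corrected predictor lies above the deterministic one whenever a_p < 1. If SciPy is available, `scipy.special.hyp0f1` can be used in place of the hand-written series.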

Van Garderen compared this predictor with previously used predictors such as the so-called deterministic predictor and the closed-form predictor. He compared these predictors in terms of the mean squared prediction error and found some remarkable results. The deterministic predictor can be superior to the exact predictor of Theorem 1 in small samples, but the closed-form predictor is inferior to the exact predictor for all sample sizes. These results imply that for large sample sizes the exact predictor is always superior to previously proposed predictors, provided the variables X are all exogenous and the disturbances are distributed as $u_t \sim IIN(0, \sigma^2)$.

However, in practice, this might not always be the case. Therefore this paper investigates the situation in which X contains endogenous variables. If the normal loglinear model is considered, but $x_i$ and $u_i$ are correlated, IV-estimation can be used to find consistent estimators for β. In linear IV-estimation the following model is considered:

$$y_i = x_i'\beta + u_i$$

$$x_i = z_i'\pi + v_i$$

where rank(π) = k. We assume the presence of endogeneity and assume that the disturbances are homoskedastic:

$$\begin{pmatrix} u_i \\ v_i \end{pmatrix} \sim N(0, \Sigma), \qquad \Sigma = \begin{pmatrix} \sigma_{uu} & \sigma_{uv} \\ \sigma_{vu} & \sigma_{vv} \end{pmatrix}$$

In this model $x_i$ and $u_i$ are correlated, $x_i$ and $z_i$ are correlated, and it is assumed that there is no correlation between $z_i$ and $u_i$. This makes the instrumental variables relevant and valid.

When the variables X fail to be exogenous, the Gauss-Markov assumptions no longer apply. As a consequence the estimators for β will no longer be the best linear unbiased estimators (BLUE). The most commonly used technique to resolve the problem of endogeneity is IV-estimation. The use of instrumental variables was first introduced by Philip and Sewall Wright (1928). In IV-estimation one generally uses a two-stage least squares regression to estimate the various coefficients.

In the two-stage least squares regression used in this paper the following model is considered:

$$Y = \log(y_t) = X\beta + u$$

where Y : n × 1, X : n × k, β : k × 1 and u : n × 1.

A matrix of instruments is defined as Z : n × l. The two-stage least squares (2SLS) estimator is defined as:

$$\hat{\beta}_{2SLS} = \left(X'Z(Z'Z)^{-1}Z'X\right)^{-1} X'Z(Z'Z)^{-1}Z'Y$$

where $\hat{\beta}_{2SLS}$ : k × 1. This can be written as:

$$\hat{\beta}_{2SLS} = (X'P_zX)^{-1}X'P_zY$$

where $P_z$ is the projection matrix:

$$P_z = Z(Z'Z)^{-1}Z'$$
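The two stages of the estimator can be sketched for the simplest case. The example below is an illustration under stated assumptions, not the thesis code: it uses one regressor and one instrument without intercept, a unit-variance instrument, and hypothetical parameter values (beta = 0.2, pi = 0.9, rho = 0.8, error variance 0.5). In this scalar, just-identified case the 2SLS formula reduces to the ratio z'y / z'x, which the code checks against the explicit two-stage computation.

```python
import random

def dot(a, b):
    """Inner product of two equally long sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def tsls_scalar(log_y, x, z):
    """2SLS with one regressor and one instrument (no intercept)."""
    pi_hat = dot(z, x) / dot(z, z)                  # first stage: regress x on z
    x_hat = [pi_hat * zi for zi in z]               # fitted values, x_hat = P_z x
    return dot(x_hat, log_y) / dot(x_hat, x_hat)    # second stage on x_hat

def iv_ratio(log_y, x, z):
    """Closed form of the same estimator in the scalar case: z'y / z'x."""
    return dot(z, log_y) / dot(z, x)

def simulate(n, beta=0.2, pi=0.9, rho=0.8, sigma=0.5 ** 0.5, seed=0):
    """Hypothetical data: x = pi*z + v with v = rho*u + (1-rho)*eta, log y = beta*x + u."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    u = [rng.gauss(0.0, sigma) for _ in range(n)]
    eta = [rng.gauss(0.0, sigma) for _ in range(n)]
    x = [pi * zi + rho * ui + (1.0 - rho) * ei for zi, ui, ei in zip(z, u, eta)]
    log_y = [beta * xi + ui for xi, ui in zip(x, u)]
    return log_y, x, z

log_y, x, z = simulate(n=50_000)
b_2sls = tsls_scalar(log_y, x, z)
b_ols = dot(x, log_y) / dot(x, x)
print(f"2SLS: {b_2sls:.3f}   OLS: {b_ols:.3f}   true beta: 0.200")
```

With these settings OLS settles well above the true coefficient, while the 2SLS estimate stays close to it.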

To compare the derived predictors, the mean squared prediction error (MSPE) will be used. The MSPE is defined as:

$$MSPE = E\left[\left(Y_p - \hat{Y}_p^*\right)^2\right]$$

If the true value of β were known, the predictor of $Y_p$ would be:

$$Y_{real} = E[Y_p \mid x_p] = \exp\left(\beta' x_p + \tfrac{1}{2}\sigma^2\right)$$

which is a property of the lognormal distribution and follows from the moment generating function.

The theoretically lowest possible value for the MSPE is equal to:

$$\lim_{n \to \infty} MSPE = V[Y_p - Y_{real}]$$
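As a short worked step, the lognormal mean used for Y_real and the variance that defines the lower bound both follow from the moment generating function of the normal disturbance. The derivation below is a sketch that conditions on x_p and assumes u_p ~ N(0, σ²) independent of x_p, i.e. the exogenous benchmark case; it is added for illustration and is not taken from the original text.

```latex
\begin{align*}
M_u(s) &= E\left[e^{s u_p}\right] = \exp\!\left(\tfrac{1}{2}\sigma^2 s^2\right), \\
E[Y_p \mid x_p] &= \exp(x_p'\beta)\, M_u(1)
  = \exp\!\left(x_p'\beta + \tfrac{1}{2}\sigma^2\right) = Y_{real}, \\
E[Y_p^2 \mid x_p] &= \exp(2x_p'\beta)\, M_u(2)
  = \exp\!\left(2x_p'\beta + 2\sigma^2\right), \\
V[Y_p - Y_{real} \mid x_p] &= E[Y_p^2 \mid x_p] - Y_{real}^2
  = \exp\!\left(2x_p'\beta + \sigma^2\right)\left(e^{\sigma^2} - 1\right).
\end{align*}
```

With σ² = 0.5 the factor $e^{\sigma^2} - 1$ is roughly 0.65, so the bound is far from zero even when β is known exactly.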

3. Research design

In order to perform various tests on the exponential model, a Monte Carlo simulation is used. First the standard exponential model is simulated under the standard Gauss-Markov assumptions. These assumptions are necessary for the mathematical proof of the exact unbiased predictor derived by van Garderen (2001). After simulating this model it can be adjusted for various changes, such as heteroskedasticity, serial correlation and, in the case considered here, endogeneity. Since the mathematical proof of the exact unbiased predictor depends on the assumption that $x_i$ and $u_i$ are uncorrelated, it is to be expected that the deviation of the predictor of $Y_p$ from the “real” $Y_p$ will be higher in terms of the MSPE if X contains endogenous variables.

Van Garderen compared his exact unbiased predictor to previously proposed predictors in terms of the MSPE. This paper contributes to that article in two ways. First, the relevance of this predictor is tested when the variables X fail to be exogenous. Second, if the exact predictor turns out to be relevant, the gain in MSPE is examined as a function of the degree of endogeneity.

The equations used for generating the data are:

$$Y_t = \exp\{x_t'\beta\}\exp\{u_t\}, \quad u_t \sim IIN(0, \sigma^2), \quad t = 1, \dots, T$$

$$x_t = z_t'\pi + v_t$$

$$v_t = \rho u_t + (1 - \rho)\eta_t, \quad \text{with } \eta_t \sim IIN(0, \sigma^2)$$

The parameter ρ is introduced to create a single parameter that determines the correlation between the variables $x_t$ and $u_t$.

There are two parameters that can influence the relevance of the predictor: the value of Π, the value of ρ, or both can be changed. In this research paper, the value of ρ will be varied.

The value of ρ determines the severity of the endogeneity of the variables X. If ρ is close to one, then the variables X are strongly endogenous. If ρ is close to zero the variables X are weakly endogenous.
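How ρ translates into an actual correlation can be worked out from the equation for v_t. The sketch below is an illustration, not part of the original study: it assumes u_t and η_t are independent with the same variance, in which case corr(u_t, v_t) = ρ / sqrt(ρ² + (1 − ρ)²), and it checks this against a simulated sample. The function names are hypothetical, and the correlation between x_t and u_t is smaller because x_t also contains the instrument component.

```python
import math
import random

def implied_corr(rho):
    """corr(u_t, v_t) when v_t = rho*u_t + (1-rho)*eta_t and var(u) = var(eta)."""
    return rho / math.sqrt(rho ** 2 + (1.0 - rho) ** 2)

def sample_corr(a, b):
    """Plain sample correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b))
    va = sum((ai - ma) ** 2 for ai in a)
    vb = sum((bi - mb) ** 2 for bi in b)
    return cov / math.sqrt(va * vb)

def simulated_corr(rho, n=20_000, sigma=0.5 ** 0.5, seed=0):
    """Simulate u, eta and v for a given rho and return the sample corr(u, v)."""
    rng = random.Random(seed)
    u = [rng.gauss(0.0, sigma) for _ in range(n)]
    eta = [rng.gauss(0.0, sigma) for _ in range(n)]
    v = [rho * ui + (1.0 - rho) * ei for ui, ei in zip(u, eta)]
    return sample_corr(u, v)

for rho in (0.0, 0.2, 0.5, 0.8, 1.0):
    print(f"rho={rho:.1f}  implied={implied_corr(rho):.3f}  "
          f"simulated={simulated_corr(rho):.3f}")
```

Under the same assumption the variance of v_t equals (ρ² + (1 − ρ)²)σ², so it also changes with ρ; this is worth keeping in mind when comparing MSPEs across values of ρ.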

In the paper written by Van Garderen (2001), several assumptions were made for the exact optimal prediction in loglinear models. One of these assumptions is the exogeneity condition on the variables X. For testing the influence of loosening this assumption a hypothesis tree is used. The general hypothesis tested is: ‘the exact optimal predictor in the loglinear model is still useful when the variables X are endogenous’.

Several assumptions are made about the matrix of instruments:

$$\operatorname{plim}\left(\tfrac{1}{n} Z'u\right) = 0$$

$$\operatorname{plim}\left(\tfrac{1}{n} Z'X\right) = Q_{zx}$$

$$\operatorname{plim}\left(\tfrac{1}{n} Z'Z\right) = Q_{zz}$$

The first equation is the exogeneity condition for Z. The second equation guarantees the relevance of the instruments. The last is the stability condition for the matrix Z (e.g., Heij et al., 2004). Instruments can play a role in statistical analysis when X fails to be exogenous:

$$\operatorname{plim}\left(\tfrac{1}{n} X'\varepsilon\right) \neq 0$$

The purpose of this case study is to find out how the endogeneity of the variables X influences the exact optimal predictor for y in the logarithmic model. Van Garderen (2001) proved that using a correction factor based on the hypergeometric function leads to an unbiased predictor in the loglinear model. However, it is not yet clear how well this correction factor works when the variables X are endogenous. Van Garderen (2001) derived this predictor for Y by making use of the fact that $\hat{\beta}$ and $\hat{\sigma}^2$ are independent. Therefore the density function factorizes into separate functions of $\hat{\beta}$ and $\hat{\sigma}^2$, and the problem can be solved in two steps. After solving these, the optimal exact unbiased predictor was found and defined as

$$\hat{Y}_p^* = \exp\{x_p'\hat{\beta}\}\;{}_0F_1\!\left(m;\ \tfrac{m}{2}\left(1 - x_p'(X'X)^{-1}x_p\right)\hat{\sigma}^2\right)$$

The term $(X'X)^{-1}$ is based on the variance of the distribution of $b_{OLS}$. However, since we are now interested in the predictor based on $\hat{\beta}_{IV}$, we have to consider the distribution of $b_{IV}$. Basic econometric theory states that in large enough finite samples $b_{IV}$ is approximately distributed as

$$b_{IV} \approx N\left(\beta, \sigma^2(X'P_zX)^{-1}\right) = N\left(\beta, \sigma^2(\hat{X}'\hat{X})^{-1}\right)$$
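The equality of the two covariance expressions rests on the projection matrix being symmetric and idempotent. A short verification, using only the definitions $P_z = Z(Z'Z)^{-1}Z'$ and $\hat{X} = P_zX$ from the text:

```latex
\begin{align*}
P_z' &= \left(Z(Z'Z)^{-1}Z'\right)' = Z(Z'Z)^{-1}Z' = P_z, \\
P_z P_z &= Z(Z'Z)^{-1}Z'Z(Z'Z)^{-1}Z' = Z(Z'Z)^{-1}Z' = P_z, \\
\hat{X}'\hat{X} &= (P_zX)'(P_zX) = X'P_z'P_zX = X'P_zX .
\end{align*}
```

The same steps give $\hat{X}'Y = X'P_zY$, so regressing log(Y) on $\hat{X}$ in the second stage reproduces the 2SLS estimator.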

Therefore the correction term has to be adjusted for the variance of the estimator. This results in the following predictor of Y:

$$\hat{Y}_{p,IV}^* = \exp\{x_p'\hat{\beta}_{IV}\}\;{}_0F_1\!\left(m;\ \tfrac{m}{2}\left(1 - x_p'(\hat{X}'\hat{X})^{-1}x_p\right)\hat{\sigma}^2_{IV}\right),$$

where $m = (T-K)/2$ and ${}_0F_1$ is a hypergeometric function.

This predictor will be compared to the standard deterministic predictor for Y:

$$\hat{Y}_{p,IV} = \exp\{x_p'\hat{\beta}_{IV}\}$$

In order to calculate the MSPE for the various predictors following from the two-stage least squares regression, the following operations are performed.

A matrix of instrumental variables is simulated:

$$Z \sim N(0, \xi^2), \quad Z : n \times k$$

The matrix Π determines the correlation between X and Z. Here p = 0.9:

$$\Pi = \begin{pmatrix} p & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & p \end{pmatrix}$$

Generate a random error vector:

$$v \sim N(0, \sigma^2), \quad v : n \times k$$

Generate a matrix X which is correlated with Z and depends on v:

$$X = Z\Pi + v$$

ρ is varied over values 0 ≤ ρ ≤ 1. In this definition X is perfectly endogenous if ρ = 1 and perfectly exogenous if ρ = 0.

To make sure that X is endogenous, v depends on u:

$$v_t = \rho u_t + (1 - \rho)\eta_t, \quad \text{with } \eta_t \sim IIN(0, \sigma^2)$$

Generate the true values of Y:

$$Y = \exp\{Xb + u\}, \qquad \log(Y) = Xb + u$$

First-stage regression of 2SLS:

$$\hat{X} = Z(Z'Z)^{-1}Z'X$$

Second-stage regression of 2SLS:

$$b_{IV} = (\hat{X}'\hat{X})^{-1}\hat{X}'\log(Y)$$

Save the residuals for the calculation of the correction term:

$$e_{IV} = \log(Y) - Xb_{IV}$$

Predictor of Y in the loglinear model:

$$\hat{Y}_{p,IV} = \exp\{x_p'\hat{\beta}_{IV}\}$$

And with the correction introduced by Van Garderen (2001), adjusted for IV-estimation:

$$\hat{Y}_{p,IV}^* = \exp\{x_p'\hat{\beta}_{IV}\}\;{}_0F_1\!\left(m;\ \tfrac{m}{2}\left(1 - x_p'(\hat{X}'\hat{X})^{-1}x_p\right)\hat{\sigma}^2_{IV}\right),$$

where $m = (T-K)/2$ and ${}_0F_1$ is a hypergeometric function.

The practical usefulness of this newly defined predictor will be tested for various degrees of endogeneity, i.e. for various values of ρ.
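The steps above can be sketched as a small Monte Carlo routine. This is a simplified illustration and not the code used for the reported results: it assumes a single regressor and a single instrument without intercept (K = 1), a unit-variance instrument, b = 0.2, σ² = 0.5, an out-of-sample observation drawn from the same process as the prediction point, and a number of replications far smaller than in the reported analysis. All function names are hypothetical.

```python
import math
import random

def hyp0f1(b, z, tol=1e-15, max_terms=500):
    """Power series 0F1(; b; z) = sum_j z^j / ((b)_j j!)."""
    term, total = 1.0, 1.0
    for j in range(max_terms):
        term *= z / ((b + j) * (j + 1))
        total += term
        if abs(term) < tol * abs(total):
            break
    return total

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def one_replication(rng, T, rho, beta=0.2, pi=0.9, sigma2=0.5, K=1):
    """Simulate T+1 observations, estimate on the first T, predict the last one."""
    sd = math.sqrt(sigma2)
    z = [rng.gauss(0.0, 1.0) for _ in range(T + 1)]      # instrument (assumed variance 1)
    u = [rng.gauss(0.0, sd) for _ in range(T + 1)]
    eta = [rng.gauss(0.0, sd) for _ in range(T + 1)]
    v = [rho * ui + (1.0 - rho) * ei for ui, ei in zip(u, eta)]
    x = [pi * zi + vi for zi, vi in zip(z, v)]
    log_y = [beta * xi + ui for xi, ui in zip(x, u)]

    z_in, x_in, ly_in = z[:T], x[:T], log_y[:T]
    pi_hat = dot(z_in, x_in) / dot(z_in, z_in)           # first stage
    x_hat = [pi_hat * zi for zi in z_in]
    xhx = dot(x_hat, x_hat)
    b_iv = dot(x_hat, ly_in) / xhx                       # second stage
    resid = [ly - b_iv * xi for ly, xi in zip(ly_in, x_in)]
    s2_iv = dot(resid, resid) / (T - K)                  # residuals use x, not x_hat

    x_p, y_p = x[T], math.exp(log_y[T])
    m = (T - K) / 2.0
    a_p = x_p * x_p / xhx
    y_det = math.exp(x_p * b_iv)                         # deterministic predictor
    y_cor = y_det * hyp0f1(m, 0.5 * m * (1.0 - a_p) * s2_iv)  # corrected predictor
    return y_p, y_det, y_cor

def mspe(T, rho, reps=500, seed=1):
    """Monte Carlo MSPE of the deterministic and the corrected predictor."""
    rng = random.Random(seed)
    se_det = se_cor = 0.0
    for _ in range(reps):
        y_p, y_det, y_cor = one_replication(rng, T, rho)
        se_det += (y_p - y_det) ** 2
        se_cor += (y_p - y_cor) ** 2
    return se_det / reps, se_cor / reps

for rho in (0.0, 0.4, 0.8):
    det, cor = mspe(T=100, rho=rho)
    print(f"rho={rho:.1f}  MSPE deterministic={det:.4f}  MSPE corrected={cor:.4f}")
```

With so few replications the estimates are noisy, so the printed numbers should be read as an indication of the mechanics only; a multi-regressor version would replace the scalar sums with matrix products.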


4. Analysis

In order to do an in-depth analysis of the optimal predictor in the loglinear model, it is necessary to generate the ‘normal’ model first. To verify the results of Van Garderen (2001), the mean squared prediction errors of the predictors with and without the correction term based on the hypergeometric function are compared. In the analysis a model is used with

#variables =5

#replications = 100000

In Figure 1 the MSPEs of the predictors for Y are shown as a function of σ² and T. It can be seen that for a higher number of observations, the difference in MSPE increases. This means that introducing the correction factor is especially useful when there is a large number of observations. These results agree with the results found earlier by van Garderen (2001).

Figure 1. 3-D Impression of the differences in MSPE with various values of T and σ with- and without the correction term as introduced by van Garderen (2001).

Furthermore it can be seen that for higher values of σ, implementing the correction term leads to a more severe reduction in MSPE.

When the variables are endogenous, IV-estimation has to be used in order to make the estimators consistent. However, in the loglinear model even a small deviation from the true value of b leads to a severe bias in the prediction. Since the variance of the estimator of β increases from $\sigma^2(X'X)^{-1}$ to $\sigma^2(X'P_zX)^{-1}$, the true values of b are reduced from 1 to 0.2 to obtain results which are still within range of the true value of y. Since any deviation from the true value of b enters the prediction through exp{Xb}, these adjustments are unavoidable.

[Figure 1: surface plot titled “Difference in MSPE with - without correction term”. Axes: σ (0.1 to 0.9) and T (10, 25, 50, 100, 500); vertical axis labelled MSPE(Ydet), MSPE(Yp), with colour bands from 0 to 0.5.]

To test the general hypothesis that the correction factor is still useful in IV-estimation of the loglinear model, the models are tested for various numbers of observations (with σ² = 0.5 and ρ = 0.8). The correlation matrix between X and Z is set to Π = 0.9·I_k. The concentration parameter equals:

$$\mu = \frac{\Pi'Z'Z\Pi}{\sigma_u}$$

For various numbers of observations the MSPE is calculated and the results are summarized in Table 1 and Figure 2.

MSPE in 2SLS

| #obs | 100 | 250 | 500 | 1000 |
|---|---|---|---|---|
| μ | 35 | 85 | 180 | 354 |
| E[Y-YpIV] | 0.2238 | 0.3608 | 0.32 | 0.3805 |
| E[(Y-YpIV)2] | 2.2005 | 2.4106 | 2.4016 | 2.3755 |
| E[(Y-Ydet)2] | 2.5188 | 2.7612 | 2.7545 | 2.7288 |
| V[(Y-Yreal)] | 2.1874 | 2.3407 | 2.3097 | 2.2769 |

Table 1. Summary of MSPE in the two-stage least squares model for various numbers of observations. The rows show, respectively: the concentration parameter, the bias of the predictor, the MSPE of the predictor derived by van Garderen adjusted for IV-estimation, the MSPE of the deterministic predictor and the theoretical lowest value of the MSPE.

Figure 2. Graphic representation of the MSPEs of, respectively: the predictor derived by van Garderen adjusted for IV-estimation, the deterministic predictor and the theoretical lowest value of MSPE.

[Figure 2: line plot titled “MSPE in 2SLS”. Horizontal axis: T (0 to 1200); vertical axis: MSPE (0 to 3); series: E[(Y-Yp)2], E[(Y-Ydet)2], V[(Y-Yreal)].]

From Table 1 it can be seen that the predictor is no longer unbiased. Even when the number of observations increases to 1000, E[Y − Yp] ≠ 0. However, this predictor does have a smaller MSPE than the deterministic predictor. From Figure 2 it can be seen that this predictor is considerably closer to the theoretically lowest value of MSPE than the deterministic predictor. This confirms the hypothesis that, for the given values of ρ and σ, adding the correction factor is indeed an improvement of the predictor in terms of the mean squared prediction error.
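The distance to the lower bound can be read off directly from Table 1. The short calculation below uses only the values reported in that table and subtracts the theoretical lowest MSPE from the MSPE of each predictor; the variable names are chosen here for illustration.

```python
# Values copied from Table 1 (columns: 100, 250, 500 and 1000 observations).
n_obs = [100, 250, 500, 1000]
mspe_corrected = [2.2005, 2.4106, 2.4016, 2.3755]   # E[(Y-YpIV)^2]
mspe_det = [2.5188, 2.7612, 2.7545, 2.7288]         # E[(Y-Ydet)^2]
lower_bound = [2.1874, 2.3407, 2.3097, 2.2769]      # V[Y-Yreal]

def excess(mspe, bound):
    """MSPE in excess of the theoretical lower bound, per sample size."""
    return [round(m - b, 4) for m, b in zip(mspe, bound)]

excess_corrected = excess(mspe_corrected, lower_bound)
excess_det = excess(mspe_det, lower_bound)

for n, ec, ed in zip(n_obs, excess_corrected, excess_det):
    print(f"T={n:5d}  corrected: +{ec:.4f}   deterministic: +{ed:.4f}")
```

In every column the corrected predictor lies less than 0.1 above the bound, while the deterministic predictor lies more than 0.3 above it.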

The second goal of this research paper is to investigate how the severity of endogeneity influences the benefit of the correction term. Therefore a simulation is run with a fixed σ (σ = 0.5) and a fixed number of observations (T = 500), but a varying parameter ρ, which is defined such that it determines the magnitude of endogeneity of the variables X. The results are shown in Figure 3.

Figure 3. Graphic illustration of the gain in MSPE when the severity of the endogeneity of the variables X increases.

In Figure 3 it can be seen that for higher values of ρ, the difference between the derived predictor and the deterministic predictor increases in terms of the MSPE. Tests with a higher number of observations gave the same results.

Furthermore, the derived predictor is much closer to the theoretically lowest possible value of MSPE. This suggests that for high levels of endogeneity this predictor, although it is not unbiased, still gives a more accurate prediction than the deterministic predictor.

[Figure 3: line plot titled “MSPE in 2SLS with varying ρ”. Horizontal axis: ρ (0 to 1); vertical axis: MSPE (0 to 4); series: E[(Y-Yp)2], E[(Y-Ydet)2], V[(Y-Yreal)].]

5. Conclusion

This paper presents a case study based on the previous work of van Garderen (2001). In his article he derived an exact unbiased predictor in loglinear models based on a hypergeometric function. The predictor derived by van Garderen is defined as follows:

$$\hat{Y}_p^* = \exp\{x_p'\hat{\beta}\}\;{}_0F_1\!\left(m;\ m z_p \hat{\sigma}^2\right),$$

where $m = (T-K)/2$, $z_p = \tfrac{1}{2}(1 - a_p)$, $a_p = x_p'(X'X)^{-1}x_p$, and ${}_0F_1$ is a hypergeometric function.

This predictor was derived under various assumptions on the variables X. The purpose of this paper is to investigate the added value of this correction term when one of these assumptions is loosened: the exogeneity condition on X.

A Monte Carlo simulation has been used to simulate the loglinear models. First the results of van Garderen have been reconstructed. As expected, the newly defined predictor is indeed a more accurate predictor than the commonly used predictor without correction term, $\hat{Y}_p = \exp\{x_p'\hat{\beta}\}$, especially when the number of observations increases.

When the variables X fail to be exogenous, the Gauss-Markov assumptions no longer apply and the OLS estimator fails to be the best linear unbiased estimator (BLUE). Therefore IV-estimation is used to derive consistent estimators for β. An instrumental variable matrix Z is used in a two-stage least squares regression (TSLS) to derive consistent estimators. However, since the correction term introduced by van Garderen (2001) largely depends on the variance of the estimator, the distribution of $b_{TSLS}$ is used to derive a new correction term, analogous to the correction term previously derived. The predictor for Y with the new correction term is defined as:

$$\hat{Y}_{p,IV}^* = \exp\{x_p'\hat{\beta}_{IV}\}\;{}_0F_1\!\left(m;\ \tfrac{m}{2}\left(1 - x_p'(\hat{X}'\hat{X})^{-1}x_p\right)\hat{\sigma}^2_{IV}\right),$$

where $m = (T-K)/2$ and ${}_0F_1$ is a hypergeometric function.

This paper makes two contributions to previous work. First, it has been shown that the MSPE of this predictor is systematically lower than that of the deterministic predictor $\hat{Y}_{p,IV} = \exp\{x_p'\hat{\beta}_{IV}\}$.

Second, the influence of the degree of endogeneity has been investigated. It turns out that for low levels of endogeneity the MSPE of the predictor with correction term is smaller than without correction term, though only slightly. However, when the variables X are strongly endogenous there is a large difference between the MSPE of the derived predictor and that of the deterministic predictor. The newly derived predictor is far more accurate than the deterministic predictor, and this effect increases when the severity of endogeneity increases.

It is concluded that if the variables X are endogenous, adding the correction factor to the predictor makes the predictor more accurate. When the degree of endogeneity increases, the necessity of using the correction term increases as well. The correction strongly reduces the MSPE, which stays much closer to the theoretical lowest possible value of the MSPE.

6. Discussion

In this research paper the influence of the degree of endogeneity on predictors for Y in the loglinear model is examined. The exact optimal unbiased predictor of van Garderen (2001) was derived under various assumptions, and in this research only the exogeneity condition has been loosened. Further research should determine how well this correction term works if one of the other Gauss-Markov assumptions is loosened, such as the assumption of homoskedasticity or the assumption that there is no serial correlation.

Within the scope of loosening the assumption of exogeneity, further research can be done to determine the influence of the use of weak instruments. In this paper a correlation matrix Π is used with a value of 0.9 on the diagonal. This implies a correlation of 0.9 between all the variables $X_i$ and $Z_i$. This assumption is rather strong and might not be realistic for real data. Previous research showed that finding strong instruments can be a great challenge for specific datasets. Therefore it would be interesting to determine the added value of the correction term when the instruments fail to have a high level of correlation with the variables X.

Finally, further research could be done to determine an exact unbiased predictor in loglinear IV-estimation. This paper shows that the exact unbiased predictor from the loglinear model with OLS estimation is useful in IV-estimation. However, an exact solution for the predictor when the variables are endogenous in the loglinear model is yet to be derived.


References

Angrist, J. D., & Krueger, A. B. (1991). Does compulsory school attendance affect schooling and earnings? The Quarterly Journal of Economics, 106(4), 979-1014.

Hahn, J., & Hausman, J. (2002). A new specification test for the validity of instrumental variables. Econometrica, 70(1), 163-189.

Hahn, J., & Hausman, J. (2005). Estimation with valid and invalid instruments. Annales d'Économie et de Statistique, 25-57.

Hausman, J. A. (1978). Specification tests in econometrics. Econometrica: Journal of the Econometric Society, 1251-1271.

Heij, C., De Boer, P., Franses, P. H., Kloek, T., & Van Dijk, H. K. (2004). Econometric methods with applications in business and economics. Oxford University Press.

Rothenberg, T. J. (1983). Asymptotic properties of some estimators in structural models. In S. Karlin, T. Amemiya, & L. Goodman (Eds.), Studies in Econometrics, Time Series, and Multivariate Statistics. Orlando: Academic Press.

Van Garderen, K. J. (2001). Optimal prediction in loglinear models. Journal of Econometrics, 104(1), 119-140.
