Prediction from the one-way error components regression model : forecasting lottery sales and housing prices


BSc Econometrics and Operational Research University of Amsterdam

Supervisor: Andrew Pua

David Bultsma 10200800 BSc Thesis

1. Introduction
2. Theoretical setting
2.1 Best linear unbiased prediction in the generalized linear regression model
2.2 The panel data regression model
2.3 Prediction from the regression model with one-way error components
3. Research setup
3.1 Lottery sales
3.2 Housing prices
4. Results
4.1 Lottery sales
4.2 Housing prices
5. Concluding remarks
6. References

1. Introduction

Forecasting sales is an important task for every company. Managers need forecasts to assess the company's position in the market and to decide how much production and advertising will be required in the future. Accurate forecasting also helps in the budget planning process. After collecting data on variables potentially related to sales, different models and predictors can be constructed to predict future sales. One could use cross-sectional data or take a time-series approach.

For this thesis I use panel data to forecast future lottery sales and housing prices. A panel data regression differs from a regular time-series or cross-section regression in that the variables are observed over two dimensions, so that the linear regression model can be written as:

$$ y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it}, \quad i = 1,\dots,N; \; t = 1,\dots,T \tag{1.1} $$

with i denoting firms, individuals, or any other cross-sectional unit and t denoting time. The scalar $\alpha_i$ varies according to subject, $\beta$ is $K \times 1$, $x_{it}$ is the observation on K explanatory variables for unit i at time t and $\varepsilon_{it}$ is the disturbance.

A cross-sectional or simple time-series approach alone does not make full use of the available data. Baltagi (1995) lists several benefits of using panel data. First, panel data make it possible to control for time-invariant variables, which is not possible at all in a pure time-series or pure cross-section study; omitting these variables may bias the resulting estimates. Second, panel data allow us to identify and measure effects that are not detectable in pure cross-section or pure time-series data. Third, panel data may provide more informative data, more variability, less collinearity among the variables, more degrees of freedom and more efficiency.

Several predictors have been investigated for different linear regression models. Goldberger (1962) derives the best linear unbiased predictor in the generalized regression model. Baillie and Baltagi (1999) investigate the efficiency of alternative predictors in the context of the regression model with one-way error components. They compare a number of predictors, including models estimated by OLS and fixed effects (FE), and find that forecasts based on FE are preferred. Fiebig and Johar (2014) provide further comparisons of alternative predictors in the context of risk prediction in health economics. Frees and Miller (2004) consider forecasting the sales of state lottery tickets from fifty postal (ZIP) codes in Wisconsin. They analyze lottery sales over a 40-week period, April 1998 through January 1999, and find that an error components model with an AR(1) term performs best in their forecasting exercise.

For this thesis two data sets are used. The first data set is the one used by Frees and Miller (2004). Since all explanatory variables in this data set are constant over time, a second data set, discussed by Frees (2004), is used in which the variables show more variation over time. This data set contains annual observations of housing prices and explanatory variables from 36 metropolitan statistical areas (MSAs) over the nine-year period 1986-1994. Several predictors previously investigated by Baillie and Baltagi (1999) are applied to forecast the sales of lottery tickets and housing prices. The aim is to find the best predictor for these two applications.

The rest of this thesis is organized as follows. Section 2 contains a discussion of the predictors used for forecasting. Section 3 contains a discussion of the datasets used and how the performance of the predictors is evaluated. Results are presented in section 4. Section 5 gives concluding remarks.


2. Theoretical setting

This section discusses the derivation of the best linear unbiased predictor (BLUP) in the generalized linear regression model by Goldberger (1962), the one-way error component regression model as discussed by Baltagi (1995) as well as several predictors that have been investigated by Baillie and Baltagi (1999) and are used for this thesis.

2.1 Best linear unbiased prediction in the generalized linear regression model

Goldberger (1962) derived the BLUP in the generalized linear regression model. The model and the corresponding assumptions are given by:

$$ y = X\beta + u \tag{2.1} $$

$$ E(u) = 0 \tag{2.2} $$

$$ E(uu') = \Omega \tag{2.3} $$

with $y$ a vector of regressand observations, $X$ a matrix of regressor observations of rank K, $\beta$ a vector of regression coefficients, $u$ a vector of disturbances and $\Omega$ the positive-definite variance-covariance matrix of the disturbances.

When the disturbances in a regression model are interdependent (assumption 2.3), the pattern of sample residuals contains information that is useful for prediction. The BLUP derived by Goldberger (1962) takes advantage of this interdependence.

An actual drawing of the regressand given the vector of regressors is given by:

$$ y_* = x_*'\beta + u_* \tag{2.4} $$

with $y_*$ the scalar value of the regressand, $x_*$ the vector of prediction regressors and $u_*$ the scalar value of the prediction disturbance.

In view of the interdependence of disturbances in the sample, it is in general not reasonable to assume that the prediction disturbance is independent of the sample disturbances. Goldberger (1962) gives the following set of assumptions, which allows the prediction disturbance to be correlated with the sample disturbances:

$$ E(u_*) = 0 \tag{2.5} $$

$$ E(u_*^2) = \sigma_*^2 \tag{2.6} $$

$$ E(u_* u) = w \tag{2.7} $$

with $w$ the vector of covariances of the prediction disturbance with the vector of sample disturbances.

The goal is to derive the best linear unbiased predictor of $y_*$, that is, a predictor of the form

$$ p = c'y \tag{2.8} $$

with $c$ a vector of constants, such that the prediction variance is at a minimum and the linear predictor is unbiased. The problem to be solved can therefore be written as:

Minimize

$$ \sigma_p^2 = E(p - y_*)^2 \tag{2.9} $$

subject to

$$ E(p - y_*) = 0 \tag{2.10} $$

Using (2.1), (2.4) and (2.8) we can write

$$ p - y_* = (c'X - x_*')\beta + c'u - u_* \tag{2.11} $$

Because of the unbiasedness condition (2.10), the first term on the right must be equal to zero, that is, $c'X = x_*'$. Using (2.3), (2.6) and (2.7) the prediction variance can be written as:

$$ \sigma_p^2 = c'\Omega c + \sigma_*^2 - 2c'w \tag{2.12} $$

This is a constrained minimization problem that is solved by Lagrange's method. To minimize (2.12) subject to (2.10), the Lagrange function is

$$ L(c, \lambda) = c'\Omega c + \sigma_*^2 - 2c'w - 2(c'X - x_*')\lambda \tag{2.13} $$

Differentiating with respect to $c$ and $\lambda$, with $\lambda$ a vector of Lagrangian multipliers, solving for $c$ and substituting the solution in (2.8), we find

$$ \hat p = x_*'\hat\beta + w'\Omega^{-1}\hat u \tag{2.14} $$

with $\hat u = y - X\hat\beta$ the vector of sample residuals from the generalized least squares regression and $\hat\beta = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$ the generalized least squares estimator of $\beta$.
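The intermediate steps between the Lagrange function and the predictor can be written out as follows. This is a sketch in Goldberger's (1962) notation, which is assumed here: $c$ is the vector of predictor weights, $\lambda$ the vector of multipliers, $\Omega$ the disturbance covariance matrix, $w$ the covariance vector between the prediction disturbance and the sample disturbances, and $x_*$ the vector of prediction regressors.

```latex
\begin{align*}
L(c,\lambda) &= c'\Omega c + \sigma_*^2 - 2c'w - 2(c'X - x_*')\lambda \\
\frac{\partial L}{\partial c} = 0 &\;\Rightarrow\; \Omega c = w + X\lambda \\
\frac{\partial L}{\partial \lambda} = 0 &\;\Rightarrow\; X'c = x_* \\
\lambda &= (X'\Omega^{-1}X)^{-1}\,(x_* - X'\Omega^{-1}w) \\
c &= \Omega^{-1}w + \Omega^{-1}X\,(X'\Omega^{-1}X)^{-1}\,(x_* - X'\Omega^{-1}w) \\
c'y &= x_*'\hat\beta + w'\Omega^{-1}(y - X\hat\beta),
\qquad \hat\beta = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y
\end{align*}
```

The last line shows that the predictor is the usual GLS prediction plus a correction that weights the GLS residuals by their covariance with the prediction disturbance.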

2.2 The panel data regression model

The variables in a panel data regression model are observed over two dimensions. Baillie and Baltagi (1999) write the model as:

$$ y_{it} = x_{it}'\beta + u_{it}, \quad i = 1,\dots,N; \; t = 1,\dots,T \tag{2.15} $$

with i denoting the cross-section dimension, for instance firms, individuals, countries, or any other cross-sectional unit, and t denoting the time-series dimension. The dimension of $\beta$ is $K \times 1$ and $x_{it}$ is the observation on K explanatory variables for unit i at time t. The disturbances are assumed to follow a one-way error component model:

$$ u_{it} = \mu_i + \nu_{it} \tag{2.16} $$

where $\mu_i$ denotes the unobservable individual specific effect and is assumed to be NID(0, $\sigma_\mu^2$). $\nu_{it}$ denotes the remainder disturbance and is also assumed to be NID(0, $\sigma_\nu^2$). $\mu_i$ is time-invariant and accounts for any individual specific effect that is not included in the regression. The remainder disturbance varies with the cross-sectional unit and time and can be thought of as the usual disturbance in the regression. The variance-covariance matrix of the disturbances can be written as

$$ \Omega = E(uu') = \sigma_\mu^2 (I_N \otimes J_T) + \sigma_\nu^2 (I_N \otimes I_T) \tag{2.17} $$

where $J_T = \iota_T \iota_T'$, $I_N$ is an identity matrix of dimension N and $\iota_T$ is a vector of ones of dimension T. $\otimes$ is the notation for the Kronecker product.
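The block structure of this variance-covariance matrix can be illustrated with a minimal Python sketch. It assumes observations are stacked unit by unit (all T periods of unit 1, then unit 2, and so on); the function name `one_way_omega` and the numerical values are hypothetical and chosen only for illustration.

```python
def one_way_omega(n_units, n_periods, sigma_mu2, sigma_nu2):
    """Build sigma_mu2 * (I_N kron J_T) + sigma_nu2 * I_NT as nested lists.

    Observations are assumed to be stacked unit by unit, so row index
    r belongs to unit r // n_periods.
    """
    size = n_units * n_periods
    omega = [[0.0] * size for _ in range(size)]
    for row in range(size):
        for col in range(size):
            if row // n_periods == col // n_periods:
                # Same unit: the individual effect is shared across periods.
                omega[row][col] += sigma_mu2
            if row == col:
                # Remainder disturbance only contributes on the diagonal.
                omega[row][col] += sigma_nu2
    return omega


# Hypothetical example: N = 2 units, T = 3 periods.
omega = one_way_omega(n_units=2, n_periods=3, sigma_mu2=4.0, sigma_nu2=1.0)
for matrix_row in omega:
    print(matrix_row)
```

The printed matrix is block diagonal: within a unit every pair of periods shares the covariance of the individual effect, while disturbances of different units are uncorrelated.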

2.3 Prediction from the regression model with one-way error components

Baillie and Baltagi (1999) apply the BLUP as derived above to the one-way error components case, which is a special case of the generalized linear regression model. They compare the efficiency of this predictor with three alternative predictors and derive the asymptotic mean squared error (AMSE) for the predictors. The four predictors that are considered are:

(1) An ordinary predictor based on the form of the optimal predictor, but with MLEs replacing population parameters.

(2) A truncated predictor that ignores the error component correction but uses MLEs for its regression parameters.

(3) A misspecified predictor which uses OLS estimates of the regression parameters.

(4) A fixed effects predictor which assumes that the individual effects are fixed parameters that can be estimated.

From Goldberger (1962) the form of this optimal predictor, with known parameters ( ) and known future exogenous variables , is given by

(2.18)

where is the optimal prediction of the ith component at time T with an s step ahead forecast horizon, and


(2.19)

(2.20)

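For the case where the true variance components are known, Wansbeek and Kapteyn (1978) and Taub (1979) have applied Goldberger's (1962) result to show that the BLUP for the ith individual s periods ahead is of the form

$$ \hat y_{i,T+s} = x_{i,T+s}'\hat\beta_{GLS} + \frac{T\sigma_\mu^2}{T\sigma_\mu^2 + \sigma_\nu^2}\,\bar{\hat u}_{i.} \tag{2.21} $$

where $\hat\beta_{GLS} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y$ is the GLS estimator of $\beta$ based on the true variance components and $\bar{\hat u}_{i.} = \frac{1}{T}\sum_{t=1}^{T}(y_{it} - x_{it}'\hat\beta_{GLS})$ is the average of the GLS residuals over time for the ith individual. Here $y = (y_{11}, \dots, y_{1T}, \dots, y_{N1}, \dots, y_{NT})'$ and X is the NT x K matrix with its rows ordered in the same way as y. $\Omega^{-1}$ is of the form

$$ \Omega^{-1} = \frac{1}{T\sigma_\mu^2 + \sigma_\nu^2}\,P + \frac{1}{\sigma_\nu^2}\,Q \tag{2.22} $$

where $P = I_N \otimes (\iota_T \iota_T'/T)$ and $Q = I_{NT} - P$. $I_N$ is the identity matrix of dimension N and $\iota_T$ is a vector of ones of dimension T.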

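A feasible version of the ordinary predictor is obtained by substituting the MLEs of β and θ into (2.18) and therefore becomes

$$ \hat y_{i,T+s} = x_{i,T+s}'\hat\beta_{MLE} + \frac{T\hat\sigma_\mu^2}{\hat\sigma_1^2}\,\bar{\hat u}_{i.} \tag{2.23} $$

where $\hat\sigma_1^2 = T\hat\sigma_\mu^2 + \hat\sigma_\nu^2$, $\hat\sigma_\mu^2$ and $\hat\sigma_\nu^2$ are the MLEs of $\sigma_\mu^2$ and $\sigma_\nu^2$, and $\bar{\hat u}_{i.}$ is the average over time of the residuals $y_{it} - x_{it}'\hat\beta_{MLE}$ for the ith individual.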
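The correction applied by the ordinary predictor can be sketched in a few lines of Python. The sketch assumes the standard form of the one-way error component BLUP, in which the regression forecast is adjusted by the fraction Tσ²_μ / (Tσ²_μ + σ²_ν) of the individual's mean residual. The function name, argument names and numbers are hypothetical, and the coefficient and variance estimates are taken as given rather than computed by maximum likelihood.

```python
def ordinary_forecast(x_future, beta, residuals_i, sigma_mu2, sigma_nu2):
    """Regression forecast plus the error component correction for one unit.

    x_future    -- regressors of the unit in the forecast period
    beta        -- estimated regression coefficients (taken as given)
    residuals_i -- the unit's T in-sample residuals y_it - x_it' beta
    sigma_mu2   -- (estimated) variance of the individual effect
    sigma_nu2   -- (estimated) variance of the remainder disturbance
    """
    n_periods = len(residuals_i)
    sigma_1_sq = n_periods * sigma_mu2 + sigma_nu2
    weight = n_periods * sigma_mu2 / sigma_1_sq
    mean_residual = sum(residuals_i) / n_periods
    regression_part = sum(x * b for x, b in zip(x_future, beta))
    return regression_part + weight * mean_residual


# Hypothetical numbers: T = 4 residuals with mean 1.0, weight 4 / (4 + 4) = 0.5.
forecast = ordinary_forecast(
    x_future=[1.0, 2.0],
    beta=[0.5, 0.25],
    residuals_i=[0.5, 1.0, 1.5, 1.0],
    sigma_mu2=1.0,
    sigma_nu2=4.0,
)
# Setting sigma_mu2 to zero removes the correction and leaves x' beta only.
no_correction = ordinary_forecast([1.0, 2.0], [0.5, 0.25], [0.5, 1.0, 1.5, 1.0], 0.0, 4.0)
print(forecast, no_correction)
```

The weight approaches one when the individual effect dominates the remainder disturbance or when T is large, so the correction then uses almost the full mean residual of the unit.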

The prediction asymptotic mean squared error (AMSE) can be defined as

(2.24)

where

(2.25)


(2.27)

(2.28)

The truncated predictor uses MLEs for its regression parameters, but ignores the error component correction, that is, the contribution of autocorrelation to the predictor. The predictor is of the form

$$ \hat y_{i,T+s} = x_{i,T+s}'\hat\beta_{MLE} \tag{2.29} $$

The AMSE can be defined as

(2.30)

The predictor based on inefficient OLS estimates of the regression parameters is of the form

$$ \hat y_{i,T+s} = x_{i,T+s}'\hat\beta_{OLS} \tag{2.31} $$

where $\hat\beta_{OLS} = (X'X)^{-1}X'y$ is the least squares estimator. The autocorrelated error components are ignored in both estimation and formation of the predictor, so the extra term in (2.18) is dropped by this predictor.

The AMSE can be defined as

and

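For the fixed effects predictor it is assumed that the $\mu_i$'s as defined in (2.16) are fixed parameters to be estimated. The predictor is of the form

$$ \hat y_{i,T+s} = x_{i,T+s}'\tilde\beta + \tilde\mu_i \tag{2.32} $$

where $\tilde\beta = (X'QX)^{-1}X'Qy$ and $\tilde\mu_i = \bar y_{i.} - \bar x_{i.}'\tilde\beta$, with $\bar y_{i.}$ and $\bar x_{i.}$ the averages over time of $y_{it}$ and $x_{it}$ respectively and Q as defined for (2.22). The AMSE can be defined as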
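A small Python sketch may help to see how the fixed effects predictor works. It assumes a single regressor and the usual within (demeaning) estimator; the function name, the data layout (dictionaries keyed by unit) and the numbers are hypothetical. The explicit check on the denominator also shows why a regressor without variation over time cannot be used by this predictor.

```python
def fixed_effects_forecast(y, x, x_future):
    """Within estimator and fixed effects forecast for one regressor.

    y, x     -- dict mapping each unit to its list of T observations
    x_future -- dict mapping each unit to its regressor value in T + s
    Returns (forecasts, beta, effects).
    """
    y_bar = {i: sum(v) / len(v) for i, v in y.items()}
    x_bar = {i: sum(v) / len(v) for i, v in x.items()}
    # Slope from deviations around each unit's own time average.
    numerator = sum(
        (xv - x_bar[i]) * (yv - y_bar[i])
        for i in y
        for xv, yv in zip(x[i], y[i])
    )
    denominator = sum((xv - x_bar[i]) ** 2 for i in x for xv in x[i])
    if denominator == 0:
        raise ValueError("regressor has no variation over time within units")
    beta = numerator / denominator
    # Estimated individual effects: unit mean of y minus fitted unit mean.
    effects = {i: y_bar[i] - beta * x_bar[i] for i in y}
    forecasts = {i: beta * x_future[i] + effects[i] for i in y}
    return forecasts, beta, effects


# Hypothetical data generated without noise: y = effect + 2 * x.
x = {"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0]}
y = {"a": [12.0, 14.0, 16.0], "b": [-1.0, 3.0, 7.0]}
forecasts, beta, effects = fixed_effects_forecast(y, x, {"a": 4.0, "b": 8.0})
print(beta, effects, forecasts)
```

Because the slope is identified only from deviations around unit means, any regressor that is constant over time is wiped out by the demeaning step.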


(2.33)


3. Research setup

For this thesis the different predictors for the one-way error component model as discussed in section 2 are applied to predict the sales of lottery tickets in Wisconsin and housing prices in US metropolitan areas. After the predictors are used for making forecasts they are compared to see which one performs best. This section discusses the data sets that are used.

3.1 Lottery sales

State of Wisconsin lottery administrators provided weekly lottery sales data. The data set consists of the lottery sales over a 40-week period, April 1998 through January 1999, from 50 ZIP codes within the state of Wisconsin. Lottery tickets are generally priced at $1.00, so the number of tickets sold equals the lottery revenue in dollars. Many explanatory variables affect sales volume and can be used in developing models for lottery sales.

The data set contains the following variables:

- Online lottery sales to individual consumers. (OLSALES)
- Number of listed retailers. (NRETAIL)
- Persons per household. (PERPERHH)
- Median years of schooling. (MEDSCHYR)
- Median home value in $1000s for owner-occupied homes. (MEDHVL)
- Percent of housing that is renter occupied. (PRCRENT)
- Percent of population that is 55 or older. (PRC55P)
- Household median age. (HHMEDAGE)
- Estimated median household income in $1000s. (MEDINC)
- Population in thousands. (POPULATN)

Before estimation, the variables PERPERHH, MEDSCHYR, MEDHVL and MEDINC are multiplied by 10.

A higher number of retailers could, for instance, lead to higher sales. People who are less educated and have a lower income could be more likely to buy lottery tickets than highly educated people with a higher income. When lottery jackpots increase, ticket sales could increase as well. Around weeks 8 and 18, the jackpot of one online game, PowerBall, grew to more than $100 million. The lottery sales showed a large increase in these weeks, as can be seen in the following figure containing the first 25 ZIP codes.

Figure 1. Time series plot of logarithmic sales (LNZOLSALES) against week number for the first 25 ZIP codes.

Frees and Miller (2004) use the first 35 weeks for estimation and the remaining 5 weeks to assess the validity of model forecasts. For this thesis, the data are divided into two subsamples: the first 17 weeks, which include the remarkable increase in week 8, are used to estimate the model parameters, and week 18 is used for prediction with the estimated model.

Two statistics are used to evaluate how well the four predictors forecast the large increase in sales in week 18: the mean absolute error (MAE) and the mean absolute percentage error (MAPE). When week 18 is used for prediction, the MAE and MAPE can be defined as follows:

$$ \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left| y_{i,18} - \hat y_{i,18} \right| \tag{3.1} $$

$$ \mathrm{MAPE} = \frac{100}{N}\sum_{i=1}^{N}\left| \frac{y_{i,18} - \hat y_{i,18}}{y_{i,18}} \right| \tag{3.2} $$

with N the number of ZIP codes, $y_{i,18}$ the observed value and $\hat y_{i,18}$ the forecast for ZIP code i in week 18.
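The two evaluation statistics can be computed with a few lines of Python. This is a minimal sketch assuming the usual definitions, with the MAPE expressed in percent; the function names and the example values are hypothetical and are not taken from the lottery data.

```python
def mean_absolute_error(actual, forecast):
    """Average absolute difference between actual values and forecasts."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mean_absolute_percentage_error(actual, forecast):
    """Average absolute error relative to the actual value, in percent."""
    relative_errors = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(relative_errors) / len(actual)


# Hypothetical actual values and forecasts for two cross-sectional units.
actual = [4.0, 8.0]
forecast = [5.0, 6.0]
mae = mean_absolute_error(actual, forecast)              # (1 + 2) / 2
mape = mean_absolute_percentage_error(actual, forecast)  # 100 * (0.25 + 0.25) / 2
print(mae, mape)
```

The MAE weights every unit by its absolute error, whereas the MAPE scales each error by the level of the actual value, which is why the two statistics can rank predictors differently.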

3.2 Housing prices

This data set is discussed by Frees (2004). The prices of houses are influenced by demand-side factors and supply-side factors. Demand-side factors are factors such as income and demographic variables. Supply-side factors such as the regulatory environment of a metropolitan area may also be important.

The data consist of annual observations from 36 metropolitan statistical areas (MSAs) over the nine-year period 1986-1994. The response variable is NARSP, the logarithm of an MSA's average sale price based on transactions reported through the Multiple Listing Service, National Association of Realtors. The following data are used:

Response variable

- NARSP. An MSA's average sale price, in logarithmic units. It is based on transactions reported through the Multiple Listing Service.

Demand side explanatory variables

- PERYPC. Annual percentage growth of per capita income.

- PERPOP. Annual percentage growth of population.

Supply side explanatory variables

- REGTEST. Regulatory index constructed by Malpezzi (1996).

- RCDUM. Rent control dummy variable constructed by Malpezzi (1996).

- SREG1. Sum of American Institute of Planners state regulatory questions regarding use of environmental planning and management. Additive index with a range from 0 to 8 constructed by Malpezzi (1996).

- AJWTR. Indicates whether the MSA is adjacent to a coastline.

- AJPARK. Indicates whether the MSA is adjacent to one or more large parks, military bases or reservations.

The following figure shows the response (NARSP) over time for the first 18 MSAs.

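Figure 2. Time series plot of NARSP against year number for the first 18 MSAs.

Again the data are divided into two subsamples: the first 8 years are used to estimate the model parameters and year 9 is used for prediction with the estimated model. The performance is evaluated by comparing the MAE and MAPE for the four predictors. The MAE and MAPE can be defined as follows:

$$ \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left| y_{i,9} - \hat y_{i,9} \right| \tag{3.3} $$

$$ \mathrm{MAPE} = \frac{100}{N}\sum_{i=1}^{N}\left| \frac{y_{i,9} - \hat y_{i,9}}{y_{i,9}} \right| \tag{3.4} $$

with N the number of MSAs, $y_{i,9}$ the observed value and $\hat y_{i,9}$ the forecast for MSA i in year 9.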

4. Results

4.1 Lottery sales

The response is logarithmic sales. Estimates are based on the first 17 weeks and are used to predict week 18. Since none of the explanatory variables changes over time, they are all dropped for the fixed effects predictor. Based on the MAE the ordinary predictor performs best, and based on the MAPE the OLS predictor performs best.

Explanatory variables,   OLS               Truncated         Ordinary          Fixed effects
MAE, MAPE                predictor         predictor         predictor         predictor
NRETAIL                   0.024 (0.006)     0.038 (0.019)     0.038 (0.019)    -
PERPERHH                 -0.110 (0.024)    -0.104 (0.075)    -0.104 (0.075)    -
MEDSCHYR                 -0.080 (0.010)    -0.073 (0.032)    -0.073 (0.032)    -
PRCRENT                   0.031 (0.006)     0.032 (0.018)     0.032 (0.018)    -
PRC55P                   -0.074 (0.020)    -0.074 (0.063)    -0.074 (0.063)    -
HHMEDAGE                  0.126 (0.032)     0.125 (0.098)     0.125 (0.098)    -
POPULATN                  0.056 (0.009)     0.037 (0.028)     0.037 (0.028)    -
MEDHVL                    0.001 (0.000)     0.002 (0.001)     0.002 (0.001)    -
MEDINC                    0.004 (0.001)     0.004 (0.002)     0.004 (0.002)    -
CONSTANT                 13.543 (2.018)    12.337 (6.260)    12.337 (6.260)     7.864 (0.020)
MAE                       1.371             1.370             1.366             1.372
MAPE                     14.432            14.445            14.996            15.086
                         -                 -                  0.953            -

Table 1. Estimates of the parameters, MAE and MAPE for each predictor. Estimates are based on the first 17 weeks. The response is natural logarithmic sales.

The following figures show the forecasts, actual values and 95% forecast intervals for each ZIP code. The forecast intervals are of the form

$$ \hat y_{i,T+s} \pm 1.96\sqrt{\mathrm{AMSE}} $$

with AMSE the asymptotic mean squared error of each predictor as defined in section 2.3. The unknown parameters are replaced by their estimates. A red dot indicates the forecast value and a green dot indicates the actual value observed.

Figure 3. OLS predictor. Forecast intervals, forecast value and actual value of logarithmic sales for each ZIP code.

Figure 4. Truncated predictor. Forecast intervals, forecast value and actual value of logarithmic sales for each ZIP code.

Figure 5. Ordinary predictor. Forecast intervals, forecast value and actual value of logarithmic sales for each ZIP code.

Figure 6. Fixed effects predictor. Forecast intervals, forecast value and actual value of logarithmic sales for each ZIP code.

4.2 Housing prices

Estimates are based on the first 8 years and are used to predict year 9. Since none of the explanatory variables except PERYPC and PERPOP changes over time, all other explanatory variables are dropped for the fixed effects predictor. Based on both the MAE and the MAPE, the ordinary predictor performs best.

Explanatory variables,   OLS               Truncated         Ordinary          Fixed effects
MAE, MAPE                predictor         predictor         predictor         predictor
PERYPC                   -0.026 (0.006)    -0.025 (0.003)    -0.025 (0.003)    -0.025 (0.003)
PERPOP                    0.017 (0.010)    -0.005 (0.006)    -0.005 (0.006)    -0.007 (0.006)
REGTEST                   0.034 (0.004)     0.035 (0.011)     0.035 (0.011)    -
RCDUM                     0.222 (0.049)     0.194 (0.117)     0.194 (0.117)    -
SREG1                     0.057 (0.009)     0.059 (0.022)     0.059 (0.022)    -
AJPARK                    0.143 (0.043)     0.157 (0.104)     0.157 (0.104)    -
AJWTR                     0.061 (0.029)     0.055 (0.071)     0.055 (0.071)    -
CONSTANT                  3.593 (0.091)     3.598 (0.217)     3.598 (0.217)     4.583 (0.016)
MAE                       0.181             0.183             0.137             0.139
MAPE                      3.850             3.895             2.986             3.009
                         -                 -                  0.957            -

Table 2. Estimates of the parameters, MAE and MAPE for each predictor. Estimates are based on the first 8 years. The response is NARSP.

The following figures show the forecasts, real values and forecast intervals for each MSA. Again a red dot indicates the forecast value and a green dot indicates the actual value.

Figure 7. OLS predictor. Forecast intervals, forecast value and actual value of NARSP for each MSA.

Figure 8. Truncated predictor. Forecast intervals, forecast value and actual value of NARSP for each MSA.

Figure 9. Ordinary predictor. Forecast intervals, forecast value and actual value of NARSP for each MSA.

Figure 10. Fixed effects predictor. Forecast intervals, forecast value and actual value of NARSP for each MSA.


5. Concluding remarks

For this thesis, the OLS predictor, truncated predictor, ordinary predictor and fixed effects predictor as discussed by Baillie and Baltagi (1999) were used for forecasting lottery sales and housing prices. In the lottery sales application, the sales in weeks 8 and 18 were remarkably high compared to the sales in the other weeks. The aim was to determine which predictor best forecasts the large increase in week 18 when the previous increase in week 8 is included in the estimation sample. The performance of the four predictors was evaluated by comparing their MAE and MAPE; the predictor with the smallest value of a statistic is considered best according to that statistic. Based on the MAE the ordinary predictor is best for forecasting the sales in week 18, while the OLS predictor performs best based on the MAPE. The forecast intervals for the OLS predictor are wider than the forecast intervals for the ordinary predictor.

For the housing prices application the first 8 years were used for estimation and year 9 was used for forecasting. Based on the MAE and MAPE the ordinary predictor performs best.


6. References

Baillie, R.T. and B.H. Baltagi (1999), "Prediction from the Regression Model with One-Way Error Components," in C. Hsiao, L.F. Lee and H. Pesaran (eds.), Analysis of Panels and Limited Dependent Variable Models, Cambridge University Press, 225-267.

Baltagi, B.H. (1995), Econometric Analysis of Panel Data, Chichester: John Wiley and Sons.

Fiebig, D.G. and M. Johar (2014), "Forecasting with Micro Panels: The Case of Health Care Costs."

Frees, E.W. (2004), Longitudinal and Panel Data: Analysis and Applications in the Social Sciences, New York: Cambridge University Press.

Frees, E.W. and T.W. Miller (2004), "Sales Forecasting Using Longitudinal Data Models," International Journal of Forecasting, 20: 99-114.

Goldberger, A.S. (1962), "Best Linear Unbiased Prediction in the Generalized Linear Regression Model," Journal of the American Statistical Association, 57: 369-375.

Malpezzi, S. (1996), "Housing Prices, Externalities, and Regulation in U.S. Metropolitan Areas," Journal of Housing Research, 7(2): 209-241.

Taub, A.J. (1979), "Prediction in the Context of the Variance-Components Model," Journal of Econometrics, 10: 103-107.

Wansbeek, T. and A. Kapteyn (1978), "The Separation of Individual Variation and Systematic Change in the Analysis of Panel Data," Annales de l'INSEE, 30-31: 659-680.
