Double selection approach for inference about model parameters in the case of high-dimensional data


Faculty of Economics and Business, Amsterdam School of Economics

3rd Year BSc Econometrics and Operational Research

2nd Semester Bachelor Graduation Seminar Econometrics

Tutor: Dr. N.P.A. van Giersbergen Bachelor Thesis


Rik van der Woerdt – 10672761

25 June 2016

Abstract

Data are generated at an unprecedented pace. As a result, high-dimensional data sets, data sets with a large number of variables relative to the number of observations, are becoming increasingly common. In search of statistical techniques that allow for reliable inference in the case of unknown economic relationships, which high-dimensional data sets likely contain, Belloni, Chernozhukov and Hansen designed the double selection procedure. This study applies their procedure to the work of Collins and Shester on the effects of the United States urban renewal program of 1949. The results show that their model specification based on instrumental variables (IV) is possibly incorrect, suffering from overfitting and omission of non-linear effects. It is concluded that the double selection procedure’s main strength comes from its ability to find and include non-obvious variables in the structural model.


Statement of Originality

I, Rik van der Woerdt, hereby declare that I have written this thesis myself and that I take full responsibility for its contents.

I confirm that the text and the work presented in this thesis are original and that I have made no use of sources other than those mentioned in the text and in the references.

The Faculty of Economics and Business is responsible solely for the supervision up to the submission of the thesis, not for its contents.


1 Introduction

According to IBM (2016), ninety percent of the data in the world today was created during the last two years. With so much data available, it is not unreasonable to assume that it contains information on relationships of which we, humans, are not yet aware. It is therefore desirable to have efficient algorithms to determine these relationships accurately. Traditional estimation techniques, such as ordinary least squares (OLS), assume that the researcher knows the exact functional form of the relationship and is solely interested in the sizes and signs of the parameters. However, when the relationship is unknown and a researcher has access to many potential explanatory variables, possibly even more than the number of observations, he or she is dealing with high-dimensional data and the traditional techniques no longer suffice.

Belloni, Chernozhukov and Hansen (2014a) give two reasons why existing statistical techniques are insufficient in this case. Firstly, they argue that standard estimation techniques, such as OLS, will lead to poor forecasts due to overfitting. Secondly, they claim that other statistical methods, specifically designed for dealing with high-dimensional data, tend to do a good job at prediction, but lead to incorrect conclusions about regression coefficients due to variable selection errors (Belloni, Chernozhukov, & Hansen, 2014a, p. 30). In response to these issues, they “provide an overview of innovations in data mining¹ which can be modified to provide high-quality inference about model parameters”. This paper zooms in on their most generic modification, the double selection procedure.

As the statement above suggests, Belloni, Chernozhukov and Hansen (2014a) designed the double selection procedure to provide inference about model parameters when the true economic relationship is unknown. The procedure, in simplified terms, is an adapted instrumental variables (IV) model, using the Least Absolute Shrinkage and Selection Operator (LASSO) to reduce the dimensions of both the control variables and instruments.² If the technique indeed enables reliable inference about model parameters in the case of unknown economic relationships, the double selection approach significantly contributes to the possibility of analysing high-dimensional data sets.

¹ Definition of “data mining”: a principled search for “true” predictive power that guards against false discovery and overfitting, does not erroneously equate in-sample fit to out-of-sample predictive ability, and accurately accounts for using the same data to examine many different hypotheses or models (Belloni, Chernozhukov, & Hansen, 2014a, p. 30).

To assess the usefulness and accuracy of the double selection method, two steps are taken. As a first step, the results of one of the empirical examples used by Belloni, Chernozhukov and Hansen (2014a), Acemoglu, Johnson and Robinson’s (2012) research, are reproduced and, where appropriate, enhanced. As a second step, the double selection procedure is applied to another empirical study where instrumental variables were originally used, ‘Slum Clearance and Urban Renewal in the United States’, by Collins and Shester (2013). This study therefore answers the question: To what extent can the double selection procedure provide additional insights about the relationship between federal development programs and local prosperity, originally estimated by Collins and Shester (2013)?

The following chapter provides additional details about the double selection approach and gives further information on the two empirical case studies. The chapter thereafter, chapter 3, outlines the research setup and the models to be estimated. The results and their analysis are given in chapter 4, followed by the conclusion in chapter 5.

2 Inference amongst high-dimensional data

In the introduction, it is mentioned that Belloni, Chernozhukov and Hansen (2014a) conclude that existing statistical techniques are not well-suited to deal with high-dimensional data, either because of overfitting or due to variable selection errors. This implies that suitable approaches are required to satisfy two criteria:

1. Select, from many potential predictor or control variables, only those with true predictive power; this is known as “regularization” or dimension reduction.

2. The model needs to be selected in such a way that conclusions about specific model parameters can be drawn.

² The LASSO, developed by Tibshirani (1996), is an estimation technique which simultaneously selects the variables to be included, whilst estimating their coefficients.


Central to the first part of this chapter is how the double selection approach satisfies these criteria. The second and third sections provide overviews of the methods used in the two case studies and outline if, and how, the researchers have kept the above requirements in mind.

2.1 Double selection approach

By design, Belloni, Chernozhukov and Hansen’s (2014a) double selection approach satisfies the above-mentioned criteria for inference about parameters in high-dimensional data models. To deal with the first criterion, dimension reduction, they outline a number of techniques, of which Tibshirani’s LASSO model is preferred (Belloni, Chernozhukov, & Hansen, 2014a, pp. 31-33). The LASSO model is an estimation technique which simultaneously selects the variables to be included, whilst estimating their coefficients. This is achieved by minimising the sum of squared residuals, including a penalty for the sum of the absolute values of the coefficients, as follows:

$$\hat{\beta} = \arg\min_{b} \sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{i,j} b_j \Big)^2 + \lambda \sum_{j=1}^{p} |b_j| \Upsilon_j \tag{1}$$

where λ > 0 is the “penalty level” and Υ_j the “penalty loadings”. These penalties are set according to the algorithm outlined by Belloni, Chernozhukov and Hansen (2014b, p. 640); an algorithm which does not rely on cross validation to ensure reliable results. The algorithm contains the following steps:

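1. Set the penalty level λ to: $\lambda = 2.2\sqrt{2n \log\left(2p \log(n) / 0.1\right)}$

2. Set the initial value for Υ_j: $\Upsilon_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_{i,j}\, y_i)^2}$

3. Estimate β̂ and recalculate Υ_j: $\Upsilon_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_{i,j}\, e_i)^2}$, with $e = y - X\hat{\beta}$, where X is the n × p matrix with entries x_{i,j}.

4. Repeat step 3 until Υ converges.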
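The four steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation used in this study (which relied on MATLAB): the coordinate-descent solver, the function names, the iteration limits and the simulated data are all assumptions introduced for the example.

```python
import math
import numpy as np


def penalty_level(n, p):
    """Step 1: lambda = 2.2 * sqrt(2 n log(2 p log(n) / 0.1))."""
    return 2.2 * math.sqrt(2.0 * n * math.log(2.0 * p * math.log(n) / 0.1))


def weighted_lasso(X, y, lam, loadings, n_sweeps=100):
    """Coordinate descent for sum((y - Xb)^2) + lam * sum(loadings_j * |b_j|)."""
    n, p = X.shape
    b = np.zeros(p)
    resid = y.astype(float).copy()          # resid always equals y - X @ b
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(p):
            rho = X[:, j] @ resid + col_ss[j] * b[j]
            threshold = lam * loadings[j] / 2.0
            new_bj = np.sign(rho) * max(abs(rho) - threshold, 0.0) / col_ss[j]
            resid += X[:, j] * (b[j] - new_bj)
            b[j] = new_bj
    return b


def iterated_loadings(X, y, max_iter=15, tol=1e-4):
    """Steps 2 to 4: start from y-based loadings, then refit until they settle."""
    n, p = X.shape
    lam = penalty_level(n, p)
    loadings = np.sqrt(np.mean((X * y[:, None]) ** 2, axis=0))       # step 2
    b = np.zeros(p)
    for _ in range(max_iter):                                        # step 4
        b = weighted_lasso(X, y, lam, loadings)                      # step 3
        e = y - X @ b
        new_loadings = np.sqrt(np.mean((X * e[:, None]) ** 2, axis=0))
        converged = np.max(np.abs(new_loadings - loadings)) < tol
        loadings = new_loadings
        if converged:
            break
    return lam, loadings, b


# Simulated (hypothetical) data: only the first two of 50 regressors matter.
rng = np.random.default_rng(0)
n, p = 400, 50
X = rng.normal(size=(n, p))
X -= X.mean(axis=0)
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(size=n)
y -= y.mean()

lam, loadings, b = iterated_loadings(X, y)
selected = {int(j) for j in np.flatnonzero(b)}
print(round(lam, 1), sorted(selected))
```

The variables are demeaned beforehand so that no intercept is needed. Because the loadings are recomputed from the residuals, the effective penalty shrinks as the fit improves, which is why no cross validation is involved.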

Even though the LASSO model provides suitable forecasts, it is not designed for inference about the resulting model parameters, the second criterion. To combat this challenge, Belloni, Chernozhukov and Hansen (2014a) propose two techniques. The first technique, which is not the double selection approach, is provided to clarify when the double selection approach is too extensive and a simpler technique suffices. This is the case when estimating a linear IV model with potentially many instruments, without the need for selection amongst control variables:

$$y_i = \alpha d_i + \varepsilon_i \tag{2}$$

$$d_i = z_i'\Pi + r_i + \nu_i \tag{3}$$

where E[ε_i|z_i] = E[ν_i|z_i, r_i] = 0, but E[ε_iν_i] ≠ 0. Additionally, d_i is a scalar endogenous variable, z_i a p-dimensional vector of instruments, where p ≫ n is allowed, and r_i an approximation error. This technique is based on conventional two-stage least squares (2SLS), in which LASSO is used to select the instruments, meaning that the second-stage regression is immune to variable selection errors.

The second technique, the double selection approach, is designed to deal with a model of the following form:

$$y_i = \alpha d_i + x_i'\theta_y + r_{yi} + \zeta_i, \tag{4}$$

a model in which d_i is taken as exogenous after conditioning on control variables x_i. Furthermore, the dimension of x_i is allowed to be much larger than the number of observations, r_yi is an approximation error, α is the parameter of interest and E[ζ_i|d_i, x_i, r_yi] = 0. Naively applying a dimension reduction algorithm, such as LASSO, directly to this model will likely lead to omitted-variable bias (Belloni, Chernozhukov, & Hansen, 2014a, pp. 35-36). The double selection algorithm, on the other hand, introduces a reduced form equation relating the variable of interest d_i to the control variables x_i, leading to the following system of equations:

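$$y_i = x_i'(\alpha\theta_d + \theta_y) + (\alpha r_{di} + r_{yi}) + (\alpha\nu_i + \zeta_i) = x_i'\pi + r_{ci} + \varepsilon_i \tag{5}$$

$$d_i = x_i'\theta_d + r_{di} + \nu_i \tag{6}$$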


where E[ε_i|x_i, r_ci] = 0 and r_ci a composite approximation error. Now both equations represent predictive relationships and can thus be estimated using LASSO (or any other appropriate variable selection procedure). The union of the selected variables, from the predictions of y_i and d_i, is then used to estimate α by OLS or 2SLS. This double selection procedure ensures that the estimate for α is free from omitted-variable bias.
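The procedure described above (select controls for y_i, select controls for d_i, then regress on the union) can be illustrated with a short Python sketch. It is an illustration under simplifying assumptions rather than the algorithm of Belloni, Chernozhukov and Hansen: the penalty loadings are fixed at one instead of being iterated, the data are simulated, and the function names are hypothetical.

```python
import math
import numpy as np


def lasso_support(X, y, lam, n_sweeps=100):
    """Indices of non-zero coefficients for sum((y - Xb)^2) + lam * sum(|b_j|)."""
    b = np.zeros(X.shape[1])
    resid = y.astype(float).copy()          # resid always equals y - X @ b
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            rho = X[:, j] @ resid + col_ss[j] * b[j]
            new_bj = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_ss[j]
            resid += X[:, j] * (b[j] - new_bj)
            b[j] = new_bj
    return {int(j) for j in np.flatnonzero(b)}


def double_selection(y, d, X, lam):
    """Estimate alpha by OLS of y on d and the union of LASSO-selected controls."""
    Xc = X - X.mean(axis=0)
    sel_y = lasso_support(Xc, y - y.mean(), lam)    # controls that predict y
    sel_d = lasso_support(Xc, d - d.mean(), lam)    # controls that predict d
    keep = sorted(sel_y | sel_d)
    W = np.column_stack([np.ones(len(y)), d, X[:, keep]])
    coef = np.linalg.lstsq(W, y, rcond=None)[0]
    return coef[1], keep                            # coef[1] belongs to d


# Simulated (hypothetical) data: control 0 drives both d and y; true alpha = 1.
rng = np.random.default_rng(0)
n, p = 500, 20
X = rng.normal(size=(n, p))
d = X[:, 0] + rng.normal(size=n)
y = 1.0 * d + 0.5 * X[:, 0] + rng.normal(size=n)

lam = 2.2 * math.sqrt(2 * n * math.log(2 * p * math.log(n) / 0.1))
alpha_hat, keep = double_selection(y, d, X, lam)
print(round(alpha_hat, 3), keep)
```

Taking the union matters because a control that is only weakly related to y_i but strongly related to d_i would be dropped by a single LASSO regression of y_i, while still biasing the estimate of α when omitted.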

In summary, the double selection approach satisfies the two criteria for reliable inference about model parameters in the case of high-dimensional data, by applying LASSO to two variable selection steps, followed by OLS or 2SLS. Through these steps, the model dimensions are reduced, whilst remaining free from omitted-variable bias. How the procedure is applied in practice, is outlined in chapter three. The next two sections discuss the case studies and the estimation techniques used.

2.2 The effect of institutions on GDP

According to Belloni, Chernozhukov and Hansen (2014a), the double selection approach is a useful addition to existing IV estimation techniques, such as those used by Acemoglu, Johnson and Robinson (2012) in their study: ‘The Colonial Origins of Comparative Development: An Empirical Investigation’. A summary and an assessment of how the researchers have dealt with the criteria for estimation amongst high-dimensional data, mentioned in the introduction of this chapter, is provided below.

Acemoglu, Johnson and Robinson (2012) pose the question: “What is the effect of institutions on economic performance?”. Since they believe that it is likely that wealthy countries can afford better institutions, they choose to treat the quality of institutions, measured by the protection against “risk of expropriation” index from Political Risk Services, as an endogenous variable. As a result, they are required to find appropriate instruments for current institutions, for which they choose the mortality rates expected by the first European settlers in the colonies. The theory leading to this decision can schematically be represented as follows:

Figure 1 – Rationale for high settler mortality leading to low GDP (Acemoglu, Johnson, & Robinson, 2001, p. 1370)


The mortality rates are estimated using death rates among soldiers, bishops and sailors stationed in the colonies between the seventeenth and nineteenth centuries, and are mainly based on work by historian Philip D. Curtin (Acemoglu, Johnson, & Robinson, 2012, p. 3082). The current economic performance is measured using GDP per capita, resulting in the following 2SLS equations:

$$\log y_i = \mu + \alpha R_i + \mathbf{X}_i'\gamma + \varepsilon_i \tag{7}$$

$$R_i = \xi + \beta \log M_i + \mathbf{X}_i'\delta + \nu_i \tag{8}$$

where y_i is the income per capita in country i, R_i the protection against expropriation measure, 𝐗_i a vector of other covariates and M_i the settler mortality rate per 1,000 mean strength. The parameter of interest is α, the effect of institutions on income per capita.
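How a system such as (7) and (8) is estimated by 2SLS can be sketched as follows. The data below are simulated and the function name is hypothetical; the sketch returns only the point estimate and omits the standard errors, which require a separate correction for the generated regressor.

```python
import numpy as np


def two_stage_least_squares(y, d, Z, X):
    """Point estimate of alpha in y = alpha*d + X*gamma + eps, instrumenting d by Z.

    X holds the exogenous controls and should contain a constant column.
    """
    first = np.column_stack([Z, X])                         # first-stage regressors
    d_hat = first @ np.linalg.lstsq(first, d, rcond=None)[0]
    second = np.column_stack([d_hat, X])                    # replace d by fitted values
    return np.linalg.lstsq(second, y, rcond=None)[0][0]


# Simulated (hypothetical) data: u makes d endogenous, z is a valid instrument.
rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)
u = rng.normal(size=n)
d = z + u + rng.normal(size=n)
y = 2.0 * d + u + rng.normal(size=n)
const = np.ones((n, 1))

alpha_iv = two_stage_least_squares(y, d, z, const)
alpha_ols = np.linalg.lstsq(np.column_stack([d, const]), y, rcond=None)[0][0]
print(round(alpha_iv, 3), round(alpha_ols, 3))
```

In this simulation the OLS estimate is pushed upwards by the common shock u, whereas the 2SLS estimate stays close to the true value of 2. This mirrors the reason why institutions are instrumented in equations (7) and (8).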

Acemoglu, Johnson and Robinson (2012) find a strong relationship between settler mortality and current institutions. The corresponding 2SLS estimate of the impact of institutions on per capita GDP is 0.94 with a standard error of 0.16. Based on these estimates, and estimates of variations of the aforementioned model, they conclude that better institutions lead to higher GDP per capita, a measure of economic performance. In their concluding remarks, they mention that institutions were treated as a “black box” and that additional analysis of more fundamental institutions is to be conducted in order to generate actionable findings.

Since the original data set, constructed by Acemoglu, Johnson and Robinson (2012), is not a high-dimensional data set, it is obvious that they have not formally kept the criteria for estimation amongst high-dimensional data in mind. However, they have applied dimension reduction and, under the assumption that they have correctly specified the economic relation between institutions and economic prosperity, ensured that inference about the parameter of interest is reliable. But, since the correct specification assumption might be somewhat unrealistic, it seems reasonable to assume that a more formal approach, such as the double selection procedure, yields additional insights.

2.3 The effects of federal development programs on local prosperity

Collins and Shester (2013) use similar techniques to those used by Acemoglu, Johnson and Robinson (2012) to estimate the effects of federal development programs on local prosperity. This section provides an overview of their research and outlines if, and how, they have implemented the requirements for inference in the case of high-dimensional data.

The US urban renewal program, which started in 1949 and lasted approximately 25 years, has been criticised in most of the literature (Collins & Shester, 2013). Collins and Shester (2013, p. 239) find this surprising, since econometric evaluation of the program’s effects on local economies hardly exists, meaning that fundamental questions about its accomplishments remain unanswered. In response, they pose the question whether cities with a higher intensity of urban renewal activity have experienced different growth paths compared to cities which were observationally similar in 1950.

To answer this question, Collins and Shester (2013) believe that IV estimation is required, as cities both planned and executed the programs, leading to correlation between the dependent and main independent variable. They estimate the following relationship:

$$Y_{ij80} = \alpha + \beta_1 UR_{ij} + X'_{ij50}\beta_2 + \delta_j + u_{ij80} \tag{9}$$

$$UR_{ij} = \gamma + \tau_1 L_{ij} + X'_{ij50}\tau_2 + \lambda_j + e_{ij} \tag{10}$$

where Y_ij80 represents the economic outcomes³ in 1980 (conditional on the 1950 value of Y_ij), UR_ij the intensity of the urban renewal program, X_ij50 a list of control variables and δ_j an indicator variable for the census-division. Subscript j indicates that a different relationship is fitted for each of the nine census-divisions. In equation (10), L_ij refers to the number of years of potential participation and λ_j to a list of census-division fixed effects. To estimate this model, Collins and Shester (2013, p. 249) compiled a new data set for all cities with more than 25,000 residents in both 1950 and 1980, leading to a sample of 458 cities.

For the first stage regression, equation (10), they find that each additional year of eligibility for participation results in 10.32 additional dollars of grant per capita, with a standard error of 2.78 (Collins & Shester, 2013, p. 250). From the second stage regression, equation (9), they conclude that a $100 per capita difference in grant funding led to a 2.4 per cent higher median income in 1980 and a 6.9 per cent higher median property value. The estimated effects on employment and poverty were less precisely estimated (p > 0.05), but coincided with favourable effects (Collins & Shester, 2013, p. 254).
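The percentage effects quoted above can be related to the per-dollar coefficients reported in Table 2 of this thesis (0.000241 for log median family income and 0.000690 for log median property value). The sketch below assumes the usual log-point reading, in which a coefficient β on a log outcome implies a change of roughly 100·β·Δ per cent for a change of Δ dollars. Whether Collins and Shester used exactly this approximation is an assumption, and the function name is hypothetical.

```python
import math


def pct_effect(beta_per_dollar, dollars):
    """Return (log-point approximation, exact) percentage change in the outcome."""
    log_points = beta_per_dollar * dollars
    return 100.0 * log_points, 100.0 * (math.exp(log_points) - 1.0)


# Per-dollar coefficients on UR_ij as reported in Table 2 of this thesis.
income_approx, income_exact = pct_effect(0.000241, 100)
value_approx, value_exact = pct_effect(0.000690, 100)

# First-stage coefficient on years of exposure divided by its standard error.
first_stage_t = 10.32 / 2.78

print(round(income_approx, 2), round(income_exact, 2))
print(round(value_approx, 2), round(value_exact, 2))
print(round(first_stage_t, 2))
```

The approximation reproduces the 2.4 and 6.9 per cent figures; the exact conversion gives slightly larger values, about 7.1 per cent for property values. The first-stage ratio of roughly 3.7 indicates that the instrument is statistically significant at conventional levels.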

³ The most important economic outcomes measured are: the log of median value of owner-occupied property, log of median family income, employment rate, and poverty rate (Collins & Shester, 2013, p. 246).


Based on the first stage regression, Collins and Shester (2013, p. 251) conclude that delays in access to the program had a significant impact on the opportunity to plan and execute the program, leading to less favourable results. From the second stage regression, they draw the overall conclusion that the effects from the US urban renewal program were far more positive than generally perceived (Collins & Shester, 2013, p. 241).

Comparing the techniques used by Collins and Shester with the criteria for working with high-dimensional data, it can be concluded that they satisfy the second criterion, enabling inference about model parameters, by applying 2SLS. The way in which they have applied dimension reduction, the first criterion, is less formally defined; they seem to establish which instruments and controls to use through trial and error.

From the above text, it can be concluded that, albeit sometimes informally, the double selection approach, Acemoglu, Johnson and Robinson (2012) and Collins and Shester (2013) all satisfy the two criteria for inference about model parameters in the presence of high-dimensional data. The upcoming chapter outlines how this study tests whether the more formal approach to satisfying these criteria, the double selection procedure, leads to new insights about the research conducted by Acemoglu, Johnson and Robinson (2012) and Collins and Shester (2013).

3 Research method

The previous chapter provided details on the double selection procedure and the studies by Acemoglu, Johnson and Robinson (2012) and Collins and Shester (2013). This chapter outlines how the double selection approach is applied to these empirical examples.

The double selection approach is applied to the research by Acemoglu, Johnson and Robinson (2012) in the same way as Belloni, Chernozhukov and Hansen (2014a, pp. 45-48) do. They use the original data set by Acemoglu, Johnson and Robinson (2012), which includes 64 country-level observations. Using the following three steps, they obtain the double selection estimate for α, the parameter of interest:

1. Firstly, to ensure validity of post-model-selection inference, a set of useful controls to predict the endogenous variable R_i (the protection against expropriation measure) and the instrument M_i (the settler mortality rate per 1,000 mean strength) is selected, by applying LASSO to equations (11) and (12).

$$R_i = x_i'\bar{\Pi}_2 + \bar{\nu}_i \tag{11}$$

$$M_i = x_i'\gamma + \bar{u}_i \tag{12}$$

where x_i is the list of potential control variables used by Belloni, Chernozhukov and Hansen (2014a, p. 47).⁴

2. Secondly, to protect the final estimate against omitted-variable bias, additional controls which predict y_i are selected, by applying LASSO to equation (13).

$$\log y_i = x_i'\bar{\beta} + \bar{\varepsilon}_i \tag{13}$$

3. Finally, the treatment effect α is estimated using the original IV equations (7) and (8), with 𝐗_i set to the union of the variables selected in steps 1 and 2.

The penalties for the LASSO regressions are set according to the algorithm outlined in chapter 2.1; an algorithm which does not rely on cross validation.

To test whether the effects of federal development programs on local prosperity can be predicted more accurately by the double selection procedure than by the regular IV estimation techniques used by Collins and Shester (2013), the following three steps are taken:

1. Generate a large list of potential control variables (X̃_ij50) based on the original long-list of controls (X_ij50) selected by Collins and Shester (2013). X̃_ij50 includes squares, cubes and cross-products.

2. Select a set of useful control variables (X̅_ij50) to predict Y_ij80, UR_ij and L_ij, by LASSO regressing each of these variables separately on X̃_ij50.

3. Perform the IV estimation of equations (9) and (10), with X̅_ij50 substituted for X_ij50.
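Steps 2 and 3 above can be sketched in Python as follows. This is an illustration on simulated data, not the MATLAB code used for this study: the penalty loadings are fixed at one for brevity, the expanded control matrix is replaced by random regressors, and all function and variable names are hypothetical.

```python
import math
import numpy as np


def lasso_support(X, y, lam, n_sweeps=100):
    """Indices of non-zero coefficients for sum((y - Xb)^2) + lam * sum(|b_j|)."""
    b = np.zeros(X.shape[1])
    resid = y.astype(float).copy()          # resid always equals y - X @ b
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_sweeps):
        for j in range(X.shape[1]):
            rho = X[:, j] @ resid + col_ss[j] * b[j]
            new_bj = np.sign(rho) * max(abs(rho) - lam / 2.0, 0.0) / col_ss[j]
            resid += X[:, j] * (b[j] - new_bj)
            b[j] = new_bj
    return {int(j) for j in np.flatnonzero(b)}


def iv_with_selected_controls(y, ur, l, X_tilde):
    """Select controls predicting y, ur and l, then run 2SLS with their union."""
    n, p = X_tilde.shape
    lam = 2.2 * math.sqrt(2 * n * math.log(2 * p * math.log(n) / 0.1))
    Xc = X_tilde - X_tilde.mean(axis=0)
    keep = set()
    for target in (y, ur, l):                        # step 2: one LASSO per target
        keep |= lasso_support(Xc, target - target.mean(), lam)
    keep = sorted(keep)
    controls = np.column_stack([np.ones(n), X_tilde[:, keep]])
    first = np.column_stack([l, controls])           # step 3: first stage
    ur_hat = first @ np.linalg.lstsq(first, ur, rcond=None)[0]
    second = np.column_stack([ur_hat, controls])     # step 3: second stage
    beta1 = np.linalg.lstsq(second, y, rcond=None)[0][0]
    return beta1, keep


# Simulated (hypothetical) data: control 0 is correlated with the instrument
# and affects y, so leaving it out would invalidate the instrument.
rng = np.random.default_rng(1)
n, p = 600, 30
X = rng.normal(size=(n, p))
u = rng.normal(size=n)                               # source of endogeneity
l = 0.8 * X[:, 0] + rng.normal(size=n)
ur = l + X[:, 1] + u + rng.normal(size=n)
y = 0.5 * ur + X[:, 0] + X[:, 1] + u + rng.normal(size=n)

beta1_hat, keep = iv_with_selected_controls(y, ur, l, X)
print(round(beta1_hat, 3), keep)
```

With unit loadings the selection is somewhat liberal, so a few irrelevant controls may enter the union. This costs some precision but does not bias the second stage.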

⁴ A total of sixteen control variables: dummies for Africa, Asia, North America, and South America and a cubic-spline in latitude (twelve variables).


As stated in chapter 2.3, X_ij50 is a data set newly created by Collins and Shester (2013, p. 249), containing observations on 458 cities; all cities with more than 25,000 residents in both 1950 and 1980.

This chapter laid out how the double selection approach is applied to measure the effects of institutions on GDP and federal development programs on local prosperity. The following chapter discusses and analyses the results.

4 Results and discussion

Because the output from Belloni, Chernozhukov and Hansen’s (2014a) application of the double selection procedure to Acemoglu, Johnson and Robinson’s (2012) work could be reproduced exactly, and because the data set does not lend itself to further analysis, this chapter is solely dedicated to discussing the results from applying the double selection procedure to the work of Collins and Shester (2013). This is done in three parts. Section 4.1 describes how the data set used to perform the double selection procedure is constructed from the original data set used by Collins and Shester (2013). Section 4.2 gives an overview of the relevant results and how these compare to the estimates obtained by Collins and Shester (2013). The third, and final, part of this chapter discusses the findings in a broader context, supported by a set of suggestions for further research.

4.1 125 potential control variables

After considering a number of different IV model specifications, Collins and Shester (2013) conclude that none deviate significantly from their base specification (table 3, panel A, p. 254). The double selection estimates are therefore based on this design, which includes the following variables:


Table 1 – Descriptive statistics of variables included in Collins and Shester’s (2013, p. 254) basic IV specification.

| Variable | Description | Mean | SD | Median | Min | Max |
| --- | --- | --- | --- | --- | --- | --- |
| Dependent variables (Y_ij80) | | | | | | |
| lnmedval80 | Ln median property value in 1980 | 10.63 | 0.36 | 10.58 | 9.67 | 12.15 |
| lnfaminc80 | Ln median family income in 1980 | 9.83 | 0.17 | 9.83 | 9.15 | 10.60 |
| pemp80 | Employment rate in 1980 | 92.63 | 2.92 | 93.05 | 75.07 | 97.84 |
| pfampov80 | Percent of families in poverty 1980 | 11.04 | 5.14 | 10.12 | 1.84 | 38.93 |
| Endogenous independent variable | | | | | | |
| app_funds_pc50 (UR_ij) | UR funds per capita (1950 population) | 177.04 | 221.44 | 119.61 | 0.00 | 1896.48 |
| Instrument | | | | | | |
| yrsexposure_UR (L_ij) | Years of potential participation in UR program | 22.50 | 4.48 | 25.00 | 0.00 | 25.00 |
| Exogenous independent variables (X_ij50) ** | | | | | | |
| pownocc50 | Percent of units owner-occupied in 1950 | 52.32 | 11.69 | 52.90 | 12.40 | 88.20 |
| lnmedval50 | Ln median property value in 1950 | 9.03 | 0.29 | 9.02 | 7.72 | 9.90 |
| pdilap50 | Percent of units dilapidated 1950 | 6.62 | 5.58 | 4.83 | 0.29 | 35.47 |
| poldunits50 | Percent of 1950 units built before 1920 | 49.41 | 21.78 | 52.45 | 0.57 | 94.56 |
| punitswoplumb50 | Percent of units w/o full plumbing 1950 | 21.72 | 13.41 | 19.66 | 0.20 | 67.32 |
| pcrowd50 | Percent of units crowded in 1950 | 12.73 | 6.59 | 11.01 | 1.22 | 48.79 |
| lnpop50 | Ln population in 1950 | 11.09 | 0.88 | 10.83 | 10.13 | 15.88 |
| pnonwht50 | Percent nonwhite in 1950 | 9.28 | 11.82 | 4.20 | 0.00 | 60.70 |
| plf_manuf_50 | Percent of employment in manufacturing in 1950 | 29.51 | 15.25 | 29.20 | 4.10 | 67.90 |
| pemp50 | Employment rate in 1950 | 94.95 | 2.02 | 95.26 | 85.88 | 98.82 |
| medsch50 | Median years of schooling in 1950 | 10.31 | 1.27 | 10.20 | 5.50 | 12.80 |
| lnfaminc50 | Ln median family income in 1950 | 8.12 | 0.18 | 8.14 | 7.37 | 8.78 |
| pinc_under2g_50 | Percent of families with income under $2000 in 1950 | 21.41 | 9.00 | 19.05 | 5.10 | 61.30 |
| Cluster variable | | | | | | |
| statefip (δ_j) | Indicator variable for the census division (9 divisions) | | | | | |

* Apart from lnmedval80, which contains 457 observations, all variables contain 458 observations.
** X_ij50 also includes a set of census-division dummies (λ_j).

IV estimation based on these variables confirms the results presented by Collins and Shester (2013, pp. 250-254), apart from minor deviations in the clustered standard errors.⁵

The exogenous independent variables presented in Table 1 (X_ij50, including the census-division dummies) form the basis of the potential control variables (X̃_ij50) on which the LASSO regressions of the double selection procedure are performed. X_ij50 is complemented by X²_ij50 (each variable in X_ij50, excluding the census-division dummies, squared), X³_ij50 (each variable in X_ij50, excluding the census-division dummies, cubed) and X_ij50 ∗ X_ij50 (all possible cross-products based on X_ij50, excluding the census-division dummies). Since there are 13 variables in X_ij50, excluding the census-division dummies, X²_ij50 and X³_ij50 each contain 13 variables and X_ij50 ∗ X_ij50 contains 78 variables.⁶ In total, X̃_ij50 consists of 125 potential control variables.

⁵ These deviations are possibly caused by an internal regularisation algorithm used by Stata, in which Collins and Shester (2013) have performed their analysis. For this study, all calculations are performed using MATLAB. Such minor deviations were also noticed by Belloni, Chernozhukov and Hansen (2014a) in their accompanying notes.

Since Y_ij80 consists of four different dependent variables, the double selection procedure is performed four times, which implies 12 LASSO regressions (three regressions for each procedure). However, since the second and third selection steps, in which UR_ij and L_ij are regressed on X̃_ij50, are the same for each procedure, only six LASSO regressions are required. An overview of the variables selected by each of the regressions can be found in Appendix A (Overview of LASSO-selected variables).
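The construction of the expanded control set and the counts mentioned in this section can be checked with a short sketch. The variable names are taken from Table 1, but the data matrix is filled with random placeholder values and the function name is hypothetical. The census-division dummies are left out, which is why the sketch arrives at 117 rather than 125 columns; the remaining eight are presumably the dummies for the nine census divisions, an inference that the text does not spell out.

```python
from itertools import combinations

import numpy as np


def expand_controls(X, names):
    """Return levels, squares, cubes and pairwise cross-products of the columns of X."""
    cols, labels = [], []
    for j, name in enumerate(names):
        for power in (1, 2, 3):
            cols.append(X[:, j] ** power)
            labels.append(name if power == 1 else f"{name}^{power}")
    for i, j in combinations(range(len(names)), 2):   # each unordered pair once
        cols.append(X[:, i] * X[:, j])
        labels.append(f"{names[i]}*{names[j]}")
    return np.column_stack(cols), labels


names = [
    "pownocc50", "lnmedval50", "pdilap50", "poldunits50", "punitswoplumb50",
    "pcrowd50", "lnpop50", "pnonwht50", "plf_manuf_50", "pemp50",
    "medsch50", "lnfaminc50", "pinc_under2g_50",
]
rng = np.random.default_rng(0)
X_base = rng.normal(size=(5, len(names)))             # placeholder data, 5 rows

X_big, labels = expand_controls(X_base, names)
n_cross = sum("*" in label for label in labels)

# One LASSO regression per outcome, plus one each for UR_ij and L_ij.
outcomes = ["lnmedval80", "lnfaminc80", "pemp80", "pfampov80"]
n_lasso = len(outcomes) + 2

print(X_big.shape, n_cross, n_lasso)
```

The regressions of UR_ij and L_ij on X̃_ij50 do not depend on the outcome variable, so they need to be run only once. This is how the twelve regressions reduce to six.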

4.2 Different model specifications, similar results

An immediate observation, based on the six LASSO regressions discussed above, is that the control variables selected by the double selection procedure are completely different from the original set of exogenous independent variables used by Collins and Shester (2013). The formal LASSO procedure has selected cross-products, implying interaction effects, census-division dummies and cubed variables, instead of the original variables in X_ij50. Without considering the subsequent IV regressions, this implies that the estimates by Collins and Shester (2013) might suffer from omitted-variable bias.

Surprisingly, the vastly different model specifications lead to approximately similar results regarding β₁, the coefficient corresponding to the variable of interest (the intensity of the urban renewal program, UR_ij). The results are as follows:

⁶ The first variable can be multiplied by 12 variables, the second one by 11, the third one by 10, etc. (i.e. 12 + 11 + 10 + ⋯ + 1 + 0 = 78).


Table 2 – IV regression results: regular regression as performed by Collins and Shester (2013) and IV regression following Belloni, Chernozhukov and Hansen’s (2014a) double selection procedure. The clustered standard errors differ slightly from those originally reported by Collins and Shester (2013), which is likely the result of using different software packages (i.e. Stata versus MATLAB).

| | Ln median property value | Ln median family income | Employment rate | Poverty rate |
| --- | --- | --- | --- | --- |
| Regular IV estimation: coefficient β₁ of UR_ij | 0.000690 | 0.000241 | 0.00337 | -0.00611 |
| (clustered standard error) | (0.000357) | (0.000114) | (0.00210) | (0.00525) |
| Double selection | | | | |

When simply looking at the β₁ estimates and their corresponding standard errors, it could be concluded that IV estimation based on the model specified by Collins and Shester (2013) outperforms IV estimation based on the double selection procedure. This would even be confirmed by higher R² values. However, when investigating the results more closely, the following observations stand out:

• It seems that Collins and Shester (2013) have misspecified their model. This conclusion is supported by the following observations:

i. Based on the standard errors of the regression coefficients and on the LASSO regressions, it appears that many of the 23 control variables used by Collins and Shester (2013) are insignificant.

ii. The LASSO regressions select variables, such as interaction effects, which have not been considered by Collins and Shester (2013), implying omitted-variable bias or a non-linear relationship.

iii. Collins and Shester have used the same model to predict each of the four dependent variables, which implies that each variable can be described by the same set of independent variables. The LASSO regressions show that this is not the case.

• It is likely that the R² values obtained by Collins and Shester (2013) have been inflated by including insignificant control variables.


These observations suggest that the β₁ estimates obtained by Collins and Shester (2013, p. 254) are not as accurate as they appear to be, when considering them in isolation.

To test whether the β₁ estimates are indeed sensitive to specification errors, the IV regressions are repeated; but instead of using only significant variables as controls, a set of additional variables is added to each regression.⁷ The following results show that β₁ indeed responds strongly to changing model specifications:

Table 3 – Results from IV regression with many controls. The list of controls consists of all variables, 23 in total, which are selected by any of the LASSO regressions 1a, 1b, 1c, 1d, 2 or 3 (see Appendix A). The only difference with the LASSO regressions of which the results are displayed in Appendix A is that the regressors (X̃_ij50) from which the variables are selected consist of X_ij50 and X_ij50 ∗ X_ij50 only (i.e. X²_ij50 and X³_ij50 have been excluded upfront).

| | Ln median property value | Ln median family income | Employment rate | Poverty rate |
| --- | --- | --- | --- | --- |
| IV estimation with many controls | | | | |

When considering the results in Table 3 in isolation, one could conclude that the urban renewal program had a very significant and positive impact on all variables, apart from the median property value. Since the model underlying the IV estimates displayed in Table 3 has deliberately been misspecified, such a conclusion would be incorrect.

Going back to the criteria for inference about model parameters in the case of high-dimensional data, outlined at the start of chapter 2, it can be concluded that Collins and Shester (2013) might not have satisfied the second criterion, specifying the model in a way which enables reliable inference about the parameters. Their in-sample fit might be satisfactory, but it seems unlikely that out-of-sample predictions will be accurate too. Fortunately for Collins and Shester (2013) though, the signs of the coefficients of interest seem insensitive to changing model specifications, implying that their conclusion about the positive effects of the urban renewal program does not need to be altered.

⁷ The controls included are the union of the variables selected during any of the six LASSO regressions, regardless of the appropriateness for the IV estimation in question (i.e. the same large set of control variables for each IV regression).


4.3 Double selection procedure provides additional insights

Since only a minority of the variables selected by the double selection procedure were included in the original model specification by Collins and Shester (2013), it can be concluded that the technique seems well suited to detecting non-obvious relationships. Especially when the set of available control variables, either after expansion with non-linear combinations of the base variables or in its original form, is substantial, the double selection procedure provides a useful addition to the existing analysis tools. This is partially the case because it enables researchers to reduce the analysis time required, but mostly because it guards against overfitting, meaning that inference about model parameters is indeed more reliable.
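The expansion of the base controls with non-linear combinations can be illustrated with a minimal Python sketch. It assumes pairwise products plus squares and cubes of each base variable. The `_x_` and `_3` suffixes follow the variable names listed in Appendix A, while the `_2` suffix, the function name `expand_controls` and the toy values are assumptions made purely for illustration.

```python
from itertools import combinations

def expand_controls(data):
    """Return a copy of `data` (name -> list of values) extended with
    pairwise products, squares and cubes of the base variables."""
    expanded = dict(data)
    # Pairwise interactions, named like lnmedval50_x_medsch50.
    for a, b in combinations(data, 2):
        expanded[f"{a}_x_{b}"] = [u * v for u, v in zip(data[a], data[b])]
    # Squares and cubes; the "_2" suffix is an assumed naming convention.
    for name, column in data.items():
        expanded[f"{name}_2"] = [u ** 2 for u in column]
        expanded[f"{name}_3"] = [u ** 3 for u in column]
    return expanded

# Toy values for two cities; these numbers are made up.
base = {"lnmedval50": [1.0, 2.0], "medsch50": [3.0, 4.0]}
expanded = expand_controls(base)
print(sorted(expanded))
```

With k base variables this yields k + k(k−1)/2 + 2k candidate controls, which is how a modest set of base variables can quickly approach or exceed the number of observations.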

Besides the directly observable advantages mentioned above, the double selection procedure also seems suitable for reliable out-of-sample predictions, as the model is likely to include all relevant variables and keep out the irrelevant ones. Since Belloni, Chernozhukov and Hansen (2014a) designed the double selection procedure for inference about model parameters, not for out-of-sample predictions, further research into the procedure’s appropriateness for making such predictions seems warranted.

An obvious limitation of these conclusions is that they are based on one basic case study. As a result, two uncertainties prevail. The first uncertainty concerns the findings regarding Collins and Shester’s (2013) research. The question which remains is: What causes the differences between the various IV estimates? Rigorous analysis of the residuals might provide additional insights, but it could also be that explanatory variables are missing from the data set. The second uncertainty results from the fact that the findings are based on a single case, implying that additional research is required to confirm, or reject, this study’s conclusions.

5 Conclusion

Data are generated at an unprecedented pace. As a result, high-dimensional data sets, data sets with a large number of variables relative to the number of observations, are becoming increasingly common. Applying techniques such as OLS to these data sets frequently leads to overfitting. To combat this challenge, specific techniques for dealing with high-dimensional data have been designed. Although these techniques tend to do a good job at prediction, they often lead to incorrect conclusions about regression coefficients due to variable selection errors. Therefore, if a researcher’s objective is reliable inference about model parameters in the case of unknown economic relationships, he or she lacks the right statistical tools.

An attempt at designing a technique which enables reliable inference comes from Belloni, Chernozhukov and Hansen. Their double selection procedure starts with dimension reduction, preferably using Tibshirani’s LASSO, by removing those variables which contribute to predicting neither the endogenous variable of interest nor the dependent variable. The reduced set of variables can subsequently be used in an OLS or 2SLS regression, resulting in a model which is free from omitted-variable bias and accurately captures non-linear effects. To assess the relevance of the double selection procedure, this research’s objective was to answer: To what extent can the double selection procedure provide additional insights about the relationship between federal development programs and local prosperity, originally estimated by Collins and Shester (2013)?

Applying the double selection procedure to the work of Collins and Shester has highlighted that the procedure indeed assists in providing reliable inference about the parameter of interest. Its main strength seems to come from its ability to detect non-obvious relationships. This is achieved by conducting variable selection steps in which LASSO regressions determine the suitability of a large set of potential explanatory variables. Subsequently including these non-obvious relationships in the structural equation, which is estimated following the LASSO regressions, helps to ensure that no relevant variables are omitted and that the coefficient of interest is estimated accurately. Especially when the set of potential control variables, either after expansion with non-linear combinations of the base variables or in its original form, is substantial, the double selection procedure provides a useful addition to the existing analysis tools. This is partially the case because it enables researchers to reduce the analysis time required, but mostly because it guards against overfitting, meaning that inference about model parameters is indeed more reliable.
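The selection steps and the subsequent structural regression can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the estimation code used in this study: it covers the simpler case of an exogenous treatment estimated by OLS rather than the IV setting (which adds a third selection step for the instrument), it runs on simulated data with a true effect of 0.5, and the penalty `lam` is picked by hand instead of by the rule proposed by Belloni, Chernozhukov and Hansen. All function names are hypothetical.

```python
import random

def center(v):
    m = sum(v) / len(v)
    return [x - m for x in v]

def soft_threshold(rho, lam):
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

def lasso_select(columns, y, lam, sweeps=100):
    """Coordinate-descent LASSO minimising (1/2n)||y - Xb||^2 + lam*||b||_1
    on centered data; returns the indices of the non-zero coefficients."""
    n = len(y)
    cols = [center(c) for c in columns]
    resid = center(y)
    scale = [sum(x * x for x in c) / n for c in cols]
    beta = [0.0] * len(cols)
    for _ in range(sweeps):
        for j, c in enumerate(cols):
            if scale[j] == 0.0:
                continue
            # Covariance of column j with the residual, adding back its own fit.
            rho = sum(a * b for a, b in zip(c, resid)) / n + scale[j] * beta[j]
            new = soft_threshold(rho, lam) / scale[j]
            if new != beta[j]:
                step = new - beta[j]
                resid = [r - step * a for r, a in zip(resid, c)]
                beta[j] = new
    return {j for j, b in enumerate(beta) if b != 0.0}

def ols(columns, y):
    """OLS with intercept via the normal equations (Gauss-Jordan elimination)."""
    n = len(y)
    z = [[1.0] * n] + columns
    k = len(z)
    a = [[sum(p * q for p, q in zip(z[i], z[j])) for j in range(k)]
         + [sum(p * q for p, q in zip(z[i], y))] for i in range(k)]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(a[r][i]))
        a[i], a[pivot] = a[pivot], a[i]
        for r in range(k):
            if r != i:
                f = a[r][i] / a[i][i]
                a[r] = [u - f * v for u, v in zip(a[r], a[i])]
    return [a[i][k] / a[i][i] for i in range(k)]

def double_selection(columns, d, y, lam):
    # Step 1: controls that predict the outcome; step 2: controls that predict
    # the treatment; step 3: OLS of the outcome on the treatment and the union.
    kept = sorted(lasso_select(columns, y, lam) | lasso_select(columns, d, lam))
    coefs = ols([d] + [columns[j] for j in kept], y)
    return coefs[1], kept  # coefs[0] is the intercept

# Simulated data: 30 candidate controls, of which X[0] confounds d and y.
random.seed(0)
n, p = 200, 30
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
d = [X[0][i] + random.gauss(0, 1) for i in range(n)]
y = [0.5 * d[i] + X[0][i] + X[1][i] + random.gauss(0, 0.5) for i in range(n)]
alpha_hat, kept = double_selection(X, d, y, lam=0.2)
print(round(alpha_hat, 2), kept)
```

The union in the third step is what distinguishes the procedure from a single LASSO of the outcome on all regressors: a control that only weakly predicts the outcome but strongly predicts the treatment is still retained, which limits omitted-variable bias in the coefficient of interest.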

An obvious limitation of this study is that the conclusions are based on one basic case study. This implies that additional research is required to confirm, or reject, the robustness of the double selection procedure. Besides this limitation, a question about the work of Collins and Shester remains, as it is unclear why the various IV specifications lead to substantially different estimates of the coefficients of interest. It could be that explanatory variables are missing from their original data set. Rigorous analysis of the residuals might provide additional insights. Another topic for further research could be the double selection procedure’s suitability for reliable out-of-sample predictions, as the structural equations following the selection steps are likely to include all relevant variables and keep out the irrelevant ones.


Bibliography

Acemoglu, D., Johnson, S., & Robinson, J. A. (2001). The Colonial Origins of Comparative Development: An Empirical Investigation. American Economic Review, 91(5), 1369–1401.

Acemoglu, D., Johnson, S., & Robinson, J. A. (2012). The Colonial Origins of Comparative Development: An Empirical Investigation: Reply. American Economic Review, 102(6), 3077–3110.

Belloni, A., Chernozhukov, V., & Hansen, C. (2014a). High-Dimensional Methods and Inference on Structural and Treatment Effects. The Journal of Economic Perspectives, 28(2), 29–50.

Belloni, A., Chernozhukov, V., & Hansen, C. (2014b). Inference on Treatment Effects after Selection among High-Dimensional Controls. Review of Economic Studies, 81(2), 608–650.

Collins, W. J., & Shester, K. L. (2013). Slum Clearance and Urban Renewal in the United States. American Economic Journal: Applied Economics, 5(1), 239–273.

IBM. (2016, May 07). What is big data? Retrieved from www-01.ibm.com: http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html

Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal


Appendix A Overview of LASSO-selected variables

Table 4 – Results from LASSO regressing 𝑌_𝑖𝑗80 (regressions 1a – 1d), 𝑈𝑅_𝑖𝑗 (regression 2) and 𝐿_𝑖𝑗 (regression 3) on the full set of potential control variables 𝑋̃_𝑖𝑗50. The next stage of the double selection procedure is an IV regression in which the exogenous variables (𝑋̅_𝑖𝑗50) are the union of the variables selected in LASSO regressions 1a, 1b, 1c or 1d and regressions 2 and 3. For example, when predicting Ln median property value in 1980, 𝑋̅_𝑖𝑗50 is made up of the variables selected by LASSO regressions 1a, 2 and 3.

Selected variable | Description of selected variable

LASSO regression 1a: Ln median property value in 1980 (lnmedval80) on controls

lnmedval50_3 | Ln median property value in 1950, cubed
medsch50_3 | Median years of schooling in 1950, cubed
poldunits50_x_plf_manuf_50 | Percent of 1950 units built before 1920 * Percent of employment in manufacturing in 1950
plf_manuf_50_x_pinc_under2g_50 | Percent of employment in manufacturing in 1950 * Percent of families with income < $2000 in 1950
Iregion_42 | Census division #42

LASSO regression 1b: Ln median family income in 1980 (lnfaminc80) on controls

pownocc50_x_medsch50 | Percent of units owner-occupied in 1950 * Median years of schooling in 1950
lnmedval50_x_pemp50 | Ln median property value in 1950 * Employment rate in 1950
lnmedval50_x_medsch50 | Ln median property value in 1950 * Median years of schooling in 1950
lnpop50_x_pinc_under2g_50 | Ln population in 1950 * Percent of families with income < $2000 in 1950
plf_manuf_50_x_pinc_under2g_50 | Percent of employment in manufacturing in 1950 * Percent of families with income < $2000 in 1950
pemp50_x_lnfaminc50 | Employment rate in 1950 * Ln median family income in 1950

LASSO regression 1c: Employment rate in 1980 (pemp80) on controls

medsch50_3 | Median years of schooling in 1950, cubed
lnmedval50_x_pemp50 | Ln median property value in 1950 * Employment rate in 1950
lnpop50_x_plf_manuf_50 | Ln population in 1950 * Percent of employment in manufacturing in 1950
pemp50_x_medsch50 | Employment rate in 1950 * Median years of schooling in 1950

LASSO regression 1d: Percent of families in poverty in 1980 (pfampov80) on controls

lnpop50 | Ln population in 1950
pownocc50_x_medsch50 | Percent of units owner-occupied in 1950 * Median years of schooling in 1950
lnpop50_x_pnonwht50 | Ln population in 1950 * Percent non-white in 1950
pemp50_x_lnfaminc50 | Employment rate in 1950 * Ln median family income in 1950

LASSO regression 2: UR funds per capita in 1950 (app_funds_pc50) on controls

pownocc50_3 | Percent of units owner-occupied in 1950, cubed
pownocc50_x_lnmedval50 | Percent of units owner-occupied in 1950 * Ln median property value in 1950

LASSO regression 3: Years of potential participation in UR program (yrsexposure_UR) on controls

poldunits50_x_lnpop50 | Percent of 1950 units built before 1920 * Ln population in 1950
pcrowd50_x_medsch50 | Percent of units crowded in 1950 * Median years of schooling in 1950
lnpop50_x_plf_manuf_50 | Ln population in 1950 * Percent of employment in manufacturing in 1950
plf_manuf_50_x_medsch50 | Percent of employment in manufacturing in 1950 * Median years of schooling in 1950
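The union rule described in the caption of Table 4 can be made concrete with a short Python sketch. The variable names are copied from the table above; the dictionary layout and the helper `iv_controls` are assumptions made for illustration only.

```python
# Variables selected by each LASSO regression, copied from Table 4.
selected = {
    "1a": {"lnmedval50_3", "medsch50_3", "poldunits50_x_plf_manuf_50",
           "plf_manuf_50_x_pinc_under2g_50", "Iregion_42"},
    "1b": {"pownocc50_x_medsch50", "lnmedval50_x_pemp50",
           "lnmedval50_x_medsch50", "lnpop50_x_pinc_under2g_50",
           "plf_manuf_50_x_pinc_under2g_50", "pemp50_x_lnfaminc50"},
    "1c": {"medsch50_3", "lnmedval50_x_pemp50", "lnpop50_x_plf_manuf_50",
           "pemp50_x_medsch50"},
    "1d": {"lnpop50", "pownocc50_x_medsch50", "lnpop50_x_pnonwht50",
           "pemp50_x_lnfaminc50"},
    "2": {"pownocc50_3", "pownocc50_x_lnmedval50"},
    "3": {"poldunits50_x_lnpop50", "pcrowd50_x_medsch50",
          "lnpop50_x_plf_manuf_50", "plf_manuf_50_x_medsch50"},
}

# Which outcome regression belongs to which 1980 outcome variable.
outcome_regression = {"lnmedval80": "1a", "lnfaminc80": "1b",
                      "pemp80": "1c", "pfampov80": "1d"}

def iv_controls(outcome):
    """Union of the controls selected for the outcome (1a-1d), the
    treatment (2) and the instrument (3)."""
    return selected[outcome_regression[outcome]] | selected["2"] | selected["3"]

for outcome in outcome_regression:
    print(outcome, len(iv_controls(outcome)), sorted(iv_controls(outcome)))
```

Because the sets are combined with a union, a variable selected in more than one regression, such as lnpop50_x_plf_manuf_50 in regressions 1c and 3, enters the IV regression only once.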