
High Dimensional Inference using Lasso Estimation

Wouter Antonius Griffioen

10017631

MSc in Econometrics

Specialization: Econometrics and Financial Econometrics
Date: July 15, 2016

Supervisor: Prof. Dr. Frank Kleibergen
Second reader: Dr. Maurice Bun


Abstract

This study aims to link a comprehensive theoretical background of lasso estimation with its use in practice. Using a series of Monte Carlo experiments, it establishes that hypothesis testing by a straightforward t-test on the parameters estimated by the classic lasso may result in erroneous conclusions. Two alternatives that build upon the classic lasso are suggested: orthogonalizing the data with respect to a target parameter before applying the lasso estimation procedure, and a practically feasible χ2-test that can be used directly on the lasso estimator. Both alternatives perform significantly better in terms of size and power in the same Monte Carlo experiments. However, when the true data generating process has as many parameters as observations, the orthogonalization method loses its size-correctness. As a final part of this study the three methods are applied to predicting the one year ahead consumer price index. The application shows how the lasso selects the few most relevant variables. Comparing mean squared errors shows that the rigorous lasso has the best predictive accuracy, with classic lasso estimation a close second.

Keywords Lasso, Lars, High-dimensional, Statistical inference, Monte Carlo

Acknowledgements

First of all, I am grateful to Prof. Dr. Frank Kleibergen for supervising this thesis project, monitoring my progress and providing me with valuable insights and suggestions. Furthermore, I would like to thank Prof. Dr. Victor Chernozhukov and Dr. Martin Spindler for their support with the R-package hdm when needed. Next, I am thankful to Dr. Anders Kock for providing his R-code for me to use. Lastly, I would like to thank Annemiek Griffioen and Nina van Ettekoven for revising.

This document is written by student Wouter A. Griffioen, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.


Contents

1 Introduction
2 Forward selection and lasso estimation
  2.1 Least angle regression
  2.2 Lasso estimator
    2.2.1 Sparse solutions
    2.2.2 Cross-validation
3 Simulation setup and parameter significance with lasso
  3.1 Monte Carlo and t-test
  3.2 Significant β1
  3.3 Significant β1 with nuisance parameters
  3.4 Correlation in regressors
4 Rigorous lasso and inference on target coefficients
  4.1 Honest inference on target coefficients and rigorous lasso
    4.1.1 Rigorous lasso
    4.1.2 Inference on target coefficients and the orthogonality principle
  4.2 Significant β1
  4.3 Significant β1 with nuisance parameters
  4.4 Correlation in regressors
  4.5 Power curves
5 Conservative lasso and honest inference on high dimensional parameters
  5.1 Asymptotically feasible hypothesis test with conservative lasso
    5.1.1 Conservative lasso
    5.1.2 Conducting inference using desparsification
  5.2 Significant β1
  5.3 Significant β1 with nuisance parameters
  5.4 Correlation in regressors
  5.5 Power curves
6 High dimensional case comparing both methods
  6.1 Power comparison
  6.2 High dimensional simulation
    6.2.1 Rejection frequency in a high dimension
7 Application: Selecting inflation factors using lasso
  7.1 Data and dynamic factor model
  7.2 Applying lasso to select inflation factors
  7.3 Out of sample forecasting inflation
8 Concluding remarks
References


1 Introduction

In their study on the social construction of firm value, Nijholt, Bezemer, and Reinmoeller (2015) find that using novel words like 'big data' may lead to overvaluation of firms by security analysts. Apparently, obeying the social perception of displaying progressiveness increases your value as a company. This provides an explanation for the recent flood of the term big data in the media and, more importantly, in the mission statements and company descriptions of firms.

However, stating that big data is of importance for your company is something different than actually using it. Correctly analyzing large amounts of data is more difficult than it may seem. According to Carter (2014), the biggest mistake companies make is that they collect and manage large amounts of data with little forethought and have no idea how to effectively make sense of it. The fact that 90% of all the data in the world has been generated over the last few years, while only about 0.5% is actually analyzed, only strengthens this argument (Dragland, 2013).

Furthermore, there is a large shortage of individuals skilled and educated well enough to make data analysis truly beneficial. The McKinsey Global Institute estimates that by 2018 the demand for employees skilled in data analysis will surpass supply by 50% to 60% (Manyika et al., 2011). The direct consequence is that employees currently doing something with data are forced to provide statistical analyses they may not be capable of or educated for. This might lead to erroneous managerial decisions based on data that can cause harm in the long term. Or, as the former librarian Rutherford D. Roger once put it: "We are drowning in information but starving for knowledge".

It is nevertheless clear that the processing of large amounts of data has become more and more important. From science to sports and from business to entertainment, data is mined and analyzed every day and the benefits seem promising. It is crucial to be able to filter out the noise and capture the essential signal of the data. But as argued above it is difficult for the unspecialized practitioner, the employee in the data department or researchers from fields that do not specialize in statistical methods, to correctly employ statistical models in order to solve the problems at hand.

A problem of working with large data sets is that the vast majority is complex and unstructured. If the aim is to estimate the effects of different variables on a certain dependent variable, or simply to seek out the variables that affect it most, it is desirable to be able to shrink the data down to a few key factors that capture the bare essentials of the data. Doing so automatically imposes an important assumption, namely that the world is not as complex as it might be. For example, that customer ratings on a dozen books and movies give a good idea of the customer's taste. Or that only a few factors are important in predicting the outcome of a game of football. Or that not all of the roughly 30,000 genes in the human body are directly involved in the process that leads to the development of cancer.

One form of this simplicity assumption is sparsity. A sparse model may be defined as a model in which only a small number of parameters play an important role (Hastie, Tibshirani, & Wainwright, 2015). Exploiting sparsity to help recover the underlying information in a data set may be best illustrated with an example. Consider the linear regression model below.

\[
y_i = \beta_0 + \sum_{j=1}^{k} x_{ij}\beta_j + \varepsilon_i \tag{1}
\]

where $\beta_0, \beta_1, \dots, \beta_k \in \mathbb{R}$ are unknown parameters, $x_{ij} \in \mathbb{R}$ are regressors, or explanatory variables, and $\varepsilon_i \in \mathbb{R}$ is an error term. The least squares method provides estimates of these unknown parameters by minimization of the squared error, or
\[
\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\beta_j \Big)^2. \tag{2}
\]

In general the least squares estimates of β will be nonzero. This makes the interpretation of the model difficult if k is large or, worse, if k > N the solution is not even unique. In that case there is an infinite set of solutions that set the objective function equal to zero. Thus, in order to be able to find a meaningful solution there is a need to constrain the problem somehow, thereby regularizing the estimation process.

A novel way to add a constraint to the optimization problem is provided by Tibshirani (1996) and named the lasso regression. The lasso regression estimates the parameters not by solving Equation (2), but by solving

\[
\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{k} x_{ij}\beta_j \Big)^2, \quad \text{subject to } \|\beta\|_1 \le t, \tag{3}
\]
where $\|\beta\|_1 = \sum_{j=1}^{k} |\beta_j|$ is the $\ell_1$ norm of $\beta$ and $t \in \mathbb{R}$ is a user-defined parameter.

The advantage of using the $\ell_1$ norm is that if the constraint is set small enough the lasso yields sparse solution vectors, setting only a few parameters to nonzero values. This makes the interpretation of the model convenient and clear-cut relative to the classical least squares estimator. As an extra advantage, the sparsity in combination with the fact that the problem is convex greatly simplifies computations. A third argument for the use of sparsity is the 'bet on sparsity' principle, stating that one should employ a procedure that does well in sparse problems, since no procedure does well in dense problems.

The introduction above reveals a problem: loosely speaking, there are two kinds of worlds involved in the analysis of large data sets. On the one hand, there is the media and the around-the-corner data analysts who are full of the popular notion of big data and often perform analyses that are unreliable or simply wrong. On the other hand, there is an abundance of theory which provides neat proofs based on certain assumptions but is difficult to, and sometimes unclear how to, apply correctly in order to arrive at trustworthy results.

The goal of this study therefore is to connect the two worlds: it aims to provide theoretical background while keeping the practitioner in mind. The lasso estimator is the central estimation procedure of this study, but two more sophisticated methods provided by Belloni, Chernozhukov, and Hansen (2014) and Caner and Kock (2014) that build on the lasso are treated as well. The idea is to analyze the behavior of the methods when performing statistical inference using a series of Monte Carlo simulations that all have a slightly different data generating process. The estimation results of all methods are highly dependent on the size of the constraint, which has to be set by the practitioner. Therefore, this plays a key role in the analysis. Furthermore, the ease with which a method can be applied is important to really provide the practitioner with something that can be worked with.

In order to achieve common ground between practice and theory this study is set up as follows. The subsequent chapter provides the core theoretical background for this study. It treats the optimization algorithm used to arrive at the lasso estimator, provides a comprehensive review of the lasso itself and explains its appealing properties. Next to that, the final section of Chapter 2 explains the often used cross-validation method for setting the constraint on the optimization. The third chapter sets out the different Monte Carlo experiments that are used throughout this study. Furthermore, it performs the experiments on the size of testing for statistical significance in the most straightforward setting as the practitioner would apply it, thereby immediately revealing issues that may arise.

Chapter 4 uses the orthogonality principle before conducting inference on a target coefficient as provided by Chernozhukov, Hansen, and Spindler (2014). It performs the same experiments as in Chapter 3, but also analyzes power in addition to the size properties. The same analysis is done in Chapter 5, but now with a different test provided by Caner and Kock (2014). Both Chapter 4 and Chapter 5 begin by providing background theory for the relevant applied method.

After performing the same series of simulation exercises on all three methods, Chapter 6 makes a clearer comparison of their power and performs a high dimensional simulation exercise to put the more sophisticated methods to the test. Even though the simulations are performed as if they would be used in practice, to come full circle Chapter 7 applies all three estimation procedures to real data. In this application the factors most relevant in predicting future inflation are selected using lasso as an extension of Kliesen et al. (2008). Next to that, the predictive power of the three lasso alternatives is compared using mean squared errors. This gives a better insight into how to apply lasso estimation and the benefits it brings in selecting the most important variables. Finally, Chapter 8 summarizes the main findings of the experiments and concludes on the ease of use and applicability of the three methods. Next to that it makes some concluding remarks, treats limitations of this study and provides suggestions for both the practitioner and future research.

2 Forward selection and lasso estimation

To be able to apply the lasso estimation procedure and test for the significance of parameters in the lasso setting, it is essential to gain understanding of the background theory and the algorithms at hand. This chapter therefore discusses the theory of forward selection and in particular the lasso estimation procedure, which is the main interest of this study.

Forward selection, sometimes referred to as forward stepwise regression, is described by Weisberg (2005) as follows. Let y, an n × 1 vector, be the dependent variable and xj possible n × 1 explanatory variable or predictor vectors, where j = 1, ..., p with p > k and p, k, n ∈ N. Given this set of possible predictors, select the one having the largest correlation with the dependent variable y. Let this be, without loss of generality, x1 and perform a simple linear regression of y on x1. This regression leaves a residual vector orthogonal to x1, which is now considered to be the dependent variable. Perform an orthogonal projection of the other explanatory variables with respect to x1 and repeat the selection process by finding the xj that has the largest correlation with the new dependent variable. After k steps this results in a set of explanatory variables x1, ..., xk that are used to construct a k-parameter linear model.
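As an illustration, the following is a minimal sketch of the forward selection steps just described, written in R. The function name and the use of lm() for the residual regressions are illustrative choices, not code from the thesis.

```r
# Sketch of forward selection: repeatedly pick the column most correlated with the
# current residual, regress on it, and orthogonalize the remaining columns.
forward_selection <- function(X, y, k) {
  p <- ncol(X)
  active <- integer(0)
  Xw <- scale(X, center = TRUE, scale = FALSE)    # centered working copy of X
  resid <- y - mean(y)
  for (step in seq_len(k)) {
    remaining <- setdiff(seq_len(p), active)
    cors <- abs(cor(Xw[, remaining, drop = FALSE], resid))
    j <- remaining[which.max(cors)]               # most correlated with the residual
    active <- c(active, j)
    resid <- residuals(lm(resid ~ Xw[, j]))       # new dependent variable
    for (m in setdiff(seq_len(p), active)) {      # project remaining columns off x_j
      Xw[, m] <- residuals(lm(Xw[, m] ~ Xw[, j]))
    }
  }
  active                                          # indices of the k selected predictors
}
```

The returned index set can then be passed to lm() to fit the k-parameter linear model on the original columns.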

The procedure of forward selection is aggressive in the sense that it might, for instance, at the second step eliminate useful explanatory variables that could be correlated with x1 (Efron, Hastie, Johnstone, Tibshirani, et al., 2004). A more prudent version of forward selection is forward stagewise, which may take numerous small steps as it moves towards a final model. Forward stagewise therefore is not as aggressive as forward selection, but it greatly inflates the number of steps to be taken, thereby causing a computational burden.

With forward selection being too aggressive and forward stagewise too cautious, it is natural to aim for middle ground. The least angle regression (lars) algorithm, studied by Efron et al. (2004), provides this by being not as rigid as forward selection but taking larger steps than forward stagewise. The subsequent section lays out the intuition of the lars algorithm as provided by Efron et al. (2004). Finally, section 2.2 describes the lasso estimation procedure and establishes the connection with the lars algorithm.


2.1 Least angle regression

Although this study is about the lasso estimator, this section presents the lars algorithm. The motivation to do so is that the lars algorithm is an intuitive one and is closely related to the lasso. Indeed, a slightly modified lars algorithm yields all lasso solutions and is therefore used in this study (Efron et al., 2004). This section only treats the intuitive functioning of least angle regression; more detail and mathematical proofs are provided by Efron et al. (2004).

The lars algorithm starts with all coefficients equal to zero and finds the predictor that is most correlated with the dependent variable y; as above, say x1 is the most correlated with y. The algorithm then takes as large a step as possible in the direction of x1, up to the point where there is another explanatory variable, say x2, that is just as correlated with the current residual as x1. Now, instead of continuing in the direction of x1, lars advances in the direction equiangular between x1 and x2, up to the point where a third predictor, x3, is just as correlated with the current residual vector as x1 and x2. The lars algorithm then proceeds equiangularly between x1, x2 and x3 until a fourth explanatory variable enters, and so on (Efron et al., 2004). This process shows that the algorithm is continuously advancing along the least angle direction, hence its name.

This process may become clearer when explained graphically in a simplistic setting. Figure 1 shows an example with just two explanatory variables; the path of the lars algorithm is indicated in red. Lars builds up estimates $\hat\mu = X\hat\beta$ step by step, each step adding one predictor so that after k steps just k of the estimated parameters are nonzero. In this case, with just two predictors, the linear space $\mathcal{S}(X)$ is spanned by $x_1$ and $x_2$. In Figure 1, $\bar y$ is the projection of $y$ onto $\mathcal{S}(X)$. The residual $\bar y - \hat\mu_0$ has a larger correlation with $x_1$ than with $x_2$, so the algorithm moves along $x_1$ up to $\hat\mu_1$. At $\hat\mu_1$ the residual $\bar y - \hat\mu_1$ exactly bisects $x_1$ and $x_2$, so that the correlations equal each other. The next step then is to move equiangularly between $x_1$ and $x_2$ towards $\bar y$. If there were more variables, the direction would once more be changed when a third explanatory variable enters.

Figure 1: Graphical representation of the least angle regression algorithm.

2.2 Lasso estimator

This section defines the lasso estimator in accordance with Tibshirani (1996) and provides some insight into the functioning of the lasso estimator.

In a linear regression setting we have a set of N samples $\{(x_i, y_i)\}_{i=1}^{N}$, where each $x_i = (x_{i1}, \dots, x_{ik})$ is a k-dimensional vector of explanatory variables, and each $y_i \in \mathbb{R}$ is the associated dependent or response variable. The aim is to approximate $y_i$ using a linear combination of the explanatory variables. The usual procedure is to minimize the squared error loss, which results in the ordinary least squares (OLS) estimator. The lasso procedure uses the OLS strategy but adds a constraint to the optimization problem.

Letting $\hat\beta = (\hat\beta_1, \dots, \hat\beta_k)$ be the estimated parameter vector and $\hat\alpha$ a constant, Tibshirani (1996) defines the lasso estimate $(\hat\alpha, \hat\beta)$ by
\[
\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \alpha - \sum_{j=1}^{k} x_{ij}\beta_j \Big)^2, \quad \text{subject to } \|\beta\|_1 = \sum_{j=1}^{k} |\beta_j| \le t. \tag{4}
\]
Here $t \ge 0$ is a tuning parameter and $\|\beta\|_1$ the $\ell_1$ norm of $\beta$. The smaller $t$ is set, the more constrained the problem and, thus, the more limited the fit to the data. Furthermore, for all $t$, the solution for $\alpha$ is $\hat\alpha = \bar y$ and, without loss of generality, the assumption $\bar y = 0$ can be made so that the constant $\alpha$ may be omitted.

Equation (4) may be rewritten as the following Lagrangean minimization problem
\[
\min_{\beta} \left\{ \frac{1}{2N} \sum_{i=1}^{N} \Big( y_i - \sum_{j=1}^{k} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{k} |\beta_j| \right\}, \tag{5}
\]
as defined in Hastie et al. (2015). In this setting the parameter $\lambda$ appears instead of the tuning parameter $t$ and is from here on called the lasso penalty. Although the relation between $t$ and $\lambda$ depends on the data at hand, they play approximately the same role in tuning the lasso estimator, be it in reverse of their magnitude.
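As a concrete illustration of problem (5), the following sketch fits the lasso at a fixed penalty with the glmnet package, which is also used for the simulations in Chapter 3. The simulated data and the chosen value of λ are purely illustrative.

```r
library(glmnet)

# Simulated data in the spirit of the experiments later in this study
set.seed(1)
n <- 500; k <- 20
X <- matrix(rnorm(n * k), n, k)
y <- as.vector(X %*% c(1, rep(0, k - 1)) + rnorm(n))

# glmnet with alpha = 1 solves the Lagrangean lasso problem (5) for a given lambda
lambda <- exp(-3)
fit <- glmnet(X, y, alpha = 1, lambda = lambda,
              intercept = FALSE, standardize = FALSE)
coef(fit)   # sparse coefficient vector: most entries are exactly zero

# Omitting the lambda argument computes a whole solution path at once;
# fit_path$df then gives the number of nonzero coefficients at each penalty value.
fit_path <- glmnet(X, y, alpha = 1, intercept = FALSE, standardize = FALSE)
```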


The computation of the lasso is a quadratic programming problem which can be solved by standard numerical analysis algorithms, like cyclical coordinate descent, which is thoroughly explained by Hastie et al. (2015). But the least angle regression procedure provides an efficient way to compute the lasso solutions simultaneously for all values of the penalty. However, the lars algorithm described in the previous section needs a slight modification to do so. In the lars algorithm it may happen that a non-zero coefficient hits zero but stays in the active set as a consequence of the monotonically increasing nature of the algorithm. For the lasso estimation, if this is the case the coefficient needs to be removed from the active set of explanatory variables and the joint direction needs to be recalculated. Thus, the lasso procedure allows the active set of parameters to decrease. Furthermore, the increases and decreases in the active set are assumed never to involve more than one variable at a time (Efron et al., 2004).

From the definition of the lasso estimator it is clear that the chosen penalty has a great influence on the estimator. Setting the penalty very large causes the model to be over-simplistic, whereas setting it too small adds unintentional noise. Furthermore, in gaining understanding of the ability of the lasso procedure to arrive at a sparse solution vector, it is useful to demonstrate the procedure in a simple example. Therefore, the next two subsections provide insight into the sparse solution property which makes the lasso estimator so appealing and set out the cross-validation method for finding the optimal penalty λ.

2.2.1 Sparse solutions

To explain why the lasso has the property of setting coefficients to zero, and thus yielding sparse solutions, the graphical representation given in Figure 2 is insightful. It compares the lasso estimator to an estimator which uses a quadratic constraint, better known as the ridge regression, to show the advantage of using the $\ell_1$ norm as the lasso does.


Figure 2: Estimation picture for the $\ell_1$ norm on the left and the $\ell_2$ norm on the right (Hastie et al., 2015).

The diamond on the left and the circle on the right in Figure 2 around the origin form the constraints $|\beta_1| + |\beta_2| \le t$ and $\beta_1^2 + \beta_2^2 \le t^2$ following the setting of Equation (4), whereas the ellipses form the contours of the residual sum of squares. Both methods find the first point where the contours hit the edge of the constraint, but for the $\ell_1$ norm it can hit a corner where one of the coefficients is zero, whereas for the $\ell_2$ norm this coefficient will always differ slightly from zero. In larger dimensions ($k > 2$) the diamond becomes a rhomboid, which has multiple corners, flat edges and faces, increasing the opportunities for estimated parameters to be zero. Hence a key property of the $\ell_1$ constraint is its ability to yield sparse solutions.

The size of the constraint controls the point at which the residual sum of squares hits its boundary. Clearly, the larger the constraint $t$, the closer the lasso solution lies to the OLS estimate $\hat\beta$, once again showing the importance of the constraint. By considering the Lagrangean problem in Equation (5), the penalty $\lambda$ takes over this role of controlling the lasso solution. The following subsection discusses the cross-validation method for finding the optimal size of the penalty.

2.2.2 Cross-validation

The bound t in the lasso criterion (4) or the penalty λ in (5) are of great importance since they control the complexity of the problem. As stated earlier t and λ do not follow a one to one relation but the following cross-validation procedure provides similar results (Hastie et al., 2015). Since the lasso penalty λ is used in the simulations in the subsequent chapters, this subsection considers cross-validation for λ.


Small values of λ free more parameters and give a better fit to the data, whereas large values of λ lead to a deterioration in the data fit but produce sparser, more interpretable models. Therefore, setting λ can be seen as a trade-off between capturing the underlying signal of the data and overfitting, which also captures the noise in the data. Neglecting interpretability, it is possible to find λ such that it gives the most accurate model for predicting independent test data from the same population. In order to estimate this 'optimal' value of λ, cross-validation can be used (Hastie et al., 2015). The cross-validation procedure is described below.

Randomly divide the data into K > 1 subgroups. Set one of these subgroups aside as a test set and use the other K − 1 as training data. Apply the lasso estimation procedure to the training data for a range of different penalties λ, use each fitted model to predict the response in the test set and compute the prediction error. This process can be repeated K times, where every subgroup plays the role of the test set once, leading to K estimates of the prediction error. Then average these K estimates for each value of λ, resulting in a cross-validation error curve. Subsequently one can pick the penalty λ at which the cross-validation error is lowest, which is the 'optimal' value of λ in the sense explained above.
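A minimal sketch of this procedure in R is given below; the fold count, the λ grid and all object names are illustrative choices, and cv.glmnet automates the same loop.

```r
library(glmnet)

set.seed(2)
n <- 500; k <- 20
X <- matrix(rnorm(n * k), n, k)
y <- as.vector(X %*% c(1, rep(0, k - 1)) + rnorm(n))

K <- 10
folds <- sample(rep(1:K, length.out = n))      # random assignment to K subgroups
lambdas <- exp(seq(1, -6, length.out = 50))    # decreasing grid of penalties
cv_err <- matrix(NA, K, length(lambdas))

for (fold in 1:K) {
  train <- folds != fold
  fit <- glmnet(X[train, ], y[train], alpha = 1, lambda = lambdas,
                intercept = FALSE, standardize = FALSE)
  pred <- predict(fit, newx = X[!train, ], s = lambdas)
  cv_err[fold, ] <- colMeans((y[!train] - pred)^2)   # prediction error per lambda
}

cv_curve <- colMeans(cv_err)                   # cross-validation error curve
lambda_opt <- lambdas[which.min(cv_curve)]     # penalty with the lowest CV error

# Built-in equivalent: cv.glmnet(X, y, nfolds = 10)$lambda.min
```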

The optimal lasso penalty λ found according to the cross-validation method in this subsection plays an important role in the subsequent chapter, where the performance of a simple t-test on parameter significance is examined.

3 Simulation setup and parameter significance with lasso

The concept of lasso estimation is to identify a small number of important parameters from large data sets. A relevant issue is the significance of the parameters that the lasso procedure includes. However, since the lasso estimator is the result of an optimization algorithm, it could conceivably be problematic to test their significance. In order to do so, this study sets up a series of Monte Carlo experiments in a controlled environment to discover the exact behavior of the rejection frequencies when testing the model parameters for significance. The idea is that when it is controlled exactly what data is fed to the algorithm, it is known what results should be produced and, consequently, conclusions can be drawn from the test behavior. The lasso estimates in this chapter are produced using the freely available R-package glmnet from Friedman, Hastie, Simon, and Tibshirani (2016).

In general there are two main practical concerns with hypothesis testing. First, tests may have the wrong size, meaning that the actual probability of rejection of the null hypothesis may be higher or lower than the predefined nominal significance level. Second, the power of a test may be low, meaning that there is a low probability of rejecting the null hypothesis when it should be rejected. However, when using the lasso estimation procedure an additional complication emerges, namely the penalty severity λ in equation (5). Therefore, the interest lies not only in the size and power of a significance test for different true values of the parameters β, but also in the size and power as functions of λ when testing for the true value of the parameters.

As explained in section 2.2.2, a method to determine the penalty in the lasso optimization problem is to use cross-validation. If the resulting penalty is treated as the optimal penalty, it is useful to inspect the size of a test around this optimal penalty. The next section outlines the setup of the simulation used in this chapter, whereas the remaining sections discuss the simulation results.

3.1 Monte Carlo and t-test

Before setting out the Monte Carlo experiments used in this study it is necessary to define the t-test that is used for statistical inference throughout this chapter and the next.

The simplest test of a hypothesis is the t-test as it would be applied in the ordinary least squares (OLS) setting. The t-statistic for testing $H_0: \beta_j = \beta_{j,0}$ is then given by
\[
t_j = \frac{\hat\beta_j - \beta_{j,0}}{\hat\sigma_j} = \frac{\hat\beta_j - \beta_{j,0}}{\hat\sigma \sqrt{(\tilde X' \tilde X)^{-1}_{jj}}}, \tag{6}
\]
see, for instance, Heij et al. (2004). Note the $\tilde X$, which consists of only the columns of $X$ that correspond to the parameters that are nonzero in the lasso estimate. This t-statistic forms the basis of the experiments in this section. Since the lasso solution only corresponds one to one with OLS when the lasso penalty equals zero, the rejection frequency might differ from the predefined rejection frequency, which in this study is set at α = 0.05.

The further setup of the experiments is as follows. For the Monte Carlo simulation, 10,000 replications are used with the data generating process being the classical linear regression model (CLRM) y = Xβ + ε, where ε ∼ N(0, 1), y is an n × 1 vector of the dependent variable, X an n × k matrix of regressors and β a k × 1 vector of parameters, with n = 500 and k = 20. Note that this is not a truly high-dimensional case in the sense that k is rather small and n ≫ k.

The mean of each regressor and the dependent variable are assumed to equal zero, which can be justified by location transformations. The hypothesis to test in this model concerns the true value of the parameter(s) β. Three cases are considered. The first and most primitive is that where β1 equals 1 and all βj with j = 2, ..., k equal zero. It is expected that in this case the simulated rejection frequency is fairly close to the predetermined rejection frequency as long as the penalty is not excessive. In the second case again β1 = 1, but now the βj with j = 2, ..., 6 are random numbers between −0.1 and 0.1 and the βj for j = 7, ..., k equal zero. It is of interest to see whether the test on β1 is affected by how the lasso acts on the small, but significant, β2 to β6. Additionally, correlation in the regressors is a factor that might perturb the t-test; therefore this is included as a third case in the Monte Carlo simulation, with a correlation degree of ρ = 0.75.
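To make the setup concrete, the sketch below runs one cell of such an experiment in R for the first case: for a fixed penalty it draws data from the CLRM, applies the lasso, and evaluates the t-test of Equation (6) on the selected columns. The function, the handling of replications in which β1 is not selected, and the reduced number of replications are illustrative simplifications rather than the thesis code.

```r
library(glmnet)

one_rep <- function(lambda, n = 500, k = 20) {
  beta_true <- c(1, rep(0, k - 1))
  X <- matrix(rnorm(n * k), n, k)
  y <- as.vector(X %*% beta_true + rnorm(n))

  fit <- glmnet(X, y, alpha = 1, lambda = lambda,
                intercept = FALSE, standardize = FALSE)
  bhat <- as.vector(coef(fit))[-1]              # drop the intercept entry
  active <- which(bhat != 0)                    # columns kept by the lasso (X-tilde)
  if (!(1 %in% active)) return(NA)              # target dropped; treated separately here

  Xt <- X[, active, drop = FALSE]
  resid <- y - Xt %*% bhat[active]
  sigma2 <- sum(resid^2) / (n - length(active))
  pos <- which(active == 1)
  se1 <- sqrt(sigma2 * solve(crossprod(Xt))[pos, pos])
  tstat <- (bhat[1] - 1) / se1                  # Equation (6) with H0: beta_1 = 1
  abs(tstat) > qt(0.975, df = n - length(active))
}

rejections <- replicate(1000, one_rep(lambda = exp(-3)))
mean(rejections, na.rm = TRUE)                  # simulated rejection frequency
```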

The results of these simulation experiments are presented in the subsequent sections.

3.2 Significant β1

In this section the k × 1 vector of parameters in the CLRM is defined by β = (1, 0, ..., 0), where k = 20. The interest lies in discovering the rejection frequency of the t-test with the test statistic as given in Equation (6), where in this case j = 1. Subsequently, the null hypothesis is given by H0: β1 = 1 and is tested against Ha: β1 ≠ 1. The nominal significance level is set at α = 0.05. The simulation results are given in Figure 3.

The figure shows the rejection frequency of the t-test as a function of the natural logarithm of the lasso penalty λ. The upper horizontal axis shows the number of parameters included by the lasso estimation procedure as a consequence of the magnitude of the penalty, and the dashed vertical line shows the average optimal penalty according to cross-validation.

The smaller the penalty, the more closely the rejection frequency approaches the nominal significance level. This is a rather intuitive result, since as the penalty decreases the estimator approaches the OLS estimator. Furthermore, the larger the penalty becomes, the higher the rejection frequency. This implies that if the penalty is too large the t-test almost always rejects the null hypothesis while it is true, showing that the lasso estimator is not able to detect the true signal of the data when the chosen penalty is too severe.

Interestingly, at the average optimal penalty according to the cross-validation procedure the t-test over-rejects the null hypothesis. As can be seen from Figure 3, the rejection frequency is about thirty percent at the average optimal penalty. Therefore, even in this simplest case, when the optimal penalty is chosen the standard t-test is not able to reject the null hypothesis with a frequency according to its nominal size.


Figure 3: Rejection frequency of H0: β1 = 1, as a function of log(λ); the top axis gives the number of nonzero variables.

3.3 Significant β1 with nuisance parameters

As described in the simulation setup, the second case introduces five additional small, but significant, parameters. The first parameter remains equal to one, but the second to sixth are determined using a uniform distribution between −0.1 and 0.1. The k × 1 parameter vector in the CLRM is then given by β = (1, γ1, ..., γ5, 0, ..., 0) with γi ∼ unif(−0.1, 0.1) for i = 1, ..., 5. The resulting rejection frequency of the t-test for the null hypothesis H0: β1 = 1 is shown in Figure 4.

The red line in Figure 4 shows roughly the same course as in Figure 3: for a minuscule penalty the rejection frequency about equals the nominal level of 0.05, whereas as the penalty increases the rejection frequency rises to an undesirably high level, as was the case in the previous section. However, there are some differences to be observed in the course of the rejection frequency: it increases more suddenly, from a more or less steady rejection frequency even below the nominal level up to a level of about sixty percent.

When turning to the optimal penalty according to cross-validation, a notable difference arises. The average optimal penalty to be chosen is now smaller, as can be observed from the dashed vertical line. This means that the lasso procedure includes more parameters. More important is the fact that the rejection frequency is much closer to the nominal level of five percent. This shows that if the true model consists of a small number of significant nuisance parameters, the lasso estimate for the parameter of interest might perform better in terms of size for a cross-validated penalty than in the most simple case described in the previous section.

Summarizing, two things can be learned from the comparison of the second case with the first. First, the course of the rejection frequency as a function of the chosen lasso penalty does not depend heavily on whether there is only one significant parameter or whether additional smaller parameters are included in the data generating process, although there are some differences in the suddenness of the increase. Second, the average optimal penalty is adjusted downwards, causing the lasso to include more parameters and, in combination with the more suddenly increasing path, the rejection frequency at the optimal penalty to be lower when additional small parameters are included relative to the case with only one large significant parameter.

Figure 4: Rejection frequency of H0: β1 = 1 with added nuisance parameters, as a function of log(λ); the top axis gives the number of nonzero variables.

3.4 Correlation in regressors

In practice it might very well be the case that correlation within the data plays a role, which is therefore considered in the third case of this simulation study. A correlation of ρ = 0.75 is imposed between the explanatory variables. Furthermore, the true vector of parameters is given by β = (1, 0, ..., 0), as in the first case. Figure 5 shows the rejection frequency of the t-test of the null hypothesis H0: β1 = 1 when the explanatory variables are correlated.


The rejection frequency follows a course comparable to that of the first case, but with a maximum rejection frequency of eighty percent. Observing the number of nonzero variables makes clear that the lasso procedure includes fewer variables around the cross-validated optimal penalty, which in turn is more similar to the first case. When more variables are included, the t-test almost never rejects the null hypothesis, which is a consequence of the high correlation between them. However, at the optimal penalty the rejection frequency is already on the rise and thus over-rejects the null hypothesis, albeit less severely than in the first case. This shows that simply applying the lasso and setting the penalty using cross-validation forms a problem in practice when there might be correlation within the regressors.

To conclude, a large degree of correlation in the data affects the rejection frequency of the null hypothesis using the t-test defined in Equation (6). This effect, however, drives the rejection frequency at the optimal lasso penalty closer to its desired significance level than in the case without correlation, although it is still far off.

Figure 5: Rejection frequency of H0: β1 = 1 with correlated data, as a function of log(λ); the top axis gives the number of nonzero variables.

The Monte Carlo experiment and its results displayed in this chapter show that the most simple lasso procedures, as the practitioner would use them, give grounds for concern. As shown, in general the size properties of a simple t-test on a parameter of interest are poor and may very well lead to wrong conclusions. The lasso procedure does have the appealing property of shrinking the model down to a few parameters, although section 3.3 shows that this may be less effective when there are multiple significant nuisance parameters. However, truly exploiting the benefits of the lasso estimator requires trustworthy statistical inference, so that practical research is able to provide results with confidence. In order to do so, the next two chapters consider more sophisticated methods that build upon the classic lasso estimator and test them according to the same Monte Carlo setup and their ease of application in practice.

4 Rigorous lasso and inference on target coefficients

The first more sophisticated method this study addresses is a lasso estimation procedure following a series of studies mainly by Belloni and Chernozhukov. They propose a feasible and data-driven lasso penalty in Belloni, Chernozhukov, and Hansen (2011), which improves upon the cross-validated severity of the lasso penalty. Furthermore, in Belloni, Chernozhukov, and Hansen (2013) and Belloni, Chernozhukov, and Wei (2013) a method for inference on a target variable in a high-dimensional setting is developed. Both additions to the classical lasso estimator are provided in the R-package hdm.

Since the interest of this study lies in analyzing the rejection frequency as a function of the lasso penalty, fixed values of λ are used. Thus the rigorous lasso is disregarded in the Monte Carlo experiment, although the penalty resulting from the rigorous lasso is plotted along with the results. The focus remains on inference on a target variable; this target variable follows naturally from the setup of this study to be β1 in the data generating process defined in Chapter 3.

The next section describes the inference on a target variable after selection among high-dimensional controls as provided by Belloni, Chernozhukov, and Hansen (2013) and Belloni, Chernozhukov, and Wei (2013). It additionally explains the idea behind the rigorous lasso. Sections 2, 3 and 4 treat the three cases of just a significant β1, additional small but significant β2, ..., β6, and a high degree of correlation in the data. The final section of this chapter moves from the size properties to analyzing the power for some fixed values of the lasso penalty.

4.1 Honest inference on target coefficients and rigorous lasso

Before conducting the Monte Carlo experiments of the previous chapter in this more sophisticated setting, it is important to gain some insight into the working of the estimation procedure and what it adds over the lasso described in Chapter 2.

Although it is not explicitly used in the simulations, the rigorous lasso is treated in the next subsection. The motivation to do so is twofold. First, for each simulation the optimal penalty of the rigorous lasso is computed so that it can be compared to the cross-validated penalty of the previous chapter. Second, if the practitioner were to apply the method of this chapter, it is important to gain some understanding of how the algorithm arrives at its specific penalty. Moreover, the rigorous lasso is used in the application of Chapter 7. The second subsection treats the estimation procedure itself.

4.1.1 Rigorous lasso

To explain the refinement of the rigorous lasso, recall Equation (5), the lasso estimator as defined by Tibshirani (1996) in its Lagrangean form. This definition of the lasso estimator, which is labeled the classic lasso from here on, imposes equal weights on all parameters. It might, however, be optimal to provide more flexibility and define the lasso estimator such that the weights are allowed to differ across parameters.

Chernozhukov, Hansen, and Spindler (2016b) argue for different weights based on the data and define the rigorous lasso estimator as
\[
\hat\beta = \min_{\beta} \frac{1}{n} \sum_{i=1}^{n} (y_i - x_i'\beta)^2 + \frac{\lambda}{n} \|\hat\Psi \beta\|_1, \tag{7}
\]
where $\|\cdot\|_1$ is the $\ell_1$-norm and $\hat\Psi = \mathrm{diag}(\hat\psi_1, \dots, \hat\psi_p)$ is a diagonal matrix consisting of the individual weights or penalty loadings. Allowing for different weights does not make the choice of the penalty less crucial. Cross-validation is still possible, and indeed practically popular; however, Chernozhukov et al. (2016b) argue that it lacks theoretical justification and therefore propose a different method that is data driven, theoretically grounded and feasible. The estimator is therefore more rigorous, hence its name.

Belloni and Chernozhukov (2009) propose the following X-dependent penalty level
\[
\lambda = c \, 2\hat\sigma \, \Lambda(1 - \zeta \mid X), \tag{8}
\]
where
\[
\Lambda(1 - \zeta \mid X) = Q_{1-\zeta}\Big( n \Big\| \tfrac{1}{n} \textstyle\sum_{i=1}^{n} x_i \nu_i \Big\|_\infty \;\Big|\; X \Big); \tag{9}
\]
here $Q_{1-\zeta}(\cdot \mid X)$ is the conditional $(1-\zeta)$-quantile function and the $\nu_i$ are iid $N(0,1)$, generated independently from $X$.

The idea behind this choice of the penalty is the trade-off between controlling the noise and keeping the bias as low as possible. Therefore the penalty should be chosen as small as possible, but just large enough for the noise, $\tilde c\, n \|\tfrac{1}{n}\sum_{i=1}^{n} x_i \nu_i\|_\infty$, to be dominated by the regularization with sufficiently high probability. The probability $(1 - \zeta)$ thus needs to be close to one and $\tilde c > 1$. Setting the penalty as above, with $c > \tilde c$ and with $\Lambda(1 - \zeta \mid X)$ the $(1 - \zeta)$-quantile of $n\|\tfrac{1}{n}\sum_{i=1}^{n} x_i \nu_i\|_\infty$, provides the minimum penalty needed to control the noise. The estimate $\hat\sigma$ is data-driven and the quantity $\Lambda(1 - \zeta \mid X)$ can be obtained by simulation; for a more comprehensive treatment of this proposed penalty the reader is referred to Belloni and Chernozhukov (2009) and Belloni, Chernozhukov, and Hansen (2011).
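A minimal sketch of how this X-dependent penalty can be simulated is given below. The number of simulation draws, the constants c and ζ (c = 1.1 and ζ = 0.05 are common choices in this literature), and the supplied σ̂ are assumptions of the illustration; the hdm package computes its own refined version of this penalty.

```r
# Simulate the penalty of Equations (8)-(9): the (1 - zeta)-quantile of
# n * ||(1/n) sum_i x_i nu_i||_inf over draws of nu ~ N(0, 1), scaled by 2 c sigma_hat.
simulate_rlasso_penalty <- function(X, sigma_hat, c = 1.1, zeta = 0.05, R = 1000) {
  n <- nrow(X)
  draws <- replicate(R, {
    nu <- rnorm(n)
    n * max(abs(crossprod(X, nu) / n))   # n * ||(1/n) X' nu||_inf
  })
  Lambda <- quantile(draws, 1 - zeta)    # conditional (1 - zeta)-quantile given X
  2 * c * sigma_hat * Lambda             # Equation (8)
}

# Example call with simulated regressors and a rough noise level of one
set.seed(6)
X <- matrix(rnorm(500 * 20), 500, 20)
simulate_rlasso_penalty(X, sigma_hat = 1)
```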

The appeal of this data-dependent penalty is that it automatically adapts to highly correlated designs, since it employs less severe penalization in that case (Belloni, Chernozhukov, & Hansen, 2011). This property is attractive since in practice correlation is not uncommon. Furthermore, it is of interest because this behavior should be visible in the simulation case with high correlation, as it should alter the optimal penalty set by the rigorous lasso.

Additionally, Belloni, Chernozhukov, and Hansen (2011) propose what they call the post-lasso estimator. This simply refers to performing OLS on the model that only includes the parameters selected by the rigorous lasso procedure. This extra step is not applied in the simulations of this study but might be easy for the practitioner to use.

Having explained the choice of the penalty in the rigorous lasso, the next subsection explains in more detail the estimation procedure provided by Belloni, Chernozhukov, and Hansen (2013) and how honest inference is conducted on a specific target variable.

4.1.2 Inference on target coefficients and the orthogonality principle

This subsection follows the framework of Belloni, Chernozhukov, and Hansen (2013) and utilizes Neyman's orthogonalization (Neyman, 1959), which is explained in the spirit of this study by Chernozhukov et al. (2014).

Consider inference on the target coefficient $\beta_1$ in the model
\[
y_i = d_i \beta_{1,0} + x_i'\beta_{-1,0} + \varepsilon_i, \qquad E[\varepsilon_i (x_i', d_i)'] = 0. \tag{10}
\]
Here $d_i$ is the target regressor and $\beta_{-1,0}$ stands for $\beta_0$ excluding its first element and thus denotes the vector of nuisance parameters. In general $d_i$ is correlated with $x_i$, resulting in inconsistent estimates of $\beta_{1,0}$ when $y_i$ is simply regressed on $d_i$. The relationship of $d_i$ to $x_i$ can be formulated as
\[
d_i = x_i'\pi_0^d + \rho_i^d, \qquad E[\rho_i^d x_i] = 0. \tag{11}
\]
Simply applying the lasso procedure to acquire estimates of $\beta_{1,0}$ would be erroneous due to the possibility of omitted variable bias resulting from estimating $x_i'\beta_{-1,0}$ in a high-dimensional setting (Chernozhukov et al., 2014). In order to eliminate omitted variable bias, orthogonalized estimating equations for $\beta_1$ are needed.

The idea of Neyman (1959) was to project the score that identifies the target parameter onto the ortho-complement of the tangent space for the nuisance parameter. More specifically, the objective is to find a score $\psi(w_i, \beta_1, \eta)$, where $w_i = (y_i, x_i')'$ and $\eta$ is the nuisance parameter, such that
\[
E[\psi(w_i, \beta_{1,0}, \eta_0)] = 0 \quad \text{and} \quad \partial_\eta E[\psi(w_i, \beta_{1,0}, \eta_0)] = 0. \tag{12}
\]
The second equation in (12) forms the orthogonality condition. Its interpretation is that the estimating equations are insensitive to first-order perturbations of the nuisance parameter $\eta$ near its true value. This ensures that the effect of regularization on the estimates of $\eta_0$ through penalization is sufficiently modest for regular inference on the parameter of interest $\hat\beta_1$.

The estimator $\hat\beta_1$ solves the empirical analog of the first equation in (12). As a consequence of the orthogonality property, $\hat\beta_1$ is first-order equivalent to $\tilde\beta_1$, which solves the infeasible $\frac{1}{n}\sum_{i=1}^{n}\psi(w_i, \tilde\beta_1, \eta_0) = 0$.

Molding the above into the linear setting of this study, it turns out that the orthogonality equations are closely related to the well-known concept of partialling out. Using equation (11) and defining $y_i = x_i'\pi_0^y + \rho_i^y$ analogously, the following relationship can be composed:
\[
\rho_i^y = \beta_{1,0}\, \rho_i^d + \varepsilon_i, \tag{13}
\]
where both $\rho_i^y$ and $\rho_i^d$ result from partialling out the linear effect of $x_i$ on $y_i$ and $d_i$. Note that $E[\rho_i^y x_i] = 0$, so that $\rho_i^y = y_i - x_i'\pi_0^y$ with $x_i'\pi_0^y$ the linear projection of $y_i$ onto $x_i$. After partialling out, $\beta_{1,0}$ is the regression coefficient in equation (13), a result that is known as the Frisch-Waugh-Lovell theorem. The target parameter thus solves
\[
E[(\rho_i^y - \beta_1 \rho_i^d)\, \rho_i^d] = 0. \tag{14}
\]
The resulting score associated with this equation agrees with the orthogonality principle.

The estimated target parameter can now be used in straightforward inference, such as the simple t-test as defined in Chapter 3. The following sections test this appealing property in the Monte Carlo experiment described in Section 3.1. Again the rejection frequencies are obtained as a function of the lasso penalty λ to analyze the size-correctness of inference on the parameter of interest, i.e. whether the nominal rejection frequency of α = 0.05 is attained.
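A minimal sketch of the partialling-out idea behind Equations (13)-(14) is given below. It uses lasso residuals from glmnet rather than the hdm implementation applied in the simulations of this chapter, the data are simulated, and the fixed penalty is an arbitrary illustrative choice.

```r
library(glmnet)

set.seed(3)
n <- 500; k <- 20
X <- matrix(rnorm(n * k), n, k)
d <- as.vector(X %*% rep(0.2, k) + rnorm(n))        # target regressor, correlated with X
y <- as.vector(d + X %*% c(rep(0.5, 5), rep(0, k - 5)) + rnorm(n))   # true beta_1 = 1

lam <- exp(-2)
rho_y <- y - predict(glmnet(X, y, lambda = lam), newx = X)  # partial X out of y
rho_d <- d - predict(glmnet(X, d, lambda = lam), newx = X)  # partial X out of d

# Regression coefficient of Equation (13); its t-statistic can be used as in Chapter 3
fit <- lm(rho_y ~ rho_d - 1)
summary(fit)$coefficients
```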

4.2 Significant β1

This section treats the most simple case, where β1 = 1 and β2,...,20 = 0, and a t-test is carried out on the target parameter β1 after applying the orthogonality principle of the previous subsection. The results are shown in Figure 6, where the red line gives the rejection frequency of the t-test after using the orthogonality principle, whereas the blue line shows the rejection frequency as if the t-test had been used on the classic lasso estimate.

Figure 6 shows that when using the orthogonality principle the t-test performs very well, with a rejection frequency constantly around the nominal level of five percent. The rejection frequency of the simple t-test shows the same behavior as was established in Chapter 3, again performing poorly. Furthermore, the penalty set by the rigorous lasso is similar to the cross-validated penalty.

The size properties of this test are thus rather promising if the true model only consists of a significant target parameter. To see if these results can be extended into more general practical situations the next two sections test the performance of this method in the second and third case as defined in Chapter 3.

Figure 6: Rejection frequency of H0: β1 = 1 using OLS and orthogonality, as a function of log(λ); the top axis gives the number of nonzero variables.

4.3 Significant β1 with nuisance parameters

This section analyzes the simulation results of testing the hypothesis H0: β1 = 1 when there are small but significant nuisance parameters β2, ..., β6. The rejection frequencies of a t-test with and without using the orthogonality principle are plotted in Figure 7. Again the red curve shows the frequency when orthogonality is applied and the blue curve when it is not.

The added nuisance parameters hardly influence the performance of the estimation method of Belloni, Chernozhukov, and Hansen (2013). The red curve is a bit more ragged for a moderately small penalty and does not reject enough for large penalties, but still performs drastically better than the simple OLS-based t-test given by the blue curve. Furthermore, for the case of added nuisance parameters the rigorous lasso sets a higher penalty than cross-validation, but is consistent with the previous section.

Therefore, in the more practically relevant case where there is a significant target variable but the model holds some significant nuisance parameters as well, the size performance does not really suffer. However, different performance might be expected when there is a strong degree of correlation in the data, which is treated in the next section.

Figure 7: Rejection frequency of H0: β1 = 1 with added nuisance parameters, as a function of log(λ); the top axis gives the number of nonzero variables.

4.4 Correlation in regressors

The third case, where β = (1, 0, ..., 0) but the data are strongly correlated with ρ = 0.75, is of special interest for the orthogonality principle. This setup might cause some disturbances in the rejection frequency since, when there is a high degree of correlation, the impact of partialling out the linear effect of the nuisance variables is large.

However, when observing the results in Figure 8 it becomes clear that, again, there is no significant deterioration in the rejection frequency when testing H0: β1 = 1 using the estimates based on the orthogonality principle. Over the whole range of penalty values the rejection frequency stays between 4.5 and 6.2%, which is around the nominal level.

Figure 8: Rejection frequency of H0: β1 = 1 with correlated data, as a function of log(λ); the top axis gives the number of nonzero variables.

Overall, the method using the orthogonality principle shows some bumps for moderate values of the lasso penalty in every case, but performs very well, in terms of size at least. Having established that this method is size-correct, the next step is to analyze its power against alternative parameter values, which the next section accounts for.

4.5 Power curves

As stated above, this section analyzes the power of the t-test after applying the orthogonality principle. The hypothesis tested remains H0: β1 = 1; however, the data generating process is altered such that the true value of β1 ranges between 0.90 and 1.10. Ideally the power rises from its size of 0.05 at β1,0 = 1 to 1 for even small changes in β1,0. To ascertain robustness for different values of the lasso penalty, the power is tested for $\lambda \in \{e^{-1}, e^{-2}, e^{-3}, e^{-4}\}$, which more or less covers the whole spectrum of estimators based on Chapter 3 and the previous sections in this chapter.

The power curves for these penalties are shown in the four subplots of Figure 9. In every subplot three curves are shown: the red curve gives the power in the first case, where β = (1, 0, ..., 0); the blue curve gives the power in the second case, where β = (1, γ1, ..., γ5, 0, ..., 0); and the black curve gives the power in the third case, where the data are correlated with correlation coefficient ρ = 0.75. To provide a benchmark, the gray curve gives the power of the OLS t-test as if the true model were known. This, theoretically, has the best attainable power, and therefore the loss in power of using the lasso to arrive at the model can be assessed by comparison with this curve.

From the curves it is clear that the power of this method is rather robust to the arbitrary choice of λ, which is reassuring for the practitioner. Moreover, the power in the first two cases is close to equal; the power for the second case is slightly shifted to the right when the penalty is high, but this difference is negligible. Both power curves are almost identical to the benchmark, which shows that the power loss of using the lasso is close to zero, a desirable property. In contrast, the black curves show that, although the size remains correct if the variables are correlated, the power drops. The loss in power with correlated data is quite severe compared to the first two cases. Although the imposed correlation may be unrealistically high in this experiment, the data may in reality very well have some degree of correlation. Therefore the power loss provides some ground for concern.


Figure 9: Power curves for the hypothesis H0: β1 = 1 for all three cases; panels (a)-(d) correspond to $\lambda = e^{-1}, e^{-2}, e^{-3}, e^{-4}$.

Summarizing, applying the orthogonality principle as in Chernozhukov et al. (2014) before conducting inference on a target coefficient shows robustness to the setting of the lasso penalty λ. The method is in general size-correct and has good power properties, although correlated data might cause issues. In light of practical applicability it is very useful that Chernozhukov, Hansen, and Spindler (2016a) provide an R-package that is ready to use, so that any practitioner with basic programming affinity may use their estimation algorithm.

A drawback of the method might be that it is most useful in the case of a specific target variable; were one interested in multiple coefficients based on scientific theory, the algorithm has to be applied multiple times with the relevant variables treated as the target variable in turn. Therefore a method that is more general in applying statistical inference, so that for instance variables may be tested jointly, is of interest as well. Caner and Kock (2014) develop a test to do honest inference in such cases. Their method is explained and tested on the criteria of this study in the next chapter.


5 Conservative lasso and honest inference on high dimensional parameters

This chapter analyzes a method for conducting statistical inference that improves upon the most straightforward methodology of applying the standard t-test in a different direction than the orthogonality principle of the previous chapter. Instead of altering the estimation procedure, Caner and Kock (2014) alter the test itself to arrive at more trustworthy results.

Caner and Kock (2014) estimate honest confidence regions using the conservative lasso, which changes the estimation procedure of the classical lasso. In doing so they also provide a feasible, uniformly consistent estimator of the asymptotic covariance matrix of an increasing number of parameters. This estimate forms the basis for a feasible χ2-test of a specific hypothesis.

The next section covers the conservative lasso procedure and explains how the test is derived. The subsequent three sections perform the Monte Carlo experiment of this study set out in section 3.1, but with the t-test replaced by the χ2-test provided by Caner and Kock (2014), and the final section of this chapter analyzes the power of this test for different values of the lasso penalty.

5.1 Asymptotically feasible hypothesis test with conservative lasso

This section follows the study of Caner and Kock (2014), since they provide the χ2-test central to this chapter. The reader is referred to their study for analytical proofs of the results used in this section.

Consider again the CLRM y = Xβ + ε, where β is the k × 1 population parameter vector. Furthermore, it is assumed that the explanatory variables are exogenous. In practice the values of β are to be estimated and it is not known which coefficients are non-zero. The second chapter of this study explained how to use lasso estimation to reduce the model to its non-zero parameters, whereas in the previous chapter the rigorous lasso provided by Chernozhukov et al. (2016b) was used to do so. Caner and Kock (2014) provide yet another alternative for lasso estimation, which they name the conservative lasso. The next subsection briefly treats the conservative lasso and subsection 5.1.2 explains the derivation of their χ2-test.

5.1.1 Conservative lasso

The lasso penalty λ in the classical lasso is set at a specific level for all parameters. As explained in section 4.1.1, it might be optimal to be able to set different weights for different parameters. If different weights are allowed, it is desirable to set higher weights on parameters that are truly zero relative to the weights on parameters that are non-zero; however, the practitioner does not know which parameters are non-zero and which are not.

To resolve this, Caner and Kock (2014) introduce a two-step estimator, related to the adaptive lasso of Zou (2006), which they name the conservative lasso. It is defined as
\[
\hat{\beta} = \arg\min_{\beta}\ \|Y - X\beta\|_2^2 + 2\lambda \sum_{j=1}^{k} \hat{w}_j |\beta_j|, \qquad (15)
\]
where $\hat{w}_j = \lambda_{prec}\,/\,(|\hat{\beta}_{L,j}| \vee \lambda_{prec})$ gives the weight per parameter and $\hat{\beta}_L$ denotes the first-step estimate. The weights $\hat{w}_j$ in the second step are thus obtained using the classical lasso in the first step. The advantage that the conservative lasso has over other two-step lasso estimators is that it does not necessarily exclude parameters that were set to zero in the first step. It is therefore, in a sense, more conservative, hence its name. Furthermore, the weights are designed so that the non-zero coefficients are never penalized more severely than the zero coefficients, which is exactly the property aimed for above.
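As a concrete illustration, the two-step structure of the conservative lasso can be sketched in R with the glmnet package, which accepts per-coefficient penalty weights through its penalty.factor argument. The sketch below is only illustrative: glmnet scales the squared-error loss by 1/(2n) and rescales the penalty factors internally, so the lambda passed here is not numerically identical to the λ in Equation (15), and the choice lambda_prec = lambda/2 is an assumption rather than the tuning rule of Caner and Kock (2014).

```r
library(glmnet)

# Illustrative two-step conservative lasso: a first-step classical lasso
# produces preliminary estimates, which determine per-coefficient weights
# for the second step (passed to glmnet via penalty.factor).
conservative_lasso <- function(X, y, lambda, lambda_prec = lambda / 2) {
  # Step 1: classical lasso
  fit1   <- glmnet(X, y, alpha = 1, lambda = lambda,
                   intercept = FALSE, standardize = FALSE)
  beta_L <- as.numeric(coef(fit1))[-1]            # drop the intercept entry
  # Step 2: weights w_j = lambda_prec / (|beta_L,j| v lambda_prec), so that
  # coefficients estimated away from zero are penalized less heavily
  w    <- lambda_prec / pmax(abs(beta_L), lambda_prec)
  fit2 <- glmnet(X, y, alpha = 1, lambda = lambda,
                 intercept = FALSE, standardize = FALSE, penalty.factor = w)
  as.numeric(coef(fit2))[-1]
}
```

The design choice of interest here is penalty.factor: because the weights enter multiplicatively, coefficients with a large first-step estimate are shrunk less in the second step, while coefficients set to zero in step one keep the full penalty but are not automatically excluded.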

The conservative lasso thus is appealing since it sets the weights more appropriately than the classical lasso, which should result in performance gains. To test whether the conservative lasso is indeed superior to the classical lasso, the simulations in this chapter are performed for both estimation procedures, but first the test is derived.

5.1.2 Conducting inference using desparsification

In order to conduct inference on the estimates produced by the conservative and classic lasso, the idea of desparsification from Van de Geer, Bühlmann, Ritov, Dezeure, et al. (2014) is used. Note that this section follows Caner and Kock (2014), who apply desparsification to the conservative lasso instead of the classical lasso as Van de Geer et al. (2014) do. The idea is that the bias introduced by the shrinkage caused by penalization shows up in the properly scaled limiting distribution of $\hat{\beta}_j$, so that it can be removed before conducting statistical inference.

Define $\hat{W} = \mathrm{diag}(\hat{w}_1, \ldots, \hat{w}_k)$ as the k × k diagonal matrix containing the weights of the conservative lasso. The first order condition of Equation (15) can then be written as
\[
-\frac{1}{n}X'(Y - X\hat{\beta}) + \lambda\hat{W}\hat{\kappa} = 0, \qquad \|\hat{\kappa}\|_\infty \le 1, \qquad (16)
\]

with $\hat{\kappa}_j = \mathrm{sign}(\hat{\beta}_j)$ if $\hat{\beta}_j \neq 0$ for $j = 1, \ldots, k$. This can be rewritten as
\[
\lambda\hat{W}\hat{\kappa} = \frac{1}{n}X'(Y - X\hat{\beta}). \qquad (17)
\]
Then using $Y = X\beta_0 + \varepsilon$ and defining $\hat{\Sigma} = \frac{1}{n}X'X$, the above equation becomes
\[
\lambda\hat{W}\hat{\kappa} + \hat{\Sigma}(\hat{\beta} - \beta_0) = \frac{1}{n}X'\varepsilon. \qquad (18)
\]


As is clear from the final equation above, the matrix $\hat{\Sigma}$ needs to be inverted in order to isolate $\hat{\beta} - \beta_0$. However, when the number of parameters is larger than the number of observations, $\hat{\Sigma}$ is not invertible. The idea then is to approximate an inverse of $\hat{\Sigma}$ and to control the approximation error; denote this approximate inverse by $\hat{\Theta}$. The approximation can be obtained by applying nodewise regression as in Meinshausen and Bühlmann (2006) and Van de Geer et al. (2014) in the conservative lasso setting, as sketched below.
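A rough sketch of such a nodewise construction of $\hat{\Theta}$ follows the generic recipe of Van de Geer et al. (2014): each column of X is lasso-regressed on all other columns. Using a single tuning parameter lambda_node for every node is a simplifying assumption; in practice it is typically chosen per node.

```r
# Rough sketch of nodewise lasso regressions to build an approximate inverse
# Theta_hat of Sigma_hat = X'X/n, following the generic construction in
# Van de Geer et al. (2014). One shared lambda_node is an illustrative
# simplification, not the tuning rule used in the cited studies.
nodewise_theta <- function(X, lambda_node) {
  n <- nrow(X); k <- ncol(X)
  C    <- diag(k)   # row j holds 1 at position j and -gamma_j elsewhere
  tau2 <- numeric(k)
  for (j in 1:k) {
    fit   <- glmnet(X[, -j, drop = FALSE], X[, j], alpha = 1,
                    lambda = lambda_node, intercept = FALSE, standardize = FALSE)
    gamma <- as.numeric(coef(fit))[-1]
    C[j, -j] <- -gamma
    resid    <- X[, j] - X[, -j, drop = FALSE] %*% gamma
    tau2[j]  <- sum(resid^2) / n + lambda_node * sum(abs(gamma))
  }
  diag(1 / tau2) %*% C   # Theta_hat = T^{-2} C
}
```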

Multiplying both sides of Equation (18) with $\hat{\Theta}$ and rewriting gives
\[
\hat{\beta} = \beta_0 - \hat{\Theta}\lambda\hat{W}\hat{\kappa} + \frac{1}{n}\hat{\Theta}X'\varepsilon - \frac{\Delta}{\sqrt{n}}, \qquad (19)
\]
where $\Delta = \sqrt{n}(\hat{\Theta}\hat{\Sigma} - I_k)(\hat{\beta} - \beta_0)$ is the error term that results from approximating the inverse of $\hat{\Sigma}$. Caner and Kock (2014) show that this error is asymptotically negligible. Furthermore, since the bias term $\hat{\Theta}\lambda\hat{W}\hat{\kappa}$ is known, it can be added to both sides of Equation (19), resulting in the following estimator:
\[
\hat{b} = \hat{\beta} + \hat{\Theta}\lambda\hat{W}\hat{\kappa} = \beta_0 + \frac{1}{n}\hat{\Theta}X'\varepsilon - \frac{\Delta}{\sqrt{n}}. \qquad (20)
\]

Hence, for any k × 1 vector $\delta$ with unit length, a central limit theorem applies to $\frac{1}{\sqrt{n}}\delta'\hat{\Theta}X'\varepsilon$, so that in combination with the asymptotically negligible error term $\delta'\Delta$ asymptotic Gaussian inference can be conducted. Note that the desparsified conservative lasso can then be obtained in practice from
\[
\hat{b} = \hat{\beta} + \frac{1}{n}\hat{\Theta}X'(Y - X\hat{\beta}). \qquad (21)
\]
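Equation (21) is straightforward to compute once $\hat{\beta}$ and $\hat{\Theta}$ are available. A minimal sketch, assuming $\hat{\beta}$ comes from the conservative (or classical) lasso and $\hat{\Theta}$ from nodewise regressions as above:

```r
# Minimal sketch of Equation (21): the desparsified estimator b_hat, given a
# lasso estimate beta_hat and an approximate inverse Theta_hat of X'X/n.
desparsify <- function(X, y, beta_hat, Theta_hat) {
  n <- nrow(X)
  as.numeric(beta_hat + (1 / n) * Theta_hat %*% crossprod(X, y - X %*% beta_hat))
}
```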

The asymptotic variance of $\sqrt{n}\,\delta'(\hat{b} - \beta_0)$ is obtained from the asymptotic properties of $\hat{\Theta}$. Caner and Kock (2014) show that under certain assumptions
\[
\frac{\sqrt{n}\,\delta'(\hat{b} - \beta_0)}{\sqrt{\delta'\hat{\Theta}\hat{\Sigma}\hat{\Theta}'\delta}} \overset{d}{\longrightarrow} N(0, 1), \qquad (22)
\]
which is, in part, Theorem 2 from Caner and Kock (2014).

Let $H = \{j = 1, \ldots, k : \delta_j \neq 0\}$ with cardinality $h = |H|$, meaning that H contains the indices of the coefficients that are tested in the hypothesis at hand. Then
\[
\left\| \left(\hat{\Theta}\hat{\Sigma}_{xu}\hat{\Theta}'\right)_{H}^{-1/2} \sqrt{n}\,\bigl(\hat{b}_H - \beta_{0,H}\bigr) \right\|_2^2 \overset{d}{\longrightarrow} \chi^2(h), \qquad (23)
\]
as it is asymptotically a sum of h independent standard normal random variables. This is the test statistic of interest in the simulations conducted in the following section. Note that in this study h = 1, since the interest lies in β1 only, but that a joint hypothesis on several coefficients could be tested in the same way.
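Putting the pieces together, the test statistic in Equation (23) could be computed along the following lines. This is a hedged sketch: the plug-in estimate Sigma_xu = X' diag(e^2) X / n built from the residuals is an assumption about how the "meat" matrix is estimated, and the helper is hypothetical rather than the exact implementation of Caner and Kock (2014).

```r
# Hypothetical sketch of the chi^2 statistic in Equation (23) for testing
# H0: beta_H = beta0_H, given the desparsified estimate b_hat and Theta_hat.
chi2_test <- function(X, y, b_hat, beta0, Theta_hat, H, alpha = 0.05) {
  n        <- nrow(X)
  e        <- as.numeric(y - X %*% b_hat)                 # residuals
  Sigma_xu <- crossprod(X * e) / n                        # sum_i e_i^2 x_i x_i' / n
  Avar     <- Theta_hat %*% Sigma_xu %*% t(Theta_hat)     # Theta Sigma_xu Theta'
  d        <- sqrt(n) * (b_hat[H] - beta0[H])
  stat     <- as.numeric(t(d) %*% solve(Avar[H, H, drop = FALSE]) %*% d)
  crit     <- qchisq(1 - alpha, df = length(H))
  list(statistic = stat, critical = crit, reject = stat > crit)
}
```

With h = 1 this reduces to the squared version of the standard normal statistic in Equation (22), so the χ2(1) critical value of roughly 3.84 corresponds to the usual two-sided five percent level.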


5.2 Significant β1

The simplest case is testing the hypothesis H0: β1 = 1 where β = (1, 0, ..., 0), using the test of Caner and Kock (2014) explained above. Figure 10 shows the rejection frequency of the χ2-test, as defined in Equation (23), for this hypothesis. Again, the horizontal axis shows the natural logarithm of the lasso penalty λ, the vertical axis the rejection frequency, and on top of the graph is the number of nonzero variables corresponding to the penalty at that point. The blue curve gives the course of the rejection frequency using the classical lasso, whereas the red curve does so for the conservative lasso.
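As a rough indication of how such a rejection frequency could be obtained, the loop below combines the helper functions sketched in Section 5.1. The design constants (n, k, the number of replications and the value of λ) are illustrative and need not match the exact setup of this study's Monte Carlo experiment.

```r
# Hedged sketch of a rejection-frequency experiment for H0: beta_1 = 1,
# reusing conservative_lasso(), nodewise_theta(), desparsify() and
# chi2_test() from the sketches above; constants are illustrative only.
set.seed(1)
n <- 100; k <- 20; R <- 500; lambda <- 0.1
beta0      <- c(1, rep(0, k - 1))
rejections <- 0
for (r in 1:R) {
  X <- matrix(rnorm(n * k), n, k)
  y <- as.numeric(X %*% beta0 + rnorm(n))
  beta_hat  <- conservative_lasso(X, y, lambda)
  Theta_hat <- nodewise_theta(X, lambda)
  b_hat     <- desparsify(X, y, beta_hat, Theta_hat)
  out       <- chi2_test(X, y, b_hat, beta0, Theta_hat, H = 1)
  rejections <- rejections + out$reject
}
rejections / R   # empirical rejection frequency at the 5% level
```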

Turning to Figure 10 itself, three things stand out immediately. First, the classical lasso behaves differently than in the previous chapters. Second, the two paths are very similar, and third, they are close to the nominal level of α = 0.05.

The size performance of the classical lasso stands in stark contrast to the previous chapters; however, this can be easily explained. In Chapter 3 the standard OLS t-test is executed, and although the method by Belloni, Chernozhukov, and Hansen (2013) in Chapter 4 improves the honesty of inference, it still uses the t-test, so that the classical lasso performs poorly there. In this chapter the test itself is different and is designed to improve the trustworthiness of the inference; it is therefore also applicable to the classical lasso estimator, not only to the conservative lasso provided by Caner and Kock (2014).

The similarity of the two curves may be explained by the idea that the test works well regardless of how the lasso estimator is obtained. However, a second, less obvious factor caused by the setup of this study plays a role as well. The conservative lasso indeed has a smarter way of attaining weights for the lasso estimator. Yet since the aim of this study is to obtain rejection frequencies as a function of the penalty, the Monte Carlo experiment puts heavy constraints on the freedom of the conservative lasso, so it may be forced to stay close to the classical lasso. Nonetheless, Figure 10 shows that there is a slight difference in the right tail of the curves.

Finally, since the χ2-test of Caner and Kock (2014) undoubtedly does well in attaining the correct size in the simplest case, it is interesting to test whether it sustains its size properties in the two cases treated in the following sections.


[Figure: rejection frequency plotted against log(λ); the top axis reports the number of nonzero variables]

Figure 10: Rejection frequency of χ2-test for H0: β1 = 1.

5.3 Significant β1 with nuisance parameters

This section treats the second case, where five significant nuisance variables are added to the data generating process. The results are shown in Figure 11, where again the blue curve gives the rejection frequency of the χ2-test for the classical lasso and the red curve for the conservative lasso.

As in the first case the rejection frequencies show very similar behavior, but more importantly the added noise does not seem to influence the size properties significantly. The test slightly over-rejects but stays close to the nominal level. The right tail behavior is the same as in the previous section, strengthening the argument that the conservative lasso is more robust to the choice of the lasso penalty.

A difference with the first case is that for the same penalty levels more variables are included in the model. This is an unsurprising consequence of adding nuisance variables, but it shows once again that when more variables play a significant role in the true model, the resulting selected model is more comprehensive.

Having established that the χ2-test of Caner and Kock (2014) is not affected by additional small but significant variables when conducting inference on a coefficient of interest, it remains to test its behavior when the data is strongly correlated. This is treated in the next section.


[Figure: rejection frequency plotted against log(λ); the top axis reports the number of nonzero variables]

Figure 11: Rejection frequency of χ2-test for H0: β1 = 1, with added nuisance parameters.

5.4 Correlation in regressors

The third case, where the data is correlated with ρ = 0.75, is analyzed in this section. The rejection frequencies of the χ2-test are given in Figure 12. Correlated regressors might cause problems since they increase the variance, which may have an unpredictable effect on the approximated Gram matrix; however, since in this study's Monte Carlo experiment the number of parameters is smaller than the number of observations, this should not be problematic here.

The results in Figure 12 show that the χ2-test indeed sustains its size-correctness when the data is heavily correlated. Moreover, both lasso estimators again perform almost identically, with the exception of the right tail.


[Figure: rejection frequency plotted against log(λ); the top axis reports the number of nonzero variables]

Figure 12: Rejection frequency of χ2-test for H0: β1 = 1 with correlated data.

The size properties of the testing procedure in this chapter are appealing, since for all three cases the size is close to the nominal level of five percent. Furthermore, not only does the conservative lasso behave well, the classical lasso estimator also has good size properties. The next step then is to assess the power of the test and compare it to the results of the previous chapter, which is the subject of the following section.

5.5 Power curves

Now that the size-correctness of the χ2-test by Caner and Kock (2014) has been shown in the previous sections, the next step is to analyze its power. It is of interest how it performs in comparison to the t-test on the target variable after partialling out the effect of the nuisance parameters. Therefore the power is simulated in the same setting as in Chapter 4, with the same values for the lasso penalty, except that λ3 is omitted for presentational convenience.

The red curve in Figure 13 shows the first case, where β = (1, 0, ..., 0), the blue curve the second, where β = (1, γ1, ..., γ5, 0, ..., 0), and the black curve shows the power when the data is highly correlated. Again, as a benchmark, the power of the OLS t-test applied to the true model is shown in gray. In addition to Chapter 4, since the classical lasso performs similarly to the conservative lasso in terms of size, it is of interest whether any differences arise in their power. Both are therefore analyzed in this section.
