
Faculty of Economics and Business

Amsterdam School of Economics

Requirements thesis MSc in Econometrics.

1. The thesis should have the nature of a scientific paper. Consequently the thesis is divided into a number of sections and contains references. An outline can be something like (this is an example for an empirical thesis; for a theoretical thesis have a look at a relevant paper from the literature):

(a) Front page (requirements see below)
(b) Statement of originality (compulsory, separate page)
(c) Introduction
(d) Theoretical background
(e) Model
(f) Data
(g) Empirical Analysis
(h) Conclusions
(i) References (compulsory)

If preferred you can change the number and order of the sections (but the order you use should be logical) and the heading of the sections. You have a free choice how to list your references, but be consistent. References in the text should contain the names of the authors and the year of publication, e.g. Heckman and McFadden (2013). In the case of three or more authors: list all names and year of publication for the first reference and use the first name and et al. and year of publication for the other references. Provide page numbers.

2. As a guideline, the thesis usually contains 25-40 pages using a normal page format. All that actually matters is that your supervisor agrees with your thesis.

3. The front page should contain:

(a) The logo of the UvA, a reference to the Amsterdam School of Economics and the Faculty as in the heading of this document. This combination is provided on Blackboard (in MSc Econometrics Theses & Presentations).

(b) The title of the thesis

(c) Your name and student number
(d) Date of submission of the final version

(e) MSc in Econometrics

(f) Your track of the MSc in Econometrics

Comparing Lasso and other shrinkage based estimators

The effects of V-fold cross-validation

Maurice Chin Ten Fung (10617566)

Date: August 13, 2017

Master’s programme: Econometrics

Specialisation: Big Data Business Analytics

Supervisor: dr. N.P.A. (Noud) van Giersbergen

Second reader: prof. dr. F.R. (Frank) Kleibergen

Statement of Originality

This document is written by Maurice Chin Ten Fung, who declares to take full responsibility for the contents of this document. I declare that the text and the work presented in this document are original and that no sources other than those mentioned in the text and its references have been used in creating it. The Faculty of Economics and Business is responsible solely for the supervision of completion of the work, not for the contents.

Contents

1 Introduction
2 Theoretical background
  2.1 Shrinkage based estimation techniques
  2.2 Cross-validation methods for finding λ
3 Methodology
  3.1 Simulations
  3.2 Method comparison and cross-validation
4 Results and analysis
  4.1 Optimal α specification
  4.2 Cross-validation specification
  4.3 Optimal estimation method specification
5 Conclusion
Bibliography

1 Introduction

In order to obtain a greater understanding of a certain system by modeling it, the system must first be quantified. As soon as it is quantified, it is possible to develop an econometric model to help explain the system, study different parts of it or predict its behaviour.

In today's world we depend greatly on these econometric models. They form the basis of every quantitative piece of research, and it is therefore rather important that the models used in different situations are correctly specified. If this is not the case, the conclusions of such research can be flawed. To prevent this from happening, some sort of model selection or regularization criterion is usually imposed on the model. Before we can take a deeper look into these methods, we first have to understand the most basic econometric estimation technique.

The most basic econometric method is Ordinary Least Squares (OLS). To use this method we must have data specified by $(y, X)$, in which $y = (y_1, y_2, \ldots, y_N)$ represents the dependent variable and $X = (x_1, x_2, \ldots, x_p)$ is the matrix of regressor vectors. The model is then as follows:

\[ y = X\beta + \varepsilon \]

where $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ contains the true coefficients of the regressors and $\varepsilon$ is the vector of error terms. The OLS coefficient estimates are obtained by minimizing the residual sum of squares, $\min_b (y - Xb)^T(y - Xb)$, which is solved by $b = (X^TX)^{-1}X^Ty$.
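As a minimal illustration, the closed-form OLS solution can be computed directly in R; the sketch below uses simulated data and purely illustrative variable names:

```r
# Minimal sketch: the closed-form OLS solution b = (X'X)^{-1} X'y on simulated data.
set.seed(1)
N <- 100; p <- 5
X <- matrix(rnorm(N * p), nrow = N)        # regressor matrix
beta <- c(1, 0.5, -0.3, 0, 0)              # illustrative true coefficients
y <- drop(X %*% beta + rnorm(N))           # dependent variable

b_ols <- solve(t(X) %*% X, t(X) %*% y)     # OLS estimate via the normal equations
# Should agree with coef(lm(y ~ X - 1)) up to numerical precision.
```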

As this is the most basic model, it is usually also the starting point for many researchers. However, Tibshirani (1996) states that there are two reasons why data analysts are often not satisfied with the OLS estimates: prediction accuracy and interpretation. Tibshirani states that the prediction accuracy can be improved by shrinking some coefficients towards 0 or setting them to 0, as this sacrifices a little bias in order to reduce the variance of the estimates. To improve the interpretation of the model, he states that some form of variable selection is required. This is especially the case when the set of regressors is rather large and it is possible to determine a subset of the regressors that have the biggest influence on the dependent variable.

Two methods to counter these potential flaws of the OLS model are subset selection and ridge regression. Both of these methods, however, also have their drawbacks. The first method can yield rather variable models, as a small change in the data can result in an entirely different subset of independent variables, which can lower the prediction accuracy. The second method does respond smoothly to such small changes in the data, as it is a continuous method. Ridge regression is a shrinkage method in which coefficients are shrunk towards 0 but never set exactly to 0, which leaves the model difficult to interpret.

Tibshirani (1996) introduced a new method, derived from ridge regression, called the lasso estimator. This is an abbreviation of 'least absolute shrinkage and selection operator'. The lasso combines the good parts of both of the methods mentioned above: it is a shrinkage method, but it also performs variable selection by setting coefficients to 0, and therefore makes the model easier to interpret.

The lasso estimates its parameters by minimizing a criterion that looks similar to the residual sum of squares, but with an added penalty term $\lambda \sum_{j=1}^{p} |\beta_j|$. The factor $\sum_j |\beta_j|$ is also known as an $\ell_1$ penalty. The difference between the lasso and ridge regression lies in the penalty term: where the lasso uses the $\ell_1$ penalty, ridge regression uses an $\ell_2$ penalty, defined as $\sum_j \beta_j^2$. Apart from this difference, either of these shrinkage methods will only perform well with a properly chosen value for λ.

There is, however, not one exact method to determine this value of λ. The method most frequently used in practice is 10-fold cross-validation. Using this method alongside the lasso yields β-estimates that correctly distinguish the signal variables from the noise variables according to James, Witten, Hastie, and Tibshirani (2013). Even though this cross-validation specification is the most frequently used, there are also other capable cross-validation specifications: when we use cross-validation we still have to specify the number of folds V in which we execute the validation. A study of the properties of the cross-validated lasso was carried out by Chetverikov and Liao (2016). One of the results they have shown is that the cross-validated λ performs better than the λ following the Bickel-Ritov-Tsybakov rule as proposed by Bickel, Ritov, and Tsybakov (2009).

Besides ridge regression and the lasso, there is another shrinkage based estimation technique that is rather similar to the aforementioned methods: Zou and Hastie (2005) introduced the Elastic Net estimator. The three estimation techniques all perform differently, so it is quite interesting to find out which of them yields the best model, with the highest prediction accuracy and interpretability. The main focus of this paper is how these three estimation techniques compare to one another in different situations.

As the choice of the value of λ is of considerable importance for the lasso and the other estimation techniques, we also investigate whether the prediction accuracy of the models improves when we opt for a different cross-validation specification.

The rest of the paper is organized as follows. In the next section we will take a deeper look at the lasso estimator and its related estimation techniques. We will also study the cross-validation method used for finding a value of λ. In section 3 we will discuss the methods used to compare the different models and the way we find λ. In section 4 we will describe the results obtained in our experiments. Finally, in section 5 we will summarize the results.

2 Theoretical background

2.1 Shrinkage based estimation techniques

As mentioned before, in this section we discuss the lasso estimator and two related estimation techniques in more detail. Before looking at the lasso estimator itself, we first take a look at the ridge regression method. The ridge regressor estimates its parameters using the following minimization criterion:

\[ \hat{\beta}^{R} = \arg\min_{\beta} \left\{ (y - X\beta)^T(y - X\beta) \right\} \quad \text{subject to} \quad \sum_{j} \beta_j^2 \le t \]

In this minimization t is a tuning parameter for which t ≥ 0 must hold.

The lasso was first introduced by Tibshirani (1996), who wanted to improve the ridge regression method by also having it perform some variable selection. The lasso estimator is defined as follows:

\[ \hat{\beta}^{L} = \arg\min_{\beta} \left\{ (y - X\beta)^T(y - X\beta) \right\} \quad \text{subject to} \quad \sum_{j} |\beta_j| \le t \]

In the lasso method, t is a tuning parameter similar to the one in the ridge regression method.

As stated earlier, the difference between ridge regression and the lasso lies in the penalty terms: ridge regression uses an $\ell_2$ penalty term and the lasso uses an $\ell_1$ penalty term. Even though the difference between the two methods is this small, it changes the results of the lasso estimator relative to the ridge regression quite drastically: within the ridge regression coefficients are pushed towards zero, whereas in the lasso method some coefficients are set exactly to zero. This difference can best be explained by taking a look at Figure 1, where we use p = 2 regressors. On the left-hand side of the figure we see the contour lines of the residual sum of squares, centered around the OLS estimate, and the constraint region of the lasso. On the right-hand side we see the same, but for the ridge regressor. The lasso estimator is the point where a contour line first touches the blue diamond on the left-hand side of the figure. As this shape has corners, the point where the two curves touch will sometimes lie on one of the axes, thereby setting a coefficient to 0. On the right-hand side, however, the blue region contains no corners at all. This implies that the point where the contour line and the constraint region touch will almost never lie on one of the axes, and therefore a coefficient will almost never be set exactly to 0.

Figure 1: Lasso vs. ridge regressor. Contour lines of the residual sum of squares and the constraint regions of the lasso (left) and the ridge regressor (right). Figure reprinted from An Introduction to Statistical Learning by James et al. (2013, p. 222).

This formulation of the ridge regressor and the lasso estimator only contains t as a tuning parameter, whilst we are more interested in a different form of these methods, which contains the parameter λ. In practice this alternative form of either method is generally used, most likely because it is more computationally friendly. These are also the forms that we study in this paper. The ridge regression estimator can now be written as:

\[ \hat{\beta}^{R} = \arg\min_{\beta} \left\{ (y - X\beta)^T(y - X\beta) + \lambda \sum_{j=1}^{p} \beta_j^2 \right\} \]

and for the lasso estimator we can write:

\[ \hat{\beta}^{L} = \arg\min_{\beta} \left\{ (y - X\beta)^T(y - X\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \]
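The geometric argument above can also be made analytically in a standard special case not treated in the text: an orthonormal design with $X^TX = I$. Writing $\hat{\beta}^{OLS}_j$ for the OLS coefficients, the two penalized problems then have the component-wise solutions

\[ \hat{\beta}^{R}_j = \frac{\hat{\beta}^{OLS}_j}{1 + \lambda}, \qquad \hat{\beta}^{L}_j = \operatorname{sign}\!\left(\hat{\beta}^{OLS}_j\right)\left(|\hat{\beta}^{OLS}_j| - \tfrac{\lambda}{2}\right)_+ , \]

so ridge shrinks every coefficient proportionally, whereas the lasso translates each coefficient towards zero and sets it exactly to zero once $|\hat{\beta}^{OLS}_j| \le \lambda/2$.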

If the value of t is chosen to be large enough, the OLS estimate $\hat{\beta}$ will be part of the set of outcomes allowed by the constraint function. This means that the lasso estimator or the ridge regression estimator will be equal to the OLS estimate. The tuning parameter t in the initial formulation can thus be compared to the regularization parameter λ in the practical formulation: a high value of t, which corresponds to a small value of λ, causes both the lasso estimator and the ridge regression estimator to yield the same results as the OLS estimator. We can therefore conclude that it is rather important that the value of λ is chosen well.

Zou and Hastie (2005) state that even though the lasso has shown good performance in many situations, it still has its limitations in certain scenarios. One such scenario is when there are more potential regressors than observations, so p > N. In this case the lasso is not able to select more than N variables. This is an undesirable property when we have a dataset in which many variables can in reality contribute to the dependent variable.

Another scenario is a dataset in which some of the regressors are rather highly correlated. The lasso estimator will pick only one or just a few of these correlated regressors, whilst more of them could be important for the dependent variable.

Due to these shortcomings of the lasso estimator, Zou and Hastie proposed another method, which should perform better in these circumstances. The method they proposed is called the elastic net and looks a lot like both the ridge regression and the lasso. To obtain the elastic net estimator they first obtain the naive elastic net estimator, which is formulated as follows:

\[ \hat{\beta}^{N} = \arg\min_{\beta} \left\{ (y - X\beta)^T(y - X\beta) + \lambda_2 \sum_{j=1}^{p} \beta_j^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| \right\} \]

As can be seen, this naive elastic net estimator minimizes the residual sum of squares plus the penalty terms used in both the ridge regression and the lasso estimator. The naive elastic net is a two-stage procedure: it starts by finding the ridge regression coefficients for fixed values of λ2 and then applies the lasso shrinkage. As a result, the naive elastic net estimator suffers from a double amount of shrinkage. This double shrinkage does not really reduce the variance of the estimator, but it does introduce unnecessary extra bias, and therefore it must be corrected. To correct this, Zou and Hastie rescale the naive estimator, which yields the elastic net estimator:

\[ \hat{\beta}^{EN} = (1 + \lambda_2)\,\hat{\beta}^{N} = (1 + \lambda_2) \arg\min_{\beta} \left\{ (y - X\beta)^T(y - X\beta) + \lambda_2 \sum_{j=1}^{p} \beta_j^2 + \lambda_1 \sum_{j=1}^{p} |\beta_j| \right\} \]

Zou and Hastie state that such a scaling factor preserves the variable selection nature of the naive elastic net estimator, yet it is also the simplest way to undo the excess amount of shrinkage. They chose this rescaling factor given that they normalized their data such that $\sum_i y_i = 0$, $\sum_i x_{ij} = 0$ and $\sum_i x_{ij}^2 = 1$. The elastic net estimator thus uses both λ1 and λ2 and can be seen as a two-step estimator. However, in the end, the values of λ will still have to be chosen.

2.2 Cross-validation methods for finding λ

As stated earlier, the most common procedure to select the value of λ is cross-validation. The philosophy behind cross-validation originates from Larson (1931), who noticed that training an algorithm and testing it on the same dataset yields results that are too optimistic. To get a better idea of the performance of an algorithm, the idea was raised to test it on a new set of data. In practice, however, only a limited amount of data is available, and therefore it is not always possible to test the algorithm on new data. To counter this, people started using cross-validation.

The idea behind cross-validation, as explained by Arlot and Celisse (2010), is to split the data. After splitting the data, a big chunk of it is used to train the algorithm and the leftover data is used as a test set. This is then repeated a certain number of times, depending on the specification of the cross-validation method. Using this method we obtain more than one estimate for the algorithm, which allows us to compute the error of each individual estimate. This way we can find the algorithm with the lowest error. In the case of the lasso and the elastic net, we can find the values of λ which achieve the lowest mean squared error when estimating the model.

In this paper we will be using V-fold cross-validation and leave-one-out cross-validation. V-fold cross-validation is the form of cross-validation in which the data gets split into V parts. After splitting the data into V parts, V − 1 parts form the training set upon which the algorithm gets trained, and the last part is the test set. This process is repeated V times, so that every chunk of data is the test set once. After having repeated the training and testing of the algorithm V times, we can choose the model which has the lowest squared prediction error over the different test sets. One of the most commonly used forms of V-fold cross-validation is 10-fold cross-validation.

Since we are interested in finding the regularization parameter λ for models like the lasso, we take a look at the study by Chetverikov and Liao (2016), in which V-fold cross-validation is applied to the lasso model. To apply V-fold cross-validation they used the following formula:

\[ \hat{\beta}_{-v}(\lambda) = \arg\min_{\beta \in \mathbb{R}^p} \left( \frac{1}{N - N_v} (y_{-v} - X_{-v}\beta)^T(y_{-v} - X_{-v}\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \right) \]

In this formula $y_{-v}$ and $X_{-v}$ are the observations that are not included in test set $v$, and $N_v$ is the number of observations in test set $v$. To find the cross-validation choice of λ we use the following formula:

\[ \hat{\lambda} = \arg\min_{\lambda \in \Lambda_n} \sum_{v=1}^{V} (y_v - X_v\hat{\beta}_{-v}(\lambda))^T(y_v - X_v\hat{\beta}_{-v}(\lambda)) \]

where $\Lambda_n$ is a set of candidate values for λ and $y_v$ and $X_v$ represent the observations in test set $v$. What happens in the two formulae above is that we use a predetermined set of candidate values for λ and pick the cross-validated $\hat{\lambda}$ from it. This is done by checking for which value of λ from the candidate set we find the lowest squared prediction error using the cross-validation estimator. This cross-validation estimator leaves out the test set observations to obtain an estimate for β.

The lasso estimator with the cross-validated value of λ is then given by $\hat{\beta}^{L}(\hat{\lambda})$, i.e. the original lasso estimator of Section 2.1 evaluated at the cross-validated value of λ.
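As an illustration of this procedure, a minimal R sketch using the glmnet package (introduced formally in Section 3.2) is given below; the data-generating step and all object names are purely illustrative:

```r
library(glmnet)

set.seed(1)
N <- 100; p <- 40
X <- matrix(rnorm(N * p), nrow = N)                  # illustrative regressors
y <- drop(X %*% c(0.8, 0.5, rep(0, p - 2)) + rnorm(N))

# V-fold cross-validation for the lasso: cv.glmnet fits the lasso over a grid of
# candidate lambda values and records the cross-validated prediction error.
cv_fit <- cv.glmnet(X, y, alpha = 1, nfolds = 10)    # alpha = 1 selects the lasso

lambda_hat <- cv_fit$lambda.min                      # lambda with the lowest CV error
beta_hat   <- coef(cv_fit, s = "lambda.min")         # lasso coefficients at lambda_hat
```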

The other cross-validation method we are going to discuss is the leave-one-out method. This method, unlike V-fold cross-validation, does not split the dataset into a set number of chunks. The leave-one-out method, as the name suggests, takes the entire dataset except one observation as the training set and uses that single observation as the test set. We can consider the leave-one-out cross-validation method as a special case of V-fold cross-validation: if we take V = N we obtain the leave-one-out estimator, as each individual observation is used as the entire test set once. Using this method we are likely to find the model with the lowest error within the dataset, but it is computationally heavy compared to V-fold cross-validation with smaller V. It therefore remains to be seen how significant the improvement is with respect to the V-fold cross-validation method.

In the next section we discuss the different scenarios in which we compare the ridge regression, the lasso and the elastic net with one another, and how we compare these methods in each scenario. Lastly, we discuss the methods we use to investigate whether the model accuracies improve if we change the cross-validation specification.

3 Methodology

3.1 Simulations

In order to obtain a good impression of the performances of the three different estimation techniques, it is important that they all get tested in more than one situation. Therefore we are going to produce three different kinds of data generating processes (DGPs), inspired by Chetverikov and Liao (2016), and use each of these three DGPs with six combinations of number of parameters and training set size.

In each DGP we generate the covariate vectors X from a Gaussian distribution with mean zero. The error term is simulated as ε ∼ N(0, 1). In the true model we have the following β vector:

\[ \beta = (0.8,\ 0.7,\ 0.6,\ 0.5,\ 0.4,\ 0.3,\ 0.2,\ 0.1,\ 0_{1\times(p-8)})^T \]

The true model is then given by y = Xβ + ε.

The difference between the three DGPs will be that in DGP 1 the regressors will not be correlated between themselves, in DGP 2 they will be slightly correlated and in DGP 3 there will be high correlation amongst the regressors. We choose these differences to be able to determine whether or not the elastic net does outperform the lasso when there is a certain amount of correlation between the regressors.

As already mentioned, the three DGPs differ in the amount of correlation among the regressors. For DGP 1 this implies that the variance-covariance matrix is specified as $E(x_i x_j) = 1$ for all $i = j$ and $E(x_i x_j) = 0$ for all $i \neq j$. For the second DGP, in which there is only a slight amount of correlation between the regressors, the variance-covariance matrix is specified as $E(x_i x_j) = 0.25^{|i-j|}$ for all $i, j$. Finally, in the third DGP, the variance-covariance matrix is given by $E(x_i x_j) = 0.75^{|i-j|}$ for all $i, j$.

As for the six combinations of the number of parameters and the training set size, we use $N_T = \{100, 400\}$ and $p = \{40, 100, 400\}$. We use these combinations to be able to assess the estimation methods both when the number of regressors is small relative to the training set size and when it equals or exceeds it.
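A sketch of how a single draw from one of these DGPs could be generated in R is given below (using MASS::mvrnorm for the correlated regressors; the helper name simulate_dgp is illustrative and not taken from the thesis):

```r
library(MASS)  # provides mvrnorm for multivariate normal draws

# One draw from a DGP with E(x_i x_j) = rho^|i-j|:
# rho = 0 corresponds to DGP 1, rho = 0.25 to DGP 2 and rho = 0.75 to DGP 3.
simulate_dgp <- function(n, p, rho) {
  Sigma <- rho^abs(outer(1:p, 1:p, "-"))            # Toeplitz correlation matrix
  X     <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  beta  <- c(0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, rep(0, p - 8))
  y     <- drop(X %*% beta + rnorm(n))              # errors are N(0, 1)
  list(X = X, y = y, beta = beta)
}

set.seed(1)
train <- simulate_dgp(n = 100, p = 40, rho = 0.25)  # e.g. DGP 2 with N_T = 100, p = 40
```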

3.2 Method comparison and cross-validation

In order to apply the three estimation methods to these DGPs we make use of the statistical software package R, and in particular of the glmnet package. This package was created by Friedman, Hastie, and Tibshirani (2010) and is currently maintained by Hastie. The glmnet package allows us to easily estimate not only the lasso, but also the ridge regression and the elastic net. In order to cover all three techniques it solves a slightly different problem: it minimizes the following criterion (Hastie and Qian, 2014):

\[ \frac{1}{2N} \sum_{i=1}^{N} (y_i - x_i^T\beta)^2 + \lambda \left[ \frac{1 - \alpha}{2} \sum_{j} \beta_j^2 + \alpha \sum_{j} |\beta_j| \right] \]

In this criterion α is a parameter defined as $\alpha = \lambda_1/(\lambda_1 + \lambda_2)$, so that $0 \leq \alpha \leq 1$. This way the criterion corresponds to either the ridge regression (α = 0), the lasso (α = 1) or the elastic net (0 < α < 1). To minimize this criterion function the glmnet package uses the coordinate descent method.
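The sketch below illustrates how the three estimators correspond to different values of α in glmnet; it assumes the train object from the DGP sketch in Section 3.1, and the lambda value passed to coef is purely illustrative:

```r
library(glmnet)

# The same elastic-net criterion, with alpha switching between the three methods:
fit_ridge   <- glmnet(train$X, train$y, alpha = 0)    # ridge regression
fit_lasso   <- glmnet(train$X, train$y, alpha = 1)    # lasso
fit_elastic <- glmnet(train$X, train$y, alpha = 0.5)  # elastic net

# Each fit contains a whole path of solutions over a decreasing grid of lambda values;
# the coefficients at a particular lambda are extracted with coef(fit, s = lambda).
coef(fit_lasso, s = 0.1)
```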

Before we use this method we sample our dataset into a training set and a test set. We do this so that we can achieve more reliable values of the MSE for different values of α. As already mentioned, our training set will be of size $N_T = \{100, 400\}$; to obtain this training set we first generate a much larger set of $N = 10000 + N_T$ observations, which is then split into the training set and the test set.

Now that we have split our observations into a training set and a test set, we can compute the MSE of the various α values on the test set. The values of α to be tested are α = {0, 0.1, 0.2, . . . , 1}, and we compute the MSE for each α for each $\{N_T, p\}$ combination. Due to the random nature of the sampling of the training set, we select the best α by a form of cross-validation: we resample the training set 50 times and calculate the MSE of the estimators on the test set. We then choose the value of α which has the lowest mean MSE over all 50 replications. After we find the α for which the glmnet package yields the lowest MSE in each $\{N_T, p\}$ combination, we continue with only this value of α.
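A condensed sketch of this selection loop is given below, reusing the simulate_dgp helper from the earlier sketch; the number of replications is reduced and the loop is a simplified stand-in for the procedure described above, not the exact code used in the thesis:

```r
library(glmnet)

alphas <- seq(0, 1, by = 0.1)
n_rep  <- 5                                  # the thesis uses 50 replications
mse    <- matrix(NA, n_rep, length(alphas))

set.seed(1)
test <- simulate_dgp(n = 10000, p = 40, rho = 0)       # large test set (DGP 1)
for (r in 1:n_rep) {
  train <- simulate_dgp(n = 100, p = 40, rho = 0)      # resampled training set
  for (a in seq_along(alphas)) {
    cv   <- cv.glmnet(train$X, train$y, alpha = alphas[a])
    pred <- predict(cv, newx = test$X, s = "lambda.min")
    mse[r, a] <- mean((test$y - pred)^2)
  }
}
best_alpha <- alphas[which.min(colMeans(mse))]         # alpha with lowest mean test MSE
```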

Now that we have determined which value of α is ideal for each $\{N_T, p\}$ combination, we are going to determine which value of V is the optimal one. To do so, we once more use the glmnet package. This package has a built-in cross-validation algorithm in which we only have to specify the value of V.

To choose the optimal V we start by defining a set of possible V values. For $N_T = 100$ we defined the set as V = {5, 10, 15, 20, . . . , 100} and for $N_T = 400$ we defined it as V = {10, 20, 30, 40, 100, 200, 300, 400}. For the latter we chose bigger steps in order to decrease the computation time, which is quite high for large values of V. For $N_T = 100$ we chose to start with 5-fold cross-validation, just like Chetverikov and Liao (2016), and subsequently increase the number of folds in steps of 5.

To determine which value of V yields the lowest MSE we go through the same process as we did to compute the optimal α. As it is possible that different values of V yield the lowest MSE for a certain $\{N_T, p\}$ combination, we calculate the MSE multiple times after resampling the training set. However, due to the longer computation times needed to find the best V, we do not replicate this calculation 50 times as we did for the optimal α, but 25 times for each $\{N_T, p\}$ combination. Following the way we chose the optimal value of α, we choose the optimal value of V by selecting the V which yields the lowest mean MSE over all 25 replications.
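Since the number of folds is just the nfolds argument of cv.glmnet, the comparison over V can be sketched as follows (again with a reduced set of fold values and replications, reusing the simulate_dgp helper; this illustrates the procedure rather than reproducing the thesis code):

```r
library(glmnet)

folds <- c(5, 10, 20, 50, 100)               # reduced set of candidate V values
n_rep <- 5                                   # the thesis uses 25 replications
mse_v <- matrix(NA, n_rep, length(folds))

set.seed(1)
test <- simulate_dgp(n = 10000, p = 40, rho = 0)
for (r in 1:n_rep) {
  train <- simulate_dgp(n = 100, p = 40, rho = 0)
  for (k in seq_along(folds)) {
    cv   <- cv.glmnet(train$X, train$y, alpha = 1, nfolds = folds[k])
    pred <- predict(cv, newx = test$X, s = "lambda.min")
    mse_v[r, k] <- mean((test$y - pred)^2)
  }
}
best_V <- folds[which.min(colMeans(mse_v))]  # V with the lowest mean test MSE
```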

Once we have found the estimation specification which yields the lowest MSE, we compare this specification with the OLS method. As it is likely that our best specification will have performed some variable selection, it is only fair to use exactly the same variables in the OLS method. The OLS method is therefore applied to a modified X matrix which only contains the variables selected by the shrinkage estimator. This allows a fairer comparison of the two methods than if we had used the full X matrix for the OLS method.
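This refitting step can be sketched as follows, assuming a fitted cv.glmnet object cv_fit and the train/test objects from the earlier sketches (all names illustrative):

```r
# Indices of the covariates with non-zero shrinkage coefficients
# (the first element of the coefficient vector is the intercept, hence [-1]).
b_shrink <- as.matrix(coef(cv_fit, s = "lambda.min"))[-1, 1]
selected <- which(b_shrink != 0)

# Refit OLS using only the selected columns of X and compare test MSEs.
ols_fit  <- lm(train$y ~ train$X[, selected])
pred_ols <- cbind(1, test$X[, selected]) %*% coef(ols_fit)   # manual prediction
mse_ols  <- mean((test$y - pred_ols)^2)
```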

In the next chapter we present the results of the simulation studies obtained using the methodologies and techniques discussed in this chapter. These results are also analyzed and discussed extensively.

4 Results and analysis

In this chapter we present the results. We start by discussing the results for the various values of α which we found for all three DGPs in every $\{N_T, p\}$ combination. Next we take a look at the values we found for V. Finally we compare the results of the estimation method with the optimal α and V with the OLS method.

4.1 Optimal α specification

The first step towards finding the best specified estimation technique is finding the optimal value of α for the dataset. We have computed four measures to determine which value of α is optimal. The first measure, denoted Min(mean), is the value of α for which we found the lowest mean MSE over all 50 replications. This measure can also be presented graphically, as in the following graph:

Figure 2: Mean MSE per α over 50 replications (x-axis: α; y-axis: average MSE over the replications).

In this graph we see the mean of all 50 MSEs per value of α in DGP 1 for $\{N_T, p\} = \{100, 40\}$. The same graph for the other situations can be found in the appendix.

The second measure is the Mean(min) value. This value is computed by taking the mean of the values of α which yield the lowest MSE in each replication. This can also be shown in the following histogram:

Figure 3: Histogram of how often each α value yields the lowest MSE.

Like the previous figure, this figure was generated for the $\{N_T, p\} = \{100, 40\}$ combination of DGP 1. Once more, the same graph for the other situations can be found in the appendix.

The third measure can also be derived from the histogram shown above: it is the value of α which yields the lowest MSE most frequently, which in the above histogram is clearly α = 1.

The fourth and final measure, denoted Abs. min, is the value of α which yields the single lowest MSE across all of the 50 replications.

In the following table all four measures can be found for all $\{N_T, p\}$ combinations for DGP 1. In the end we chose the first measure, Min(mean), as the decisive measure upon which we decided which value of α is best in each situation.

Table 1: Alpha value for DGP 1

{N_T, p}        {100,40}     {100,100}    {100,400}    {400,40}     {400,100}    {400,400}
Min(mean)       1            1            0.8          0.9          1            0.9
Mean(min)       0.754        0.864        0.858        0.734        0.78         0.85
Most freq. min  1            1            0.9          0.9          0.7          0.9
Abs. min        1            0.9          1            0.8          1            0.9
Best method     Lasso        Lasso        Elastic Net  Elastic Net  Lasso        Elastic Net

From this table we conclude that in 3 of the 6 situations the Lasso was chosen as the best method and in the other 3 situations the Elastic Net came out as the best method. We also immediately see that if we had based the decision on one of the three other measures, we would have obtained completely different results. However, we think that, considering the fact that we are cross-validating, the first measure is the most reliable option: if we were to take the second measure as the decisive measure, for example, we would never actually obtain the Lasso estimation method, which would not be correct. We also notice that when $p \gg N_T$ the Elastic Net method performs best, which is in line with the theory.

In the following table we find the four measures for DGP 2. The results should indicate that the Elastic Net performs better than the Lasso, as there is a small amount of correlation in DGP 2.

Table 2: Alpha value for DGP 2

{N_T, p}        {100,40}     {100,100}    {100,400}    {400,40}     {400,100}    {400,400}
Min(mean)       1            0.9          0.9          0.7          1            1
Mean(min)       0.694        0.806        0.856        0.724        0.718        0.848
Most freq. min  0.7          1            0.9          0.7          0.7          0.8
Abs. min        0.6          1            1            0.8          0.9          0.8
Best method     Lasso        Elastic Net  Elastic Net  Elastic Net  Lasso        Lasso

We do, however, notice that there are still only 3 situations in which the Elastic Net performs best, whilst in the other 3 situations the Lasso performs best. This is once more due to the fact that we used the first measure as the decisive measure. If we take a look at the second measure, we see that for DGP 2 each value of α lies a little further away from 1 than in DGP 1, which indicates that the Lasso performs a little worse in each situation. We still see that when $p \gg N_T$ the Elastic Net performs best, but now we also see the Elastic Net performing best in the {400, 40} situation.

Table 3: Alpha value for DGP 3

{N_T, p}        {100,40}     {100,100}    {100,400}    {400,40}     {400,100}    {400,400}
Min(mean)       0.7          1            0.8          0.4          1            0.8
Mean(min)       0.602        0.62         0.674        0.568        0.648        0.666
Most freq. min  1            0.7          0.6          0.7          1            0.6
Abs. min        0.6          0.6          0.6          0.3          0.8          1
Best method     Elastic Net  Lasso        Elastic Net  Elastic Net  Lasso        Elastic Net

In Table 3 we see the results for DGP 3. There are now 4 situations in which the Elastic Net performs better, which is a logical result given the theory that the Elastic Net performs better when the regressors are correlated. We also see that the trend in the second measure continues: in each situation the second measure is now lower than it was in DGP 2.

As we used the first measure as the decisive measure, we will from now on continue with only these values of α in each situation. Now that we know the values of α, we can go on and find the ideal values of V.

4.2 Cross-validation specification

In order to obtain the optimal value of V we used the same four measures, and we chose the Min(mean) measure as the decisive one, just like we did to find the optimal α.

Table 4: Optimal V value for DGP 1

{N_T, p}        {100,40}  {100,100}  {100,400}  {400,40}  {400,100}  {400,400}
Min(mean)       90        100        55         200       300        200
Mean(min)       16.2      17.8       35.4       30.4      34.4       22
Most freq. min  5         5          10         10        20         20
Abs. min        15        5          80         10        30         20

In the table shown above we find the results for DGP 1. What we immediately notice is that there is quite a big difference between the first measure and the three other measures. This big difference is most likely due to replicating the calculation only 25 times. We also notice that in none of the situations did V = 10 come out as best, even though this is the most widely used cross-validation specification. It is also clear that when the training set is bigger, it pays off in absolute terms to have more splits; relatively speaking it is the other way around.

Table 5: Optimal V value for DGP 2

{N_T, p}        {100,40}  {100,100}  {100,400}  {400,40}  {400,100}  {400,400}
Min(mean)       60        25         70         100       300        300
Mean(min)       19.6      18.6       28         23.2      42.4       35.2
Most freq. min  5         10         25         10        20         20
Abs. min        5         20         10         30        30         30

In Table 5 the results for DGP 2 are shown. We still notice that in none of the situations is V = 10 the best choice. The results look quite similar to those of DGP 1.

Table 6: Optimal V value for DGP 3

{N_T, p}        {100,40}  {100,100}  {100,400}  {400,40}  {400,100}  {400,400}
Min(mean)       35        80         45         40        300        30
Mean(min)       20.4      15.4       25.4       18        26.8       18.5
Most freq. min  5         5          5          10        10         10
Abs. min        30        10         45         10        10         20

The results for DGP 3 are given in Table 6. What stands out in this table is that the optimal V values seem to lie quite a bit lower than in the other two DGPs. In particular the results for $\{N_T, p\} = \{400, 40\}$ and $\{N_T, p\} = \{400, 400\}$ seem to be on the low side compared to the earlier results.

Now that we have found which value of V is optimal, we have not yet answered the question of how each value of V compares to 10-fold cross-validation. In order to compare these values we have constructed the following graph:

Figure 4: Percentage change in MSE for each V value (x-axis: V; y-axis: percentage change in MSE relative to V = 10).

In this graph we find the percentage change of the average MSE for each value of V with respect to V = 10. The graph shows the $\{N_T, p\} = \{100, 400\}$ situation for DGP 1. We see that V = 5 yields a slightly higher average MSE, just 0.5 percent more than the average MSE of V = 10. From the graph we can clearly conclude that there is a downward trend in the average MSE as V increases.

To show the effects of increasing V when the training set is larger, we also look at one of the $N_T = 400$ situations:

Figure 5: Percentage change in MSE for each V value (x-axis: V; y-axis: percentage change in MSE relative to V = 10).

In this graph the downward trend is not so clear. The average MSE certainly decreases when V increases, but it does not keep decreasing all the way up to the leave-one-out specification.

Even though it is clear that the average MSE is decreasing in V, we are still not sure whether increasing V is worth the extra time. As we see from the graphs, the difference between the average MSE for V = 10 and higher values of V is at most 1.5 percent, which is not a particularly big difference, even though the computation time increases significantly.

4.3 Optimal estimation method specification

Now that we have determined the optimal α and V for each situation in every DGP, we are able to estimate the model under this specification. After fitting the best method we are left with a set of λ values which yield different models.

Figure 6: MSE for different values of λ for the optimally specified model (lasso), with the number of selected variables shown along the top of the plot.

In this graph we see the MSE of the lasso estimator for $\{N_T, p\} = \{100, 40\}$ for DGP 1. The black dot represents the MSE of the final model, which is chosen by moving one standard error away from the minimal MSE. The minimal MSE is indicated by the first vertical line and the second vertical line marks the MSE one standard error further. This value of λ is chosen because it yields a low MSE whilst not selecting too many variables for the model. The numbers on top of the graph denote how many variables are left in the model for the corresponding value of λ; we see that at the minimal MSE this number is around 18, which is much larger than the 9 at the second vertical line.
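In glmnet this is the built-in one-standard-error rule; assuming a fitted cv.glmnet object cv_fit as in the earlier sketches, the corresponding quantities can be inspected with:

```r
plot(cv_fit)                          # MSE against log(lambda), as in Figure 6
cv_fit$lambda.min                     # lambda at the minimal cross-validated MSE
cv_fit$lambda.1se                     # lambda one standard error away (the final model)
coef(cv_fit, s = "lambda.1se")        # coefficients of the final, sparser model
```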

After calculating the optimal α and V for each situation we ended up with the theoretically best models. We now compare the results of these optimal models with the OLS model, where OLS is applied to the same subset of variables that we ended up with in the optimal model.

Table 7: Model estimation results for DGP 1

{N_T, p}           {100,40}  {100,100}  {100,400}  {400,40}   {400,100}  {400,400}
# NZ regressors    9         10         13         10         8          6
MSE optimal model  1.3806    1.41197    1.88591    1.103655   1.084202   1.205926
MSE OLS model      1.18643   1.14915    1.694469   1.05064    0.99055    1.06262

In this table we find the number of non-zero coefficients produced by our shrinkage based estimator, followed by the MSE of the model estimated using the optimal α and λ found earlier and the MSE of the OLS model. What stands out is that in every situation the OLS method yields a lower MSE. The good performance of the OLS model can be explained by the fact that the shrinkage method already performed a good variable selection: in the first row of the table we see that the number of non-zero coefficients is never far off from the true 8 non-zero coefficients, and therefore the OLS method is able to estimate rather accurate coefficients. In Tables 8 and 9 below we find the same results for DGP 2 and DGP 3.

Table 8: Model estimation results for DGP 2

{N_T, p}           {100,40}  {100,100}  {100,400}  {400,40}   {400,100}  {400,400}
# NZ regressors    6         7          8          7          7          8
MSE optimal model  1.311563  1.350012   2.067132   1.185768   1.080456   1.161193

Table 9: Model estimation results for DGP 3

{N_T, p}           {100,40}  {100,100}  {100,400}  {400,40}   {400,100}  {400,400}
# NZ regressors    6         6          7          7          9          7
MSE optimal model  1.209199  1.215099   1.337173   1.120776   1.034905   1.109235
MSE OLS model      1.07383   1.05593    1.07118    1.03726    1.01795    1.02605

For DGP 2 and DGP 3 we notice results similar to those for DGP 1: the MSE of the OLS model is always lower than the MSE of the shrinkage based estimator. We now also notice that in almost every situation we end up with fewer non-zero coefficients than there are in the true model. This is probably due to the correlation between the regressors, which causes some of the regressors not to be picked up by the variable selection procedure.

5 Conclusion

In this paper we focused on shrinkage based regression estimators as an alternative to the standard OLS regression method. Tibshirani (1996) was the first to define the lasso estimator as an adapted version of the ridge regressor. These shrinkage based estimators shrink coefficients towards zero and the lasso even performs variable selection. As an alternative to the lasso and the ridge regressor, Zou and Hastie (2005) proposed the elastic net, which is a combination of the two.

The three shrinkage based regression methods all perform differently from one another, and we tried to find out which performed best in which kind of situation. What we found is that the ridge regressor is always outperformed by either the lasso or the elastic net. In order to create different situations we used three different DGPs, where the first had zero correlation between the regressors, the second had low correlation and the third had high correlation. We also used different training set sizes and different numbers of regressors.

After determining which α value, and thus which shrinkage method, yielded the best fit for each separate situation, we found that the elastic net clearly performs best when the number of regressors is greater than the number of observations in the training set. Ultimately, the lasso seems to perform just as well as the elastic net, as they ended up being the best method in almost the same number of situations.

All three shrinkage based estimators depend on a regularization parameter λ, and therefore it is important how we compute this parameter. In order to obtain an accurate value of λ we used V-fold cross-validation. We calculated the prediction accuracy for different values of V in the same situations we used to find the best α.

We noticed that the MSE is decreasing in V for $N_T = 100$, while for $N_T = 400$ it decreases at first but then rises back up. Seeing this, we could conclude that it is better to increase V from the most popular choice V = 10, but the MSE only decreases by a small amount, with 5 percent being the largest decline, measured at $\{N_T, p\} = \{100, 400\}$ for DGP 2. What does increase significantly, however, is the computation time when we increase V. Therefore, it might not be worth increasing the value of V for the better prediction accuracy.

After we found both the optimal α and V values, we ended up with the optimal shrinkage based regression model for each situation. We found that when the correlation between the regressors gets bigger, the number of covariates selected by the shrinkage method decreases. In DGP 2, with low correlation between the regressors, we found the number of non-zero coefficients to be closest to the true number of non-zero coefficients, whilst in DGP 1 there are too many non-zero coefficients and in DGP 3 there are too few.

Now that we had the optimal shrinkage based regression models, we were able to compare them with OLS models. We found that the OLS models were more accurate than the shrinkage based models. This is most likely due to the fact that we only used the covariates selected by the shrinkage method in the OLS models.

Bibliography

Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics surveys, 4 , 40–79.

Bickel, P. J., Ritov, Y., & Tsybakov, A. B. (2009). Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, 1705–1732.

Chetverikov, D., & Liao, Z. (2016). On cross-validated lasso. arXiv preprint

arXiv:1605.02214 .

Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33 (1), 1.

Hastie, T., & Qian, J. (2014). Glmnet vignette. Technical report, Stanford.

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 6). Springer.

Larson, S. C. (1931). The shrinkage of the coefficient of multiple correlation. Journal of Educational Psychology, 22 (1), 45.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267–288.

Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67 (2), 301– 320.

Appendix

The figures listed below show, for each combination of $\{N_T, p\}$ from each DGP, four panels. The top left panel is a histogram showing how often each value of α yields the lowest MSE over all 50 replications. The top right panel shows the average MSE for each value of α over all 50 replications. The bottom left panel shows the final shrinkage based model, with the black dot representing the λ chosen in the final model. The bottom right panel shows the average percentage change of the MSE for each value of V with respect to V = 10 over all 25 replications.

Figure 7: {N_T, p} = {100, 40} from DGP 1
Figure 8: {N_T, p} = {100, 40} from DGP 2
Figure 9: {N_T, p} = {100, 40} from DGP 3
Figure 10: {N_T, p} = {100, 100} from DGP 1
Figure 11: {N_T, p} = {100, 100} from DGP 2
Figure 12: {N_T, p} = {100, 100} from DGP 3
Figure 13: {N_T, p} = {100, 400} from DGP 1
Figure 14: {N_T, p} = {100, 400} from DGP 2
Figure 15: {N_T, p} = {100, 400} from DGP 3
Figure 16: {N_T, p} = {400, 40} from DGP 1
Figure 17: {N_T, p} = {400, 40} from DGP 2
Figure 18: {N_T, p} = {400, 40} from DGP 3
Figure 19: {N_T, p} = {400, 100} from DGP 1
Figure 20: {N_T, p} = {400, 100} from DGP 2
Figure 21: {N_T, p} = {400, 100} from DGP 3
Figure 22: {N_T, p} = {400, 400} from DGP 1
Figure 23: {N_T, p} = {400, 400} from DGP 2
Figure 24: {N_T, p} = {400, 400} from DGP 3
