
Background and Extensions of Lasso Applied to Bond and Stock Returns

Friday, July 17, 2015

Nick W. Koning

Abstract


Contents

1 Introduction
2 Historical Approach
  2.1 Two problems with OLS
  2.2 Subset Selection
  2.3 Ridge Regression
3 Lasso
  3.1 Definition
  3.2 Comparison with Ridge regression
  3.3 Comparison with subset selection
4 Variable Selection
  4.1 Asymptotic Distribution
  4.2 Variable selection consistency
5 Prediction Error
  5.1 Slow Rate Bound
  5.2 Fast Rate Bound
  5.3 Practical note on bounds
6 Practical Shrinkage Parameter Selection
7 Computation Algorithms
  7.1 LARS
    7.1.1 Forward Stagewise Regression
    7.1.2 Least Angle Regression
    7.1.3 Modification for Lasso
  7.2 glmnet
    7.2.1 Coordinate Descent
    7.2.2 glmnet Algorithm
8 Extensions
  8.1 Elastic Net
  8.2 Adaptive Lasso
  8.3 Grouped Lasso
  8.4 Graphical Lasso
  8.5 Bayesian Lasso
9 Application
  9.1 Prediction Accuracy Approach
  9.2 Variable Selection Approach
  9.3 Analysis of Selected Companies
  9.4 Conclusion


1 Introduction

The term “Lasso”, an acronym for Least Absolute Shrinkage and Selection Operator, was coined by Tibshirani (1996) in an attempt to popularize his new finding. Lasso is a technique that simultaneously performs variable selection and model fitting. Although Lasso was not immediately appreciated, due to the lack of an efficient computational algorithm, the scarcity of high-dimensional datasets and a limited understanding of its properties, it is now seen as an essential tool in Machine Learning, Signal Processing and many other fields (Tibshirani, 2011).

The main purpose of this report is to discuss the most important results related to Lasso, with the intention of giving an introduction to the technique, particularly its use in linear models. It is specifically aimed at those who are familiar with econometrics and consider using Lasso.¹ Additionally, an application of Lasso is considered in which the relationship between the prices of U.S. treasury bonds and a large number of stock prices is investigated. Specifically, the research question is: “How does Lasso perform in variable selection and prediction on stock market data?”. In this application, Lasso is able to achieve a 21% to 28% lower prediction error than OLS, while using only between 1% and 50% of the available variables.

The remainder of this report is organized as follows. In Section 2, two related techniques are discussed to provide a background in variable selection and prediction. Then, Section 3 introduces Lasso and compares it to the related techniques. Next, in Section 4, literature about the consistency of Lasso in variable selection is discussed.

¹For a more advanced discussion of Lasso beyond linear models, the reader is

2 Historical Approach

In this section I start by demonstrating two problematic cases for Ordinary Least Squares (OLS) and discuss some historical techniques to approach these issues.

2.1 Two problems with OLS

Consider a linear model with p parameters, for which there are n observations,

y = Xβ + ε, (1)

where the n-vector y is the response vector, the n × p matrix X is the input matrix and the n-vector ε contains errors with mean zero and some variance σ². To estimate the parameters of this model it is common to use OLS, which minimizes the Residual Sum of Squares:

β̂_OLS = arg min_β {(y − Xβ)'(y − Xβ)}.

A property of OLS is that it yields unbiased estimates with the lowest variance among linear unbiased estimators. While this is a very attractive property for an estimator, there exist biased estimators with lower variance which may be preferable in some cases.

Even when most coefficients are estimated reasonably well, a few poorly estimated coefficients may still result in a poor prediction of the response, as ŷ = Xβ̂_OLS aggregates the elements of β̂_OLS. Hence, OLS may suffer from bad prediction accuracy, as poor coefficient estimates may overshadow accurate estimates.

For the next example, consider a different linear model with p parameters and n observations,

y = Xθ + ε,

similar to model (1), but with the prior knowledge that θ is a sparse vector, which implies that some (or many) of its elements are exactly equal to zero. However, it is not known which elements are zero or how many. This is the standard model considered in variable selection, where the variables corresponding to the zero values in θ are considered irrelevant and those corresponding to non-zero values relevant.

Asymptotically, OLS could identify the relevant variables by estimating their coefficients as zero. However, OLS estimates will not be exactly equal to zero in practice. Even if one were to look for variables with coefficients close to zero, OLS performs poorly, especially if p approaches n. In the case that p > n, so there are more variables than observations, OLS is not even well defined.


2.2 Subset Selection

To be able to select the relevant variables, a traditional approach is subset selection. In subset selection, all possible subsets of the available set of input variables are considered as candidate models. The fitness of each candidate model is then measured and compared to select the optimal subset. One approach could be to compare the maximum likelihood values of the candidate models. Returning to the standard linear model as given in (1) and assuming normality of the errors, the negative log-likelihood for some estimator β̂ of β is proportional to the Residual Sum of Squares

RSS(β̂) = (y − Xβ̂)'(y − Xβ̂).

However, using this value directly as a measure of performance is problematic, since the inclusion of an extra variable will never increase the RSS, regardless of the relevance of the included variable. But while the RSS may not be a good way to measure the fitness of a model by itself, it does form an important component of many other criteria.

This can be seen in some of the most well known approaches to variable selection, including Mallows's Cp (Mallows, 1973),

Cp(β̂) = (y − Xβ̂)'(y − Xβ̂)/σ̂² − n + 2k,

and the Akaike (Akaike, 1974) and Bayesian (Schwarz, 1978) Information Criteria,

AIC(β̂) = (y − Xβ̂)'(y − Xβ̂) + 2σ̂²k,
BIC(β̂) = (y − Xβ̂)'(y − Xβ̂) + log(n)σ̂²k,

where k is the number of variables included in the evaluated model and σ̂² is an estimator of σ². The AIC and BIC, also known as penalized likelihood criteria or ℓ0-penalties, contain the RSS as a measure of performance and penalize for the number of parameters in the model. The name ℓ0-penalty comes from the zero power in the following notation. A different way of writing estimates based on these criteria is

β̂_ℓ0(λ) = arg min_β {(y − Xβ)'(y − Xβ) + λ Σ_{i=1}^p β_i⁰}
         = arg min_β {(y − Xβ)'(y − Xβ) + λ||β||₀⁰},

where β_i is the i-th element of β, λ is a parameter and ||β||₀⁰ = Σ_{i=1}^p β_i⁰ = k.² The AIC then corresponds to λ = 2σ² and the BIC to λ = log(n)σ² (van de Geer and Bühlmann, 2011). The usefulness of this notation will become apparent in Section 2.3. For more information about the AIC and BIC in variable selection, see Burnham and Anderson (2004).

Using one of these criteria, one may compare all candidate models and select the one with the lowest value. Comparing all possible subsets of variables and selecting the candidate model with the minimum criterion value is called best subset selection. As the number of possible combinations of variables grows exponentially in p, this becomes infeasible for p > 45, even when using efficient branch-and-bound algorithms (Hofmann et al., 2007).

Quick alternatives are forward- and backward-stepwise selection. In forward-stepwise selection, one starts with only the intercept and iteratively adds the variable that decreases the criterion the most. The procedure is stopped when every variable that could be added would increase the criterion value. Similarly, backward-stepwise selection starts with all parameters in the model and iteratively deletes the variable whose deletion would result in a decrease in the criterion value. Note that, unlike forward-stepwise selection, backward-stepwise selection requires the number of variables to be smaller than the number of observations, so that the starting model can be evaluated. A combination of the two is achieved by bidirectional-stepwise selection, which removes the worst and adds the best performing variable in each iteration.

While these directional stepwise selection procedures are feasible even for very large values of p, the restricted search path sacrifices performance (Hocking, 1976). Finally, subset selection in general suffers from high variability due to the discreteness of the selection process, as a variable is either included or not (Breiman, 1995).
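To make the stepwise procedure concrete, the following is a minimal sketch of forward-stepwise selection with a BIC-type criterion. It is written in Python with numpy; the criterion form, the simulated data and all names are illustrative choices and not taken from this report.

```python
# Minimal sketch of forward-stepwise selection with a BIC-style criterion,
# assuming a Gaussian linear model; everything here is illustrative.
import numpy as np

def bic(y, X_subset, n):
    """BIC up to constants: n*log(RSS/n) + log(n)*k."""
    if X_subset.shape[1] == 0:
        rss = np.sum(y ** 2)
    else:
        beta, *_ = np.linalg.lstsq(X_subset, y, rcond=None)
        rss = np.sum((y - X_subset @ beta) ** 2)
    k = X_subset.shape[1]
    return n * np.log(rss / n) + np.log(n) * k

def forward_stepwise(X, y):
    """Greedily add the variable that lowers the criterion most; stop when none does."""
    n, p = X.shape
    selected = []
    current = bic(y, X[:, selected], n)
    improved = True
    while improved and len(selected) < min(p, n - 1):
        improved = False
        best_j, best_val = None, current
        for j in range(p):
            if j in selected:
                continue
            val = bic(y, X[:, selected + [j]], n)
            if val < best_val:
                best_j, best_val = j, val
        if best_j is not None:
            selected.append(best_j)
            current = best_val
            improved = True
    return selected

# Example on simulated data with a sparse true coefficient vector.
rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:3] = [2.0, -1.5, 1.0]
y = X @ beta + rng.standard_normal(n)
print(forward_stepwise(X, y))
```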

2.3 Ridge Regression

Named after the related Ridge analysis, Ridge regression was initially developed to deal with non-invertible X'X matrices in OLS (Hoerl and Kennard, 1970). However, it was quickly discovered to have some appealing properties over ordinary least squares.

Compared to OLS, Ridge regression has a smaller variance, which makes it a popular tool in predictive modelling. It is also robust to multicollinearity (even perfect multicollinearity) and can therefore handle data where p > n. To illustrate the usefulness of Ridge regression, I start by giving its definition, after which I show some equivalent formulations, before I go into its properties.

It is assumed that the input matrix X does not contain an intercept and is standardized (such that its columns have mean zero and a variance equal to one), and that the response vector y is centred around zero. The need for standardization and centring will become clear later. The closed form solution of Ridge regression is given by

β̂_RR(λ) = [X'X + λI]⁻¹X'y,    λ ≥ 0, (2)

where I is the p × p identity matrix and λ ≥ 0 is a tuning parameter. From this formulation, it can be seen that the invertibility of (X'X + λI) is guaranteed by a positive value of λ. Seen this way, Ridge regression may appear to be an arbitrary method to guarantee invertibility, but its appealing properties become more intuitive from the following formulation,

β̂_RR(t) = arg min_β (y − Xβ)'(y − Xβ),    s.t. Σ_{i=1}^p β_i² ≤ t,    t ≥ 0, (3)

or

β̂_RR(λ) = arg min_β {(y − Xβ)'(y − Xβ) + λ||β||₂²} (4)
         = arg min_β {(y − Xβ)'(y − Xβ) + λβ'β},    λ ≥ 0, (5)

where ||·||₂ is the ℓ2-norm and t a tuning parameter. There exists a one-to-one relationship between t and λ; however, this relationship depends on the data.

The equivalence of (5) and the closed form (2) follows from taking the derivative

∂{(y − Xβ)'(y − Xβ) + λβ'β}/∂β = −2X'(y − Xβ) + 2λβ.

Equating this to zero and solving for β yields (2).
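As a quick illustration of the closed form (2), the Python sketch below computes the Ridge estimate on simulated, standardized data and compares its norm with that of OLS; the data and the value of λ are illustrative and not from this report.

```python
# Minimal sketch of the Ridge closed-form solution (2) on standardized data.
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 200, 10, 5.0
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize columns
y = X @ rng.standard_normal(p) + rng.standard_normal(n)
y = y - y.mean()                            # centre the response

beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.linalg.norm(beta_ridge), np.linalg.norm(beta_ols))  # ridge shrinks towards zero
```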

In the expression between the braces in (4), the RSS can be recognised, in addition to a term depending on λ and β. Like the AIC and BIC, Ridge regression is thus a penalized likelihood method; however, instead of an ℓ0-penalty it uses the ℓ2-norm, and it is therefore also known as ℓ2-penalization.

The use of this penalty emphasizes the requirement to standardize the input matrix beforehand, which allows the coefficient values to be compared on the same scale. Additionally, the response vector y is centred to avoid penalization of the intercept.

Unlike the criteria discussed in Section 2.2, Ridge regression does not perform variable selection. Instead of penalizing models with more variables, the penalty term of Ridge regression shrinks the values of the coefficient estimates, where the amount of shrinkage is governed by the tuning parameter λ. This shrinkage of the coefficients causes a downward bias in the Ridge estimates. However, while Ridge regression is biased, it also comes with a reduction in variance compared to OLS estimation.

Hoerl and Kennard (1970) show that the variance of the Ridge regression estimator is

Var(β̂_RR) = σ²WX'XW,

where W = (X'X + λI)⁻¹. Its bias is

Bias(β̂_RR) = −λWβ.

The squared bias is a monotonically increasing function of λ, while the variance decreases monotonically in λ. This corresponds to the observation that, in the case where p < n, Ridge regression converges to the unbiased OLS estimator as λ → 0 (t → ∞). Conversely, λ → ∞ (t → 0) yields all zero coefficients (and zero variance). Hence, the tuning parameter λ allows for a trade-off between bias and variance. From this, Hoerl and Kennard (1970) show that there always exists a λ > 0 such that Ridge regression has a lower mean squared error than OLS.

In conventional econometrics, Ridge regression is not commonly used as the introduced bias is an unappealing property when an accurate estimation of the coefficients is required. However, while the choice between OLS and Ridge regression may seem like a choice between an unbiased and a biased estimator, Ryan (1997) remarks that this choice is misleading since OLS is also biased in practice if the true underlying model is unknown.

3 Lasso

Lasso (Tibshirani, 1996) is a continuous variable selection tool, sharing some favourable properties with subset selection and Ridge regression. Like subset selection and Ridge regression, Lasso is a form of penalized likelihood, but instead of using an ℓ0- or ℓ2-penalty, Lasso fills the gap between these penalties by using an ℓ1-penalty.

Lasso also lies between subset selection and Ridge regression in its properties: rather than selecting variables discretely, Lasso shrinks the coefficient estimates towards zero in a way that often causes some coefficient estimates to be exactly equal to zero. This property makes Lasso a tool that is able to improve the prediction accuracy while simultaneously selecting a parsimonious model. In this section, I first define the Lasso, after which I compare it to the previously discussed Ridge regression and subset selection. The following sections are dedicated to its performance in variable selection and prediction, the selection of the tuning parameter, computational algorithms and extensions.

3.1 Definition

As in Ridge regression, a requirement for Lasso is that the input matrix X is standardized and the response vector y is centred around zero, to allow for the coefficients to be compared on the same scale and to prevent the intercept from being penalized. This is assumed for all further data to be discussed in this report. The Lasso coefficients are then given by

β̂_L(λ) = arg min_β {(y − Xβ)'(y − Xβ) + λ||β||₁}, (6)

where λ is a tuning parameter and ||·||₁ the ℓ1-norm. Unfortunately, due to the non-differentiability of the penalty term, Lasso lacks a closed form solution. An equivalent formulation as a quadratic program with a single constraint is

β̂_L(t) = arg min_β (y − Xβ)'(y − Xβ),    s.t. Σ_{i=1}^p |β_i| ≤ t, (7)

where t ≥ 0 is the tuning parameter. Similarly to Ridge regression, a one-to-one relation exists between the parameters t and λ, where this relationship depends on the data.

As can be seen most intuitively from (7), the value of the tuning parameter regulates the shrinkage of the coefficients. For example, if p < n, choosing t > ||β̂_OLS||₁ (λ = 0) is equivalent to OLS, while choosing t = 0 (λ → ∞) sets all coefficients equal to zero. This stresses the importance of the value of the tuning or shrinkage parameter. In Section 6, I further discuss the selection of this shrinkage parameter.
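A minimal sketch of computing Lasso estimates in practice is given below, using scikit-learn as one example implementation. Note that scikit-learn's Lasso minimizes (1/(2n))·RSS + α||β||₁, so its α is a rescaled version of λ in (6); the simulated data and the value of α are illustrative.

```python
# Minimal sketch of fitting Lasso on standardized data with scikit-learn.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
n, p = 100, 50
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
beta = np.zeros(p); beta[:5] = [3, -2, 1.5, 1, -1]   # sparse true coefficients
y = X @ beta + rng.standard_normal(n)
y = y - y.mean()

fit = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
print("non-zero coefficients:", np.sum(fit.coef_ != 0))
```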

3.2 Comparison with Ridge regression

Unlike Ridge regression, Lasso has a tendency to set the coefficient values of “less relevant” parameters equal to zero. This gives Lasso its variable selection properties, which Ridge regression lacks. An insight into why this property of the ℓ1-penalty does not hold for the ℓ2-penalty can be obtained from a two-parameter example that compares the contours of the RSS with the Lasso penalty constraint |β₁| + |β₂| ≤ t (left) and the Ridge penalty constraint β₁² + β₂² ≤ t (right). Since the feasible region for Ridge regression is shaped like a disc, the contour will in practice never touch the region at a point for which either of the coefficients is exactly equal to zero. On the other hand, the feasible region for Lasso is shaped like a diamond with its corners located on the axes. This shape makes it likely that the contour touches the region at one of the corners. As the corners are located on the axes, the touching of a corner sets one of the coefficients exactly equal to zero.


3.3 Comparison with subset selection

Compared to subset selection, Lasso gains its largest advantage in terms of computation time. As discussed in Section 2.2, the number of possible candidate models for a dataset with p variables is 2^p. Having to compare all possible candidate models results in a high computation time even for a small number of variables; in practice, subset selection remains infeasible for p > 45 even with the use of advanced techniques (Hofmann et al., 2007).


4 Variable Selection

Due to its significant computational advantage over subset selection, the variable selection properties of Lasso have been thoroughly studied. In this section I will discuss some of the most important results about variable selection with Lasso. I start with Knight and Fu (2000), who analyze some asymptotic properties and provide a sufficient condition for estimation consistency in the classical case of a fixed number of variables p. Next, I treat the findings by Zou (2006), who shows that the results by Knight and Fu do not imply consistency in variable selection. Finally, I explain the results of Zhao and Yu (2006), who provided an “almost necessary and sufficient” condition for consistency of Lasso in variable selection.

4.1 Asymptotic Distribution

In this subsection, I discuss the results by Knight and Fu (2000), who studied the asymptotic distribution of Bridge regression (Frank and Friedman, 1993), for which Lasso is a special case. I will only treat the special case of Lasso.

Assume that data is generated by the linear model

y = Xβ + ε, (8)

where y is the n-vector containing the centred response,

X = (X̃₁, . . . , X̃_p) = (x₁, . . . , x_n)'

is the n × p standardized input matrix with columns X̃_i, i = 1, . . . , p and rows x_i', i = 1, . . . , n, and ε is an n-vector containing the errors with mean 0 and some variance σ². Assume that

C_n = (1/n)X'X → C ≥ 0 as n → ∞, (9)

and

(1/n) max_{1≤i≤n} x_i'x_i → 0 as n → ∞. (10)

The Lasso estimator is given by

β̂^n_L = arg min_β {(y − Xβ)'(y − Xβ) + λ_n||β||₁},

where λ_n is allowed to depend on the number of observations. This is necessary for asymptotic results, as the number of observations will be allowed to grow. Lasso would be consistent if

β̂^n_L →_p β.

To consider the consistency, Knight and Fu define the function

Z_n(φ) = (1/n)(y − Xφ)'(y − Xφ) + (λ_n/n)||φ||₁,

which is minimized at φ = β̂^n_L. They then prove the result shown in Theorem 1, which shows that β̂^n_L is consistent if λ_n = o(n).

Theorem 1. If C > 0 and λ_n/n → λ₀ ≥ 0, then β̂^n_L →_p arg min_φ Z(φ), where

Z(φ) = (φ − β)'C(φ − β) + λ₀||φ||₁.

Even though λ_n = o(n) guarantees consistency, λ_n is required to grow more slowly for root-n consistency. However, if it grows too slowly, the asymptotic distribution will simply be identical to that of OLS, since the penalty term would be eliminated asymptotically. Knight and Fu propose the condition λ_n/√n → λ₀ ≥ 0, such that an “interesting” asymptotic distribution may be found, shown in Theorem 2.

Define sign[·] to return the sign of a value, or zero if it equals 0, and let I{c} be the indicator function, which equals 0 if the condition c is false and 1 if it is true. Knight and Fu then formulate the following theorem, which gives an asymptotic distribution for Lasso.

Theorem 2. If λ_n/√n → λ₀ ≥ 0 and C > 0, then

√n(β̂^n_L − β) →_d arg min_u V(u),

where

V(u) = −2u'W + u'Cu + λ₀ Σ_{j=1}^p (u_j sign[β_j] + |u_j| I{β_j = 0}),

W has a N(0, σ²C) distribution and λ₀ is some non-negative constant.

The proofs of Theorem 1 and Theorem 2 are omitted but can be found in Knight and Fu (2000).

4.2 Variable selection consistency


Consider again the linear model as in (8), but now the true coefficient vector β is assumed to be sparse, such that many of its elements are zero, but it is unknown which elements are zero or non-zero. This is a common assumption in variable selection, where the purpose is to discover which variables belong in the true model and which are irrelevant. Though this assumption may seem restrictive, Hastie et al. (2009) argue to “use a procedure that does well in sparse problems, since no procedure does well in dense problems”, where a dense problem is one in which the coefficient vector has few or no zero elements.

Without loss of generality, let β = (β₁, . . . , β_q, β_{q+1}, . . . , β_p)', such that β_i ≠ 0 for i = 1, . . . , q and β_i = 0 for i = q + 1, . . . , p. So, for ease of notation, it is assumed that the first q values are unequal to zero and the values indexed from q + 1 to p are zero. Some estimator β̃^n with elements β̃^n_i, i = 1, . . . , p is then said to be consistent in variable selection if

P({i : β̃^n_i ≠ 0} = {i : β_i ≠ 0}) → 1, as n → ∞. (11)

Using this definition of consistency in variable selection, Zou (2006) argues that under the condition specified in Theorem 2, Lasso is never consistent in variable selection. Specifically, he proves Theorem 3.

Theorem 3. If λ_n/√n → λ₀ ≥ 0, then the Lasso estimator β̂^n_L satisfies

lim sup_{n→∞} P({i : β̂^n_{L,i} ≠ 0} = {i : β_i ≠ 0}) ≤ c < 1,

where c is a constant depending on the true model.

In addition, he finds a necessary condition for Lasso to be consistent in variable selection, but this result is not as strong as the conditions found around the same time by Zhao and Yu (2006).

Zhao and Yu (2006) propose the Irrepresentable Condition, which is almost necessary and sufficient for Lasso to select the true model as n → ∞ for both a setting with fixed p and large p. To present this condition and its implications, some notation must first be introduced. In the setting of large p, it is assumed that the true coefficient vector βn and input matrix Xn are indexed by n to allow their dimensions to change as n grows. Similarly to the previous setting, a sparse coefficient vector is used.

Specifically, it is assumed that β^n = (β^n_1, . . . , β^n_q, β^n_{q+1}, . . . , β^n_p)', where β^n_j ≠ 0 for j = 1, . . . , q and β^n_j = 0 for j = q + 1, . . . , p. Let β^n(1) = (β^n_1, . . . , β^n_q)' and β^n(2) = (β^n_{q+1}, . . . , β^n_p)'. Furthermore, denote by X_n(1) and X_n(2) the first q and the last p − q columns of X_n, respectively. Recall the notation C_n = (1/n)X_n'X_n and let C_n be partitioned as

C_n = ( C_n^{11}  C_n^{12}
        C_n^{21}  C_n^{22} ),

where C_n^{11} = (1/n)X_n(1)'X_n(1), C_n^{22} = (1/n)X_n(2)'X_n(2), C_n^{12} = (1/n)X_n(1)'X_n(2) and C_n^{21} = (1/n)X_n(2)'X_n(1). Here, C_n^{11} is assumed to be invertible.

Zhao and Yu then define the Irrepresentable Condition as follows: there exists a positive constant vector η such that

|C_n^{21}(C_n^{11})^{-1} sign[β^n(1)]| ≤ 1 − η, (12)

where sign[β^n(1)] = (sign[β^n_1], . . . , sign[β^n_q])', 1 is a (p − q)-vector of ones, and the inequality holds elementwise.

Note that C_n^{21} is the covariance matrix between the columns of the irrelevant input variables X_n(2) and the relevant input variables X_n(1). Furthermore, (C_n^{11})^{-1} is the inverse of the covariance matrix of the columns of the relevant input variables X_n(1). Zhao and Yu give the following reasoning for the naming of the Irrepresentable Condition: by restricting the product of these two factors in (12), the condition restricts “the total amount of an irrelevant covariate represented by the covariates in the true model” to 1.

To consider the implications of this condition for consistency in variable selection, Zhao and Yu consider the stronger notion of Sign Consistency for Lasso, defined as

lim_{n→∞} P(sign[β̂^n(λ_n)] = sign[β^n]) = 1,

which not only requires the correct elements to be equal to zero, but also requires the non-zero elements to have the correct sign.

First, I revisit the setting of fixed p (and q), where the true coefficient vector β is fixed and the assumptions (9) and (10) hold. Zhao and Yu then prove the following theorem.

Theorem 4. For fixed q, p and β^n = β, under regularity conditions (9) and (10), Lasso is strongly sign consistent if the Irrepresentable Condition holds. That is, when the Irrepresentable Condition holds, for all λ_n that satisfy λ_n/n → 0 and λ_n/n^{(1+c)/2} → ∞ with 0 ≤ c < 1, we have

P(sign[β̂^n(λ_n)] = sign[β^n]) = 1 − o(e^{−n^c}).

The proof of Theorem 4 is omitted and may be found in Zhao and Yu (2006).

Combining this result with the result in Theorem 1 from Knight and Fu implies that Lasso is both consistent in variable selection and in parameter estimation under the Irrepresentable Condition. Additionally, Zhao and Yu show that in the fixed p case, Lasso is consistent in variable selection only if the following slightly weaker version of the Irrepresentable Condition holds:

|C_n^{21}(C_n^{11})^{-1} sign[β^n(1)]| < 1.

As this condition is very similar to (12), with the exception of a minor technicality, they claim that the Irrepresentable Condition is “almost” necessary and sufficient for Sign Consistency.

In the case of large p and q, the dimensions of β^n and C_n may grow as n grows, so q_n and p_n are indexed by n. This renders conditions (9) and (10) unsuitable, since C_n no longer converges to a fixed matrix and the elements of β^n may change. Recall that X̃^n_i denotes a column of X_n. Zhao and Yu then make the following assumptions to guarantee appropriate behaviour of the dimensions and the data: there exist 0 ≤ c₁ < c₂ ≤ 1 and M₁, M₂, M₃ > 0, such that

(1/n)(X̃^n_i)'X̃^n_i ≤ M₁, ∀i, (13)
α'C_n^{11}α ≥ M₂, ∀α with ||α||₂² = 1, (14)
q_n = O(n^{c₁}), (15)
n^{(1−c₂)/2} min_{i=1,...,q} |β^n_i| ≥ M₃. (16)

Theorem 5. Assume the ε^n_i are i.i.d. random variables with finite 2k-th moment, E(ε^n_i)^{2k} < ∞ for an integer k > 0. Under conditions (13), (14), (15) and (16), the Irrepresentable Condition implies that Lasso has sign consistency for p_n = o(n^{(c₂−c₁)k}). In particular, for all λ_n that satisfy λ_n/√n = o(n^{(c₂−c₁)/2}) and (1/p_n)(λ_n/√n)^{2k} → ∞, we have

P(sign[β̂^n(λ_n)] = sign[β^n]) ≥ 1 − o(e^{−n^{c₃}}) → 1, as n → ∞.

The proof of Theorem 5 is omitted and can be found in Zhao and Yu (2006). This implies that Lasso is also consistent in model selection in the case of large p and q under the Irrepresentable Condition, provided some moments of the noise terms are finite.

These results by Zhao and Yu (2006) are of great importance, because they confirm empirical evidence that Lasso performs well in variable selection. With the Irrepresentable Condition, it is clear when Lasso will asymptotically select the true model, as well as when it will not.

5 Prediction Error

The prediction properties of Lasso are not yet well understood (Dalalyan et al., 2014; Hebiri and Lederer, 2013) and many results are very recent. The recent literature is focused on finding upper bounds for the prediction error of Lasso. To measure the prediction error, the common choice is the Mean Squared Prediction Error (MSPE), defined as

MSPE(β̂) = ||X(β̂ − β)||₂²,

for some estimator β̂ of β. Most results about the prediction error of Lasso use the MSPE.

The results about the prediction performance are divided into two categories of bounds: the “slow” rate bounds and the “fast” rate bounds. The slow rate bounds restrict the MSPE by some value proportional to the shrinkage parameter λ, while the fast rate bounds restrict the MSPE by a value proportional to the square of λ. It is interesting that the bounds depend on the shrinkage parameter: a higher value for λ implies a stronger penalization and hence a more parsimonious model, but a trade-off is made in the prediction error bound, which grows in λ. This trade-off is also noticed to some extent in the application in Section 9.

5.1 Slow Rate Bound

These bounds are called “slow”, as they converge at a rate of √(q/n), where q is the number of relevant parameters. The slow rate bounds are proportional to the shrinkage parameter λ, which means that the MSPE is bounded by a value that grows linearly in λ. A slow rate bound given by Koltchinskii et al. (2011) is described as follows. Define the set

T = { λ : sup_β 2σ|ε'Xβ|/||β||₁ ≤ λ }.

They then show that the MSPE has the bound

||X(β̂_L − β)||₂² ≤ 2λ||β||₁,

where λ ∈ T. The proof of this bound is omitted and may be found in Koltchinskii et al. (2011).

5.2 Fast Rate Bound

The fast rate bounds are proportional to λ² and converge much faster than √(q/n), though not all fast rate bounds converge at the same rate. A fast rate bound as given by Koltchinskii et al. (2011) is defined as follows.

Let J denote a subset of {1, . . . , p}, let J₀ contain the indices of the non-zero elements of β, and let J^c be the complement of J. Furthermore, let |·| denote the cardinality of a set. The restricted eigenvalue assumption from Bickel et al. (2009) is then made; denote the corresponding restricted eigenvalue by φ(s̄). Assuming that s̄ ≥ |J₀|, the fast rate bound given by Koltchinskii et al. is

||X(β̂_L − β)||₂² ≤ λ²s̄/(nφ²(s̄)),

where λ ∈ T. The proof of this bound is omitted and may be found in Koltchinskii et al. (2011).

5.3 Practical note on bounds


6 Practical Shrinkage Parameter Selection

In practice, there exists no universal method to select the shrinkage parameter. Therefore, I discuss some popular techniques to select the shrinkage parameter λ.

Proposing Lasso, Tibshirani (1996) suggested the use of k-fold cross-validation to select the shrinkage parameter, using the prediction accuracy as performance measure. In cross-validation, the observations are (randomly) divided into k subsets, or folds, and a set of candidate values for λ is preselected. Then, for each candidate value of λ, each fold is iteratively left out and the coefficients are estimated on the remaining k − 1 folds. The prediction error is then measured on the left-out fold.

Repeating this procedure for every fold yields k values of the prediction error for each candidate value of λ. Averaging these values yields the average prediction error at that candidate value.

Repeating this procedure for every candidate value of λ allows the value of λ with the lowest average prediction error to be selected and used for the dataset. This technique is able to find a “good” value of λ for the specific dataset for predictive modelling.

To obtain a more parsimonious model, Hastie et al. (2009) suggest the one-standard-error rule, which selects the most parsimonious model whose prediction error is within one standard error of that of the value for λ that minimizes the prediction error. Alternatively, Roberts and Nowak (2014) propose repeating the cross-validation a large number of times to yield a vector of optimal values for λ and using the 95th percentile of this set. The cross-validation and one-standard-error rule are implemented in the R-package ‘glmnet’.
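The sketch below illustrates both selection rules in Python, using scikit-learn's LassoCV as a stand-in for the R-package ‘glmnet’ mentioned above; the one-standard-error rule is applied manually to the cross-validation error curve, and the data are simulated for illustration.

```python
# Sketch of lambda selection by 10-fold cross-validation and a manual
# one-standard-error rule; data and parameters are illustrative.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
n, p = 200, 100
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:5] = [2, -2, 1, -1, 0.5]
y = X @ beta + rng.standard_normal(n)

cv = LassoCV(cv=10, fit_intercept=False).fit(X, y)

mean_mse = cv.mse_path_.mean(axis=1)             # average CV error per alpha
se_mse = cv.mse_path_.std(axis=1) / np.sqrt(10)  # standard error per alpha
i_min = np.argmin(mean_mse)
threshold = mean_mse[i_min] + se_mse[i_min]
# largest alpha (most shrinkage, most parsimonious model) within one SE of the minimum
alpha_1se = cv.alphas_[mean_mse <= threshold].max()

print("alpha minimizing CV error:", cv.alpha_)
print("one-standard-error alpha: ", alpha_1se)
```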

To address the choice of λ in variable selection, Wang et al. (2009) propose a modified BIC criterion to select the optimal value for λ, replacing the prediction error as the selection criterion. The criterion is defined as

BIC_S = log(σ̂²_S) + |S| × (log(n)/n) × C_n,

where |S| denotes the number of non-zero parameters in the candidate model, C_n > 0 is a chosen parameter and

σ̂²_S = inf_{β_S} { (y − X_S β_S)'(y − X_S β_S)/n },

where X_S is the input matrix with only the columns that correspond to the non-zero coefficient estimates. However, this procedure is not feasible for p > n, as σ̂²_S = 0 in that case. Chand (2012) confirms the good variable selection performance of this modified BIC in a Monte Carlo study and suggests the use of C_n = √n/p. This criterion is implemented in the R-package ‘msgps’.

Finally, Sun et al. (2013) propose the PASS criterion (Prediction and Stability Selection). This criterion is implemented in the R-package ‘pass’.


7 Computation Algorithms

In this section I give some background on the computational algorithms for Lasso that have been developed over time. First, it is explained how the LARS algorithm works, after which the glmnet algorithm is discussed.

Since the introduction of Lasso in 1996, many computation algorithms have been proposed. Initially, computation of Lasso estimates was done with standard convex optimization algorithms, which are very computationally demanding, until Efron et al. (2004) proposed the LARS algorithm, which exploits the piecewise linearity of the Lasso path with respect to the shrinkage parameter. This algorithm not only gave an efficient way to compute Lasso, it also allowed an efficient computation of the entire Lasso path (i.e., the solutions for all values of the shrinkage parameter). The availability of this efficient algorithm strongly contributed to the appreciation of Lasso, especially since it made Lasso feasible to use on large datasets. Additionally, it gave an easily interpretable insight into the effect of the shrinkage parameter on the selection of variables (Tibshirani, 2011).


7.1 LARS

The LARS (Least Angle Regression) algorithm is an algorithm designed to compute the Least Angle Regression. However, Efron et al. (2004) show that a small adjustment to the algorithm allows the computation of the Lasso path. To discuss the algorithm, I start with Least Angle Regression and the closely related Forward Stagewise Regression. Then, I introduce the LARS algorithm and explain the modification of the algorithm to solve Lasso.

7.1.1 Forward Stagewise Regression

Least Angle Regression is closely related to Forward Stagewise Regression, which iteratively builds up the coefficients by updating, in small steps, the coefficient that has the largest correlation with the current residual. To elaborate, I first introduce some notation, following Efron et al. (2004). Let β̂ be some candidate estimate of the coefficients β in a linear model. Then, the vector containing the predictions at β̂ is denoted by

μ̂ = μ̂(β̂) = Xβ̂.

Next, define the vector of current correlations for some value of μ̂,

ĉ = ĉ(μ̂) = X'(y − μ̂),

with elements ĉ_j. The vector ĉ is proportional to the correlations between the input variables and the current residual vector. In each step, Forward Stagewise Regression takes a small step in the direction of the variable most correlated with the current residual vector:

μ̂_{i+1} = μ̂_i + ρ · sign(ĉ_ĵ) x_ĵ,    ĵ = arg max_j |ĉ_j|,

where x_ĵ is the ĵ-th column of X, i is the current step of the Stagewise Regression, i + 1 is the next step, and ρ is a small constant. This procedure is continued until all variables have zero correlation with the current residuals. At this point, the coefficient estimates are equivalent to OLS if p < n.

7.1.2 Least Angle Regression

Least Angle Regression is based on the same principle of updating the most correlated variable, but rather than taking small steps, it uses leaps. Specifically, it starts like Stagewise Regression, with all coefficients equal to zero, and finds the variable most correlated with the response vector y. It then moves the coefficients in the direction of this variable, until it reaches the point at which another variable has the same correlation with the current residual. From this point, it moves the coefficients in the direction precisely between the two most correlated variables, until they have the same correlation as a third. The procedure is continued until either all p variables have entered the model or n − 1 steps have been made.

I will now describe the algorithm in more detail. It starts with

μ̂₀ = 0.

At each step, it then computes the correlations with the current residual vector,

ĉ = X'(y − μ̂_A),

with elements ĉ_j, and the set A contains the variables that have the highest absolute current correlation with the current residual vector:

A = {i : |ĉ_i| = max_j |ĉ_j|}.

Note that this will be multiple variables due to the nature of the LARS algorithm. The algorithm then updates μ̂_A, which is based on the current set of variables A, to

μ̂_{A+} = μ̂_A + γ̂ u_A, (17)

where u_A is the equiangular (“same angle”) vector, γ̂ is the distance to move along this vector until a new variable would enter the set A, and A+ is the updated set after the new variable has entered. The vector u_A and scalar γ̂ are defined more formally below. This process is continued until the number of variables in the model (the number of elements in A) equals min(p, n − 1).

I will now show how to compute γ̂ and u_A, following Efron et al. (2004). First, define the matrix

X_A = (· · · s_j x_j · · ·)_{j∈A},

where s_j is the sign of the current correlation between variable j and the current residual,

s_j = sign[ĉ_j].

Next, let

G_A = X_A'X_A,    A_A = (ι_A'G_A^{-1}ι_A)^{-1/2},

where ι_A is a vector of ones with length equal to the number of elements in A. Then, the equiangular vector u_A is given by

u_A = X_A w_A,    where    w_A = A_A G_A^{-1} ι_A.

Now, define

a = X'u_A,

with elements a_j. Then, γ̂ is given by

γ̂ = min+_{j∈A^c} { (Ĉ − ĉ_j)/(A_A − a_j), (Ĉ + ĉ_j)/(A_A + a_j) },

where Ĉ = max_j |ĉ_j|, min+ denotes that the minimum is only taken over the positive components for each j, and A^c is the set containing all variables not in A.

(a) LAR    (b) Lasso

Figure 2: Example of the LAR and Lasso paths, plotted against the sum of the absolute values of the coefficients at each point, for simulated data with 20 parameters and 10 observations. The numbers on the right are indices for the variables.

7.1.3 Modification for Lasso

The Lasso path differs from the Least Angle Regression path in one respect: in Lasso, when a coefficient hits zero, it is removed from the current set of variables A (though it may re-enter the set later). This can also be seen in Figure 2, where, for example, the coefficient of variable 1 crosses the x-axis in Least Angle Regression (2a), but remains zero after it touches the x-axis in Lasso (2b).

The implementation of the modification works as follows. Define d̂ with elements

d̂_j = s_j w_{Aj},

where w_{Aj} is the j-th element of w_A. Since

μ̂(γ) = Xβ̂(γ),

it can be shown that, along with the update (17), the elements of β̂ are updated to

β̃_j(γ) = β̂_j + γ d̂_j,    j ∈ A.

Extrapolating, the value of γ at which β̃_j(γ) would become zero is, for each j, given by

γ_j = −β̂_j/d̂_j.

Hence, the value of γ at which the first coefficient β̃_j reaches zero in the current step is

γ̃ = min_{γ_j > 0} {γ_j}.

Since the angle will change at the start of the next step, one should check whether the first time a coefficient hits zero in the current step happens before the step is completed, that is, whether γ̃ < γ̂. Let j̃ denote the index of the variable that hits zero. The modification of the LARS algorithm to obtain the Lasso path is then given by Efron et al. (2004) as follows:

Lasso Modification. If γ̃ < γ̂, stop the ongoing LARS step at γ = γ̃ and remove j̃ from the calculation of the next equiangular direction. That is,

μ̂_{A+} = μ̂_A + γ̃ u_A,

and A+ = A \ {j̃}.

An implementation of the LARS algorithm and its Lasso modification is publicly available in the R-package ‘lars’, by T. Hastie and B. Efron.
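For readers working in Python, the sketch below computes the Lasso path with a LARS-type algorithm via scikit-learn's lars_path, as an analogue of the R-package ‘lars’; the simulated data loosely mirror the setting of Figure 2 (20 parameters, 10 observations) but are otherwise arbitrary.

```python
# Sketch of computing the LAR and Lasso paths with scikit-learn's lars_path.
import numpy as np
from sklearn.linear_model import lars_path

rng = np.random.default_rng(4)
n, p = 10, 20
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# method="lasso" applies the Lasso modification; method="lar" gives plain LAR.
alphas, active, coefs = lars_path(X, y, method="lasso")
print("number of path breakpoints:", len(alphas))
print("active set at the end:", active)
```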

7.2 glmnet


7.2.1 Coordinate Descent

Coordinate descent is an optimization algorithm in which the minimum of a multivariate function f(θ) is sought by minimizing over one variable at a time. For example, one could cyclically minimize f(θ) over each of the elements of θ = (θ₁, . . . , θ_p), holding the other elements constant and updating each element after it has been minimized over. In particular, after step i in cycle k the update of one of the parameters is given by

θ_i^{k+1} = arg min_z f(θ₁^{k+1}, . . . , θ_{i−1}^{k+1}, z, θ_{i+1}^k, . . . , θ_p^k).

The procedure is stopped when no new minimum is found after a full cycle.

Coordinate descent is not guaranteed to find a global minimum for arbitrary functions. However, under some mild conditions, Tseng (2001) proves that coordinate descent finds a global minimum for a function of the form

f(θ) = g(θ) + Σ_{i=1}^p h_i(θ_i), (18)

where g(θ) is convex and differentiable, and h_i(θ_i) is convex for i = 1, . . . , p. Returning to Lasso,

β̂_L(λ) = arg min_β { (y − Xβ)'(y − Xβ) + λ Σ_{i=1}^p |β_i| },

the objective has exactly this form, with g(β) = (y − Xβ)'(y − Xβ), which is convex and differentiable, and h_i(β_i) = λ|β_i|, which is not differentiable but convex. Hence, coordinate descent can be used to compute the Lasso solution.

7.2.2 glmnet Algorithm

The glmnet algorithm (Friedman et al., 2010) uses coordinate descent to solve the Elastic Net. I will present the glmnet algorithm only for Lasso.

Based on the work of Friedman et al. (2007) and van der Kooij (2007), Friedman et al. (2010) propose the use of coordinate descent for computing Lasso with the update

β̃_j^{k+1} = S( (1/n) Σ_{i=1}^n x_{ij} y_i − Σ_{m : |β̃_m^k| > 0} (1/n) x_j'x_m β̃_m^k + β̃_j^k,  λ ),

where the soft-thresholding operator S is defined as

S(z, γ) = sign(z)(|z| − γ)₊ = { z − γ  if z > 0 and γ < |z|;  z + γ  if z < 0 and γ < |z|;  0  if γ ≥ |z| }.

Since the columns of X are standardized, (1/n)x_j'x_j = 1, so the term β̃_j^k added at the end cancels the m = j contribution in the sum; the argument of S is thus the simple regression coefficient of the j-th variable on the current partial residual.

After completing an initial coordinate descent cycle through all parameters for a given value of λ, the algorithm only cycles through the parameters with non-zero coefficients in the next iterations, until the values have converged. A cycle through all parameters is then made to confirm that the set of non-zero coefficients does not change. If it does change, the process is repeated.
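To make the update above concrete, the following is a minimal numpy re-implementation of cyclic coordinate descent with soft-thresholding for the objective (1/(2n))·RSS + λ||β||₁, assuming standardized columns and a centred response; it is an illustrative sketch, not the glmnet code.

```python
# Minimal cyclic coordinate descent for Lasso with soft-thresholding.
import numpy as np

def soft_threshold(z, gamma):
    """S(z, gamma) = sign(z) * max(|z| - gamma, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_coordinate_descent(X, y, lam, n_cycles=200, tol=1e-8):
    """Minimize (1/(2n))*||y - X beta||^2 + lam*||beta||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    residual = y.copy()                          # r = y - X @ beta
    for _ in range(n_cycles):
        max_change = 0.0
        for j in range(p):
            old = beta[j]
            # partial residual excludes the j-th variable's current contribution
            rho = X[:, j] @ (residual + X[:, j] * old) / n
            beta[j] = soft_threshold(rho, lam)   # columns standardized: x_j'x_j/n = 1
            residual += X[:, j] * (old - beta[j])
            max_change = max(max_change, abs(beta[j] - old))
        if max_change < tol:
            break
    return beta

# Illustrative usage on simulated, standardized data.
rng = np.random.default_rng(5)
n, p = 200, 50
X = rng.standard_normal((n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X[:, 0] * 2 - X[:, 1] + rng.standard_normal(n)
y = y - y.mean()
print(np.nonzero(lasso_coordinate_descent(X, y, lam=0.2))[0])
```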

To compute the solutions for a sequence of values of λ, the glmnet algorithm starts at the smallest value λ_max for which the entire vector β̃ = 0. It then finds the Lasso solutions for a decreasing sequence of λ, using the coefficient values from the previous solution as starting values for the next solution.

Using these warm starts significantly increases the speed at which the entire Lasso path may be computed. Friedman et al. (2010) even claim to have found cases in which it was faster to compute the entire path down to a small value of λ than to compute the solution for that value of λ directly.


8 Extensions

In this section I briefly discuss some of the most important extensions of Lasso.

8.1 Elastic Net

The Elastic Net (Zou and Hastie, 2005) is a combination of Lasso and Ridge regression. Rather than using either the ℓ1-penalty or the ℓ2-penalty, the Elastic Net uses a linear combination of both, with an additional tuning parameter α. It is defined as

β̂_EN(λ, α) = arg min_β { (y − Xβ)'(y − Xβ) + λ[α||β||₁ + (1 − α)||β||₂²] }.

Changing the value of α makes the properties of the solution lean more towards Ridge regression or Lasso. Specifically, the Elastic Net is equivalent to Lasso for α = 1 and to Ridge regression for α = 0.

An intuition for the effect of the Elastic Net penalty may be obtained from Figure 3. In this figure, the shape of the constraint region is plotted for the Lasso, Ridge and Elastic Net (α = 0.5) penalties. Where the Ridge penalty region is a perfect circle and the Lasso penalty region is a diamond, the Elastic Net penalty region is a combination of both: its sides are rounded, but it retains the corners on the axes.


Figure 3: Two-parameter case comparison of the Lasso (-·-·-·-), Ridge (– – –) and Elastic Net with α = 0.5 (—) penalties (from Zou and Hastie (2005)).
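A short sketch of fitting the Elastic Net follows, using scikit-learn's ElasticNet; note that in this implementation the overall penalty strength is called alpha and the mixing parameter l1_ratio plays the role of α in the formula above. The data and parameter values are illustrative.

```python
# Sketch of the Elastic Net with scikit-learn.
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(6)
n, p = 100, 40
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -1.0, 1.5]) + rng.standard_normal(n)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5, fit_intercept=False).fit(X, y)
print("non-zero coefficients:", np.sum(enet.coef_ != 0))
```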

8.2 Adaptive Lasso

The Adaptive Lasso (Zou, 2006) is an extension of Lasso which addresses the problem of variable selection consistency, as specified in (11) in Section 4.2, by allowing different weights on the penalties of the individual variables. The Adaptive Lasso is defined as

β̂_AL = arg min_β { (y − Xβ)'(y − Xβ) + λ ŵ'|β| },

where ŵ is a weight vector and |β| = (|β₁|, . . . , |β_p|)'. To select a weight vector, Zou proposes the following: select some γ > 0 (the selection can be done through cross-validation) and define the weight vector ŵ = 1/|β̂|^γ elementwise, where β̂ is some root-n-consistent estimator of β, such as OLS.

In addition to consistency in variable selection (see equation (11) in Section 4.2), Zou (2006) shows that the coefficient estimates of the selected variables are asymptotically normally distributed for the Adaptive Lasso.
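The Adaptive Lasso can be computed with any plain Lasso solver via a rescaling of the columns, as sketched below in Python; the pilot estimator (OLS), the value of γ and the penalty level are illustrative choices.

```python
# Sketch of the Adaptive Lasso via rescaling: solve a Lasso on X_j / w_j and
# transform the coefficients back.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(7)
n, p = 200, 20
X = rng.standard_normal((n, p))
beta = np.zeros(p); beta[:3] = [3.0, -2.0, 1.0]
y = X @ beta + rng.standard_normal(n)

gamma = 1.0
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]       # root-n-consistent pilot
w = 1.0 / (np.abs(beta_ols) ** gamma + 1e-12)         # adaptive weights

X_scaled = X / w                                       # X_j / w_j, column-wise
theta = Lasso(alpha=0.05, fit_intercept=False).fit(X_scaled, y).coef_
beta_adaptive = theta / w                              # transform back
print("selected variables:", np.nonzero(beta_adaptive)[0])
```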

8.3 Grouped Lasso

The Grouped Lasso (Yuan and Lin, 2007) is an extension of Lasso that allows for the selection of groups of variables. The regular Lasso is only capable of performing variable selection on individual variables, but in some situations it is preferable to add or remove sets of variables together. In the definition of the Grouped Lasso I follow the notation of Hastie et al. (2009).

Suppose the p variables are divided into L groups, with p_ℓ variables in group ℓ. Let X_ℓ be the n × p_ℓ input matrix containing the variables in group ℓ, with coefficient vector β_ℓ. The Grouped Lasso is then given by

β̂_GL = arg min_β { (y − Σ_{ℓ=1}^L X_ℓ β_ℓ)'(y − Σ_{ℓ=1}^L X_ℓ β_ℓ) + λ Σ_{ℓ=1}^L √p_ℓ ||β_ℓ||₂ },

where ||·||₂ denotes the ℓ2-norm. In this definition, the √p_ℓ term corrects for the different sizes of the groups. Furthermore, the use of the (unsquared) ℓ2-norm causes the coefficients within a group to be shrunk to zero jointly, so that a whole group of variables is either selected or excluded together.
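One simple way to compute the Grouped Lasso is proximal gradient descent, where each group is updated with a block soft-thresholding step; the sketch below is an illustrative solver under these assumptions, with simulated data and an arbitrary penalty level.

```python
# Sketch of the Grouped Lasso via proximal gradient descent with block
# soft-thresholding; illustrative, not a reference implementation.
import numpy as np

def block_soft_threshold(z, thresh):
    """Proximal operator of thresh * ||.||_2: shrink the whole block towards zero."""
    norm = np.linalg.norm(z)
    if norm <= thresh:
        return np.zeros_like(z)
    return (1.0 - thresh / norm) * z

def group_lasso(X, y, groups, lam, n_iter=2000):
    """Minimize ||y - X beta||^2 + lam * sum_l sqrt(p_l) * ||beta_l||_2."""
    n, p = X.shape
    beta = np.zeros(p)
    step = 1.0 / (2.0 * np.linalg.eigvalsh(X.T @ X).max())  # 1 / Lipschitz constant
    for _ in range(n_iter):
        grad = -2.0 * X.T @ (y - X @ beta)
        z = beta - step * grad
        for idx in groups:                                   # idx: indices of one group
            thresh = step * lam * np.sqrt(len(idx))
            beta[idx] = block_soft_threshold(z[idx], thresh)
    return beta

# Illustrative usage: 4 groups of 3 variables, only the first group is relevant.
rng = np.random.default_rng(8)
n, p = 100, 12
groups = [list(range(3 * g, 3 * g + 3)) for g in range(4)]
X = rng.standard_normal((n, p))
y = X[:, :3] @ np.array([2.0, -1.0, 1.5]) + rng.standard_normal(n)
print(np.round(group_lasso(X, y, groups, lam=40.0), 2))
```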


8.4 Graphical Lasso

The Graphical Lasso (Meinshausen and Bühlmann, 2006; Friedman et al., 2008) is somewhat different from the other extensions, as it does not directly involve the estimation of variable coefficients in a linear model. The purpose of the Graphical Lasso is to estimate the elements of a sparse precision (inverse covariance) matrix for multivariate normally distributed random variables, which includes the Gaussian linear model.

Consider the normally distributed p-dimensional random variable

X = (x₁, . . . , x_p) ∼ N(μ, Σ),

where x_i, i = 1, . . . , p are the columns of the n × p data matrix X, and N(μ, Σ) denotes the normal distribution with mean μ and covariance matrix Σ. Note that this corresponds to a Gaussian linear model by, for example, taking x₁ as the response variable and the matrix (x₂, . . . , x_p) as the input matrix. Furthermore, define the precision matrix Θ = Σ^{-1} and let

S = (1/(n − 1))(X − X̄)'(X − X̄)

be the sample covariance matrix, where X̄ = ι_n x̄' and x̄ is a p-vector containing the column means. The Graphical Lasso is then defined as

Θ̂_GL = arg max_{Θ ≥ 0} { log det Θ − tr(SΘ) − λ||Θ||₁ },

where λ is the shrinkage parameter and the maximization is over positive semidefinite matrices Θ.

This idea of covariance selection goes back to Dempster (1972), who argues that having a substantial number of zero elements in the estimate of a sparse covariance matrix may substantially reduce the noise.
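The sketch below estimates a sparse precision matrix with scikit-learn's GraphicalLasso; the chain-structured example and the value of alpha are illustrative.

```python
# Sketch of sparse precision-matrix estimation with GraphicalLasso.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(9)
n, p = 500, 8
# A sparse precision matrix: a chain structure with only neighbouring dependence.
theta = np.eye(p) + 0.4 * (np.eye(p, k=1) + np.eye(p, k=-1))
sigma = np.linalg.inv(theta)
X = rng.multivariate_normal(np.zeros(p), sigma, size=n)

model = GraphicalLasso(alpha=0.05).fit(X)
print("estimated non-zero off-diagonal entries:",
      np.sum(np.abs(model.precision_[np.triu_indices(p, k=1)]) > 1e-6))
```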

8.5 Bayesian Lasso

Introducing the Lasso, Tibshirani (1996) noted that a natural Bayesian interpretation of the Lasso exists. Consider the standard linear model as in (1) with ε ∼ N(0, σ²I_n). Then, Lasso is equivalent to a Bayesian posterior mode estimate if the elements of β have independent Laplace priors,

π(β) = Π_{i=1}^p (λ/2) e^{−λ|β_i|},

and σ² has some independent prior π(σ²). The posterior distribution conditional on y is then proportional to

π(β, σ²|y) ∝ π(σ²)π(β)f(y|β, σ²)
          ∝ π(σ²)(σ²)^{−(n−1)/2} exp{ −(1/(2σ²))(y − Xβ)'(y − Xβ) − λ Σ_{i=1}^p |β_i| }, (19)

where f(y|β, σ²) is the conditional likelihood of the normal distribution. Comparing (19) to (6), it can be seen that maximizing this posterior distribution over β, which yields the posterior mode, is indeed equivalent to Lasso for a constant σ².

Park and Casella (2008) propose the Bayesian Lasso, expanding on the work by Tibshirani (1996), by using the improper prior π(σ²) = 1/σ² for σ² and a conditional prior on β equal to

π(β|σ²) = Π_{i=1}^p (λ/(2√σ²)) exp(−λ|β_i|/√σ²).

Additionally, the Bayesian framework allows for a hyperprior to be placed on λ. Park and Casella suggest placing a Gamma prior on λ²,

π(λ²) = (δ^r/Γ(r)) (λ²)^{r−1} e^{−δλ²},    λ², δ, r > 0,

where the parameters δ and r have to be selected, for example through (marginal) maximum likelihood. The use of a hyperprior allows for a natural way to select the shrinkage parameter.

9 Application

In this section, I will demonstrate Lasso by investigating the relationship between United States Treasury securities and stocks. Specifically, the 5-, 10- and 30-year treasury note prices³ and the stock prices of the companies in the Standard & Poor's 500 are used. All data have been gathered from Yahoo!-Finance using the R-package ‘quantmod’, with daily observations of the adjusted closing price between the 3rd of January, 2007 and the 2nd of July, 2015.

To demonstrate the functionality of Lasso, I will first use a prediction accuracy approach, comparing the prediction error of Lasso with that of OLS. In the second part, I use a variable selection approach, looking primarily at the number of selected variables. Although the true coefficient vector is unlikely to be sparse, as is assumed in most results about variable selection, the achieved parsimony is still beneficial in terms of interpretability. Finally, I compare the industry shares and average market capitalizations of the companies selected by the models with those of the entire set of companies.

For the models, I ignore the time scale of the data and use a cross-sectional model with the daily percentage change in the treasury note price as response and the daily percentage changes of the stock prices as input. From the dataset, the 38 stocks for which more than 4 data points are missing were removed. The 5 remaining observations with missing data were also deleted. This leaves 2135 daily observations of 464 stocks and each of the treasury note prices.

³Since the prices are not directly available, I convert the yields into prices using


9.1 Prediction Accuracy Approach

For the prediction approach, I use 10-fold cross-validation with the prediction accuracy as selection criterion for λ, as proposed by Tibshirani (1996) and implemented in the package ‘glmnet’. To do this, I use randomly selected folds and repeat the procedure 50 times to average the cross-validation error curve. An example of such a curve is displayed in Figure 5. The left vertical dotted line corresponds to the value for λ yielding the lowest prediction error.
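This repeated cross-validation procedure can be mimicked in Python as sketched below, with LassoCV taking the place of ‘glmnet’; the matrices X and y are simulated placeholders with the same dimensions as the stock-return data, not the actual dataset.

```python
# Sketch of the repeated 10-fold cross-validation used for the prediction approach.
# Note: with these dimensions the loop is computationally heavy.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

rng = np.random.default_rng(10)
n, p = 2135, 464
X = rng.standard_normal((n, p))                  # placeholder for daily stock returns
y = X[:, :25] @ rng.standard_normal(25) * 0.05 + rng.standard_normal(n)

alphas = np.logspace(-4, 0, 50)
mse_curves = []
for rep in range(50):                             # 50 repetitions with random folds
    folds = KFold(n_splits=10, shuffle=True, random_state=rep)
    cv = LassoCV(alphas=alphas, cv=folds, fit_intercept=False).fit(X, y)
    mse_curves.append(cv.mse_path_.mean(axis=1))

avg_curve = np.mean(mse_curves, axis=0)           # averaged cross-validation error curve
best_alpha = cv.alphas_[np.argmin(avg_curve)]     # alphas_ is stored in decreasing order
print("selected alpha:", best_alpha)
```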

Maturity    #Variables    %OLS Error    log(λ)
5 Years     25            0.717         -12.38
10 Years    63            0.747         -13.17
30 Years    223           0.722         -14.39

Table 1: Results using regular cross-validation to select λ. The table shows the number of selected variables, the proportion of the cross-validation prediction error relative to OLS and the log of the shrinkage parameter, for each level of maturity.

The results from the prediction accuracy analysis are reported in Table 1. From this table, it can be observed that Lasso is able to achieve a substantially lower prediction error than OLS, despite using a much more parsimonious model.

The number of selected variables increases strongly with the maturity, which suggests that the price changes of securities with a longer maturity are more difficult to capture with a small number of variables, as the daily fluctuations of individual stock prices hold little information about the long term economic state. This problem may be alleviated by choosing a larger set of variables.


9.2 Variable Selection Approach

In this section, I select the shrinkage parameter such that variable selection is emphasized and a parsimonious model is obtained.

Unfortunately, the R-package ‘pass’, which implements PASS (Sun et al., 2013), is currently not completely functional. Additionally, the R-package ‘msgps’, implementing the adjusted BIC criterion (Wang et al., 2009), does not return the cross-validation prediction errors. Hence, to select the shrinkage parameter for the variable selection approach I use the one-standard-error rule (Hastie et al., 2009) in 10-fold cross-validation, as described in Section 6, which is implemented in the package ‘glmnet’.

I again use randomly selected folds and repeat the procedure 50 times to average the cross-validation error curve. In Figure 5, the right vertical dotted line corresponds to the value for λ that selects the most parsimonious model while being at most one standard error away from the lowest prediction error.

Maturity    #Variables    %OLS Error    log(λ)
5 Years     5             0.754         -10.98
10 Years    18            0.788         -11.87
30 Years    81            0.750         -13.46

Table 2: Results using the one-standard-error rule to select λ. The table shows the number of selected variables, the proportion of the cross-validation prediction error relative to OLS and the log of the shrinkage parameter, for each level of maturity.

Comparing these with the results in Table 1, the number of selected variables is significantly lower, in exchange for slightly higher prediction errors. Compared with OLS, it is remarkable how Lasso retains a substantial reduction in the prediction error while using a very small number of variables. In the case of the 5-year maturity, only approximately 1% of the total set of available variables is used, yet the prediction error is reduced by 25% relative to the prediction error of OLS.

9.3 Analysis of Selected Companies

In this subsection, I investigate some properties of the companies whose daily stock price changes were selected by the Lasso models. To do this, I compare the industry shares and market capitalizations of the companies in the selected models with those of the entire set of companies.

Maturity        C. Discr.  C. St.  Energy  Fin.  Health  Industr.  IT    Mat.  Tel.  Util.
Prediction
  5 Yr.         0.00       0.00    0.14    0.09  0.05    0.27      0.32  0.05  0.00  0.09
  10 Yr.        0.09       0.07    0.07    0.20  0.06    0.15      0.20  0.07  0.00  0.07
  30 Yr.        0.20       0.06    0.04    0.18  0.09    0.14      0.15  0.08  0.01  0.05
Var. Sel.
  5 Yr.         0.00       0.00    0.00    0.20  0.00    0.80      0.00  0.00  0.00  0.00
  10 Yr.        0.00       0.00    0.19    0.12  0.06    0.31      0.31  0.00  0.00  0.00
  30 Yr.        0.13       0.08    0.06    0.21  0.07    0.14      0.15  0.08  0.00  0.07
All             0.17       0.08    0.09    0.17  0.10    0.14      0.13  0.06  0.01  0.06

Table 3: Division of the companies over the industries for each model and for the complete set. From left to right: Consumer Discretionary, Consumer Staples, Energy, Financials, Health Care, Industrials, Information Technology, Materials, Telecommunication Services and Utilities.

Table 3 reports the share of each industry among the companies selected by each model. The final row shows the division of the companies in the full dataset. From this table, it can be observed that the Consumer Discretionary and Consumer Staples sectors seem to be underrepresented in comparison to the full dataset. On average, only 7% of the companies in the selected models are in the Consumer Discretionary sector, compared to 17% in the complete dataset. For the Consumer Staples sector, the average is 3.5% compared to 8% in the complete dataset. On the other hand, the Industrials are strongly overrepresented, with an average of 30% of the companies in the selected models but only 14% in the complete dataset. This suggests that stock price changes of companies in the Industrials sector contain relatively much information about the daily fluctuations of the treasury securities.

Maturity      Avg. Market Cap.
Prediction
  5 Yr.       24.04
  10 Yr.      35.47
  30 Yr.      31.53
Var. Sel.
  5 Yr.       37.39
  10 Yr.      24.67
  30 Yr.      36.58
All           37.76

Table 4: Average market capitalization in billions of U.S. Dollars of the companies selected by each model and of the full set of companies.

9.4 Conclusion


References

Akaike, H., 1974. A new look at the statistical model identification. Automatic Control, IEEE Transactions on, 19(6):716–723.

Bickel, P. J., Ritov, Y., and Tsybakov, A. B., 2009. Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, pages 1705–1732.

Breiman, L., 1995. Better subset regression using the nonnegative garrote. Technometrics, 37(4):373–384.

Burnham, K. P. and Anderson, D. R., 2004. Multimodel inference: understanding AIC and BIC in model selection. Sociological Methods & Research, 33(2):261–304.

Chand, S. On tuning parameter selection of lasso-type methods-a monte carlo study. In Applied Sciences and Technology (IBCAST), 2012 9th International Bhurban Conference on, pages 120–129. IEEE, 2012.

Cohen, J. et al., 1960. A coefficient of agreement for nominal scales. Educational and psychological measurement, 20(1):37–46.

Dalalyan, A. S., Hebiri, M., and Lederer, J., 2014. On the prediction performance of the lasso. arXiv preprint arXiv:1402.1700.

Dempster, A. P., 1972. Covariance selection. Biometrics, pages 157– 175.

Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., et al., 2004. Least angle regression. The Annals of statistics, 32(2):407–499.


Frank, L. E. and Friedman, J. H., 1993. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135.

Friedman, J., Hastie, T., Höfling, H., Tibshirani, R., et al., 2007. Pathwise coordinate optimization. The Annals of Applied Statistics, 1(2):302–332.

Friedman, J., Hastie, T., and Tibshirani, R., 2008. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3): 432–441.

Friedman, J., Hastie, T., and Tibshirani, R., 2010. Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1.

Hastie, T., Tibshirani, R., and Friedman, J., 2009. The Elements of Statistical Learning. Springer.

Hebiri, M. and Lederer, J., 2013. How correlations influence lasso prediction. Information Theory, IEEE Transactions on, 59(3):1846–1854.

Hocking, R. R., 1976. A Biometrics invited paper. The analysis and selection of variables in linear regression. Biometrics, pages 1–49.

Hoerl, A. E. and Kennard, R. W., 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67.

Hofmann, M., Gatu, C., and Kontoghiorghes, E. J., 2007. Efficient algorithms for computing the best subset regression models for large-scale problems. Computational Statistics & Data Analysis, 52(1): 16–29.


Koltchinskii, V., Lounici, K., Tsybakov, A. B., et al., 2011. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. The Annals of Statistics, 39(5):2302–2329.

Leng, C., Lin, Y., and Wahba, G., 2006. A note on the lasso and related procedures in model selection. Statistica Sinica, 16(4):1273.

Mallows, C. L., 1973. Some comments on Cp. Technometrics, 15(4):661–675.

Marquardt, D. W. and Snee, R. D., 1975. Ridge regression in practice. The American Statistician, 29(1):3–20.

Meinshausen, N. and Bühlmann, P., 2006. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, pages 1436–1462.

Meinshausen, N. and Bühlmann, P., 2010. Stability selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(4):417–473.

Park, T. and Casella, G., 2008. The bayesian lasso. Journal of the American Statistical Association, 103(482):681–686.

Roberts, S. and Nowak, G., 2014. Stabilizing the lasso against cross-validation variability. Computational Statistics & Data Analysis, 70: 198–211.

Ryan, T. P., 1997. Modern regression methods. J. Wiley, New York, Chichester.


Sun, W., Wang, J., and Fang, Y., 2013. Consistent selection of tuning parameters via variable selection stability. The Journal of Machine Learning Research, 14(1):3419–3440.

Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288.

Tibshirani, R., 2011. Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3):273–282.

Tseng, P., 2001. Convergence of a block coordinate descent method for nondifferentiable minimization. Journal of optimization theory and applications, 109(3):475–494.

van de Geer, S. and Bühlmann, P., 2011. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer Series in Statistics, Springer.

van der Kooij, A. J., 2007. Prediction accuracy and stability of regression with optimal scaling transformations. Child & Family Studies and Data Theory (AGP-D), Department of Education and Child Studies, Faculty of Social and Behavioural Sciences, Leiden University.

Wang, H., Li, B., and Leng, C., 2009. Shrinkage tuning parameter selection with a diverging number of parameters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(3): 671–683.


Zhao, P. and Yu, B., 2006. On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563.

Zou, H., 2006. The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476):1418–1429.
