

Tilburg University

Three essays in econometric theory

Gan, Zhuojiong

Publication date:

2015

Document Version

Publisher's PDF, also known as Version of record

Link to publication in Tilburg University Research Portal

Citation for published version (APA):

Gan, Z. (2015). Three essays in econometric theory. CentER, Center for Economic Research.



DISSERTATION

to obtain the degree of doctor at Tilburg University, on the authority of the rector magnificus, prof.dr. E.H.L. Aarts, to be defended in public before a committee appointed by the doctorate board, in the aula of the University on Tuesday 1 September 2015 at 14:15, by Zhuojiong Gan


OTHER COMMITTEE MEMBERS:
prof.dr. B.J.M. Werker
prof.dr. D.J.C. van Dijk
dr. R. van den Akker
dr. P. Cizek


This dissertation contains my work as a PhD student at the Department of Econometrics and Operations Research at Tilburg University. It would not have been possible for me to finish it without the support of many people, to whom I would like to express my gratitude.


and for their help on the job market. I am also grateful to the rest of my committee: Maurice Bun, Pavel Cizek and Dick van Dijk. This dissertation benefited greatly from their suggestions and insights.

I am thankful to many colleagues and friends in our department. I have always enjoyed the discussions with them; they never failed to inspire me with their innovative thoughts: Jaap Abbring, Alaa Abi Morshed, Juanjuan Cai, Yi He, Jan Kabátek, Tobias Klein, Kamlesh Kumar, Jinghua Lei, Bertrand Melenberg, Renata Rabovič, Mario Rothfelder, Serhan Sadikoglu, Martin Salm, Takamasa Suzuki, Yifan Yu, Bo Zhou, and many others.

Throughout my six years in Tilburg, I came to know many people who have made my life enjoyable and full of surprises: Yufeng Huang, Kan Ji, Xue Jia, Shuai Kou, Hong Li, Zhengyu Li, Bowen Luo, Geng Niu, Yuanhao Sun, Jan Tilly, Ruixin Wang, Wendun Wang, Yun Wang, and Yachang Zeng. I am also grateful to my friends in Tilsac, especially to Leo Buysse and Annelies van den Bogaard; your compassion and encouragement added so much to the fun and enjoyment of climbing.

Finally, I would like to express my gratitude to my parents and my wife for their understanding and support. I have been away from my hometown for almost nine years, and my parents have always been supportive of my decision to pursue an academic career. I can feel their love and care whenever I am in need. I would also like to thank my wife, Keyan, who is always by my side, and with whom I can share my happiness and sorrows. You light my way.


Contents

1 Introduction
2 Factor-augmented Prediction with Idiosyncratic Factors
   2.1 Introduction
   2.2 Model and Preliminaries
   2.3 Assumptions
   2.4 Estimation and Prediction
      2.4.1 Factor Estimation
      2.4.2 Prediction
   2.5 Asymptotic Theory
      2.5.1 Known $k_0$ Case
      2.5.2 Consistent Estimation of $k_0$ and $I_0$
   2.6 Simulations
   2.7 Empirical Application
   2.8 Conclusions
   2.A Proof of theorems and lemmas
      2.A.1 Proof of Proposition 1
      2.A.4 Proof of Corollary 5
      2.A.5 Proof of Theorem 6
   2.B Miscellaneous
      2.B.1 Example of models that satisfy Assumption 3
      2.B.2 Derivation of equation (2.3)
      2.B.3 $\{\varepsilon_{t+h} e_{it}\}_{t=1}^{\infty}$ is a martingale difference for $i \notin I_0$
      2.B.4 Specification for the variables in Table 2.13
3 Testing for Central Symmetry
   3.1 Introduction
   3.2 Main Result
   3.3 Simulation Study
   3.4 Proofs
4 Break Point Estimation in Fixed Effects Panel Data
   4.1 Introduction
   4.2 Model
   4.3 Break Point Estimator
   4.4 PLS Estimators
   4.5 Slope Estimators and Their Asymptotic Properties
      4.5.1 Estimators
      4.5.2 Asymptotic distributions of proposed estimators
      4.5.3 Comparison of coefficient estimators
   4.6 Simulation Study
      4.A.1 Proof of Lemma 1
      4.A.2 Proof of Theorem 1
      4.A.3 Proof of Theorem 2
      4.A.4 Proof of Theorem 3
      4.A.5 Proof of Theorem 4
      4.A.6 Proof of Theorem 5

1 Introduction

This dissertation consists of three essays in econometric theory. In the first essay we explore a predictive model based on factor analysis. In the second essay we study tests for central symmetry of bivariate distributions. In the third essay we are interested in the estimation of structural breaks and slopes in panel data with individual-specific fixed effects.


the performance of our method for predicting inflation, and compare our prediction with that of a few standard methods.


2 Factor-augmented Prediction with Idiosyncratic Factors

2.1 Introduction

In recent years, economists have faced the challenge of prediction in the presence of a large number of variables. One key econometric advance is the use of factor models, which have proven instrumental for the accurate prediction of economic variables in many areas of interest.

Factor analysis is nowadays the predominant framework in financial and macroeconomic forecasting with many predictors. In finance, factors are at the very foundation of the arbitrage pricing theory - see e.g. Ross (1976) and Chamberlain and Rothschild (1983). They have also been used to analyze the risk-return relationship in Ludvigson and Ng (2007) and bond risk premia in Ludvigson and Ng (2009), to mention just a few financial applications.


production, Giannone et al. (2004) for tracking monetary policy in real time, Cristadoro et al. (2005) for predicting euro-area inflation, Bernanke et al. (2005) for using a factor-augmented VAR to identify monetary policy transmission mechanisms, Altissimo et al. (2010) for predicting economic growth, and Banbura and Modugno (2014) for nowcasting European GDP.

The use of factor models is not limited to finance and macroeconomics. They are also used in microeconomics for identification, estimation and prediction purposes. For example, Lewbel (1991) uses factors to analyze consumer behavior at the theoretical and empirical level. Carneiro et al. (2003) use factors to identify and estimate counterfactual distributions and to measure the effect of uncertainty on schooling choices. Cunha et al. (2010) generalize the approach in Carneiro et al. (2003) to nonlinear dynamic factor models, with the purpose of estimating the production technology of cognitive and non-cognitive skill formation.

In this paper, we focus on linear factor models. With a few exceptions, the models in this literature are approximate factor models, a terminology popularized by Chamberlain and Rothschild (1983). Approximate factor models are models in which the factors can be estimated via principal components methods, because the errors in the model are only weakly correlated. Further, we consider only stationary factors in this paper; the treatment of integrated factor processes can be found in, e.g., Banerjee et al. (2014), where factor-augmented error correction models are developed.


true factors, up to a rotation, so that their estimation error is asymptotically negligible when used for time-series prediction. They also derive asymptotic confidence intervals for the conditional mean of their prediction. Forni et al. (2005) propose generalized factor models to improve the efficiency of factor estimators.

Most papers concerned with factor-augmented prediction - mentioned above - proceed in two steps: first, they extract the common factors from a large set of predictors with methods such as principal components; second, they use these factors for prediction. Such methods are in general not efficient, for at least two reasons.

The first reason is that the common factors extracted using principal components may not have good predictive power for the series of interest, as pointed out in Bai and Ng (2008, 2009). This problem is usually tackled by finding the common variation in X that is most relevant for predicting y. For example, Bai and Ng (2008) select the variables from which the common factors are extracted via hard and soft thresholding rules. Heij et al. (2008) combine factor estimation and prediction into a single step. Kelly and Pruitt (2015) propose a three-pass regression filter to extract, among the common factors, only those relevant for prediction. Cheng and Hansen (2015) propose combining forecasts from different models that vary in the number of factors and lags of the variable of interest.

The second reason for predictive inefficiency is that current methods disregard the fact that some predictors may exhibit idiosyncratic variation, on top of the common factor variation, that is relevant for prediction. For example, suppose we want to predict monthly unemployment in some US state. The usual approach would be to extract, in the first step, common factors from a large set of state-level and national-level variables, and use these common factors as predictors in the second step. This approach will not always be efficient, because there may be (a small number of) predictors, such as employee union densities or previous unemployment claims - see Hagedorn et al. (2013) - that have relevant idiosyncratic variation on top of their co-movement with other predictors. This idiosyncratic variation may reflect variation in state-specific laws or policies that does not show up in the common factors, but is still relevant for unemployment prediction.


second source has not received much attention. Methods that can incorporate relevant idiosyncratic information are limited. One possibility, considered by Groen and Kapetanios (2009), is to use partial least squares (PLS), where the factors are extracted in relation to the series of interest. Using this method it is possible, although not shown theoretically in Groen and Kapetanios (2009), that the extracted factors incorporate information from both the common and the idiosyncratic variation.

In this paper we try to solve the second problem. To our knowledge, this is the first paper that explicitly models the idiosyncratic variation as a source of predictive information (besides PLS, which has the potential to incorporate idiosyncratic variation but has not been explicitly shown to do so). We consider a data generating process in which the idiosyncratic variation is related to the series of interest, so that prediction efficiency¹ can be improved.

First, in the classical statistical terminology of factor analysis, a large set of variables (predictors) can be decomposed into two parts: common factors and idiosyncratic factors - see also Forni et al. (2005). We use this terminology throughout the paper. The idiosyncratic factors are generally mean-zero variables, uncorrelated with the common factors; if they are ignored, they will in general not bias the prediction asymptotically, but they will reduce its precision. Our simulations show that the efficiency loss from ignoring the idiosyncratic factors may be substantial.

The challenge is both how to use the idiosyncratic factors to improve prediction, and how to identify which variables, if any, possess relevant idiosyncratic factors.

If the set of relevant idiosyncratic factors were known, our method would estimate them and use them as additional regressors, jointly with the common factors, in the prediction equation. Asymptotically, we do not introduce bias, because we are adding mean-zero idiosyncratic factors that are uncorrelated with the common factors. However, we reduce the prediction error by taking into account the correlation of the idiosyncratic factors with the variable of interest. Our correction is chosen to minimize the prediction mean-squared error, and in this sense it is by construction more precise than the usual factor-based prediction.

¹ Throughout this paper, we define the term efficient prediction to mean the prediction that minimizes the mean-squared prediction error.
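To see why the correction helps, a short back-of-the-envelope calculation (our own illustration; it uses the population quantities $w_i = E(e_{it}\varepsilon_{t+h})$ and $\sigma_i^2 = E(e_{it}^2)$ that are formally introduced in Sections 2.3-2.4): augmenting the prediction with the term $w_i \sigma_i^{-2} e_{it}$ changes the prediction error variance to

$$E\big(\varepsilon_{t+h} - w_i\sigma_i^{-2}e_{it}\big)^2 = \sigma_\varepsilon^2 - 2\,\frac{w_i^2}{\sigma_i^2} + \frac{w_i^2}{\sigma_i^2} = \sigma_\varepsilon^2 - \frac{w_i^2}{\sigma_i^2},$$

so each relevant idiosyncratic factor lowers the mean-squared prediction error by $w_i^2/\sigma_i^2$, which is exactly the quantity that the selection criterion in Section 2.4.2 ranks.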


In reality, researchers do not necessarily know which variables, if any, are more relevant for prediction than others; particularly with large datasets, they are difficult to pinpoint. In this paper, we propose a new selection method to identify both the number $k_0$ and the set $I_0$ of the relevant idiosyncratic factors. We do not assume the fixed number $k_0$ to be known; that $k_0$ is fixed means, in practice, that the number of relevant idiosyncratic factors is small relative to the sample size. This is a reasonable assumption: if many idiosyncratic factors were relevant, they would tend to co-move, in which case they could no longer be considered idiosyncratic factors, but rather common factors.

First, for each $k$, we identify the set of idiosyncratic factors that minimizes the mean-squared error of our forecast, and call it $\hat I_k$. Next, we select the number $k$ via a non-standard penalized mean-squared error criterion, evaluated at $\hat I_k$.² We show that our method selects both the true $k_0$ and the true $I_0$ with probability tending to one as $N, T \to \infty$, as long as $N^5/T^4 \to 0$. This requirement does not seem too restrictive in macroeconomic applications, where the numbers of predictors and time periods are usually comparable.³

Our method also identifies the case $k = 0$, in which there are no relevant idiosyncratic factors. Moreover, it is computationally efficient because it does not search over all subsets of the predictors; rather, it involves sorting the predictors according to their explanatory power. Thus our method, while similar to forward stepwise selection methods, is even more computationally efficient.⁴ Our theoretical results show that our method identifies the true set of relevant idiosyncratic factors with probability tending to one as $N, T$ tend to infinity.

Our simulation results show that when there are relevant idiosyncratic factors, our method not only detects their true number and their true set accurately, but also outperforms Bai and Ng (2006) by delivering lower out-of-sample prediction mean-squared error for the variable of interest.

The rest of the paper is organized as follows. In Section 2 we specify our model, give examples, and explain our choice of prediction correction. In Section 3 we present our assumptions. Section 4 explains our estimation and prediction methods heuristically, and presents an efficient computational algorithm to estimate our model. In Section 5, we derive asymptotic results for our method. We study the finite-sample properties of our method via simulations in Section 6. In Section 7 we present the empirical application. Section 8 concludes. All proofs are relegated to the Mathematical Appendix.

² Our criterion resembles a Mallows criterion but also has important differences; see Sections 5.2 and 6.
³ Our simulations show that our method also works in cases where N is bigger than T.
⁴ Forward stepwise selection methods are known to be computationally superior to all-subsets methods.

2.2 Model and Preliminaries

In specifying the model used for prediction, we closely follow Bai and Ng (2002) and Bai and Ng (2006). Denote by $x_{it}$ the available scalar predictors, where $i = 1, \dots, N$ and $t = 1, \dots, T$. These predictors are assumed to have an approximate factor structure with a fixed number of factors $r$, that is,

$$x_{it} = f_t' \lambda_i + e_{it}, \qquad (2.1)$$

where $f_t$ and $\lambda_i$ are $r \times 1$ vectors of factors and factor loadings respectively, independent of the errors $e_{it}$. The factors and their loadings are unobserved, and can be estimated from the data via least-squares methods. The errors $e_{it}$ are the unobserved idiosyncratic factors: regressor-specific variation that is not correlated with the common factors. For simplicity, we recast the factor model in matrix form:

$$\underset{(N\times 1)}{x_t} = \underset{(N\times r)}{\Lambda}\ \underset{(r\times 1)}{f_t} + \underset{(N\times 1)}{e_t},$$

where $x_t = (x_{1t}, \dots, x_{Nt})'$, $e_t = (e_{1t}, \dots, e_{Nt})'$, and $\Lambda = (\lambda_1, \dots, \lambda_N)'$; or, stacking over time,

$$\underset{(T\times N)}{X} = \underset{(T\times r)}{F}\ \underset{(r\times N)}{\Lambda'} + \underset{(T\times N)}{e},$$

with $X = (x_1, \dots, x_T)'$ and $F = (f_1, \dots, f_T)'$.

Assume that we observe a scalar dependent variable $y_t$ for $t = 1, \dots, T$, and are interested in predicting $y_{T+h}$ for some horizon $h > 0$. We use the prediction model in Bai and Ng (2006), which is the direct approach to factor-augmented prediction:

$$y_{t+h} = f_t' \gamma + \varepsilon_{t+h}. \qquad (2.2)$$

For simplicity we do not include additional exogenous regressors in (2.2).⁵ Equation (2.2) is widely used as the prediction equation in this literature. It is well suited for many economic applications, where the series of interest depends on past overall economic conditions, represented by the common factors $f_t$.

⁵ When there are additional regressors, the prediction is called factor-augmented.

There are at least two alternative ways to model the prediction equation. The first alternative is mentioned in Stock and Watson (2011), where $y_{t+1}$ depends on $f_t$: $y_{t+1} = f_t'\gamma + \varepsilon_{t+1}$, and $f_t$ follows a VAR process. The $h$-step-ahead forecast of $y_{T+h}$ is constructed as $\hat y_{T+h} = \tilde f_{T+h-1}'\gamma$, where $\tilde f_{T+h-1}$ is a forecast of $f_{T+h-1}$ at time $T$ using the VAR process. It is pointed out in Stock and Watson (2011) that both equation (2.2) and this alternative are theoretically viable, and empirical papers that compare the two models reach mixed results. The other alternative is a linear model in which $y_{t+h}$ depends on only a limited number of the $x_{it}$'s. In this case, it is desirable that only a few "important" regressors enter the prediction equation, rather than the common variation summarized from all variables. Alternative methods that search over all $x_{it}$'s for a few useful predictors can then be more effective; examples include the Least Absolute Shrinkage and Selection Operator (LASSO) and forward stepwise selection. However, we stick to equation (2.2) as the prediction equation in this paper because of its relevance for many economic applications.

The novelty of our paper is that we allow for correlation between $\varepsilon_{t+h}$ and a finite set of the $e_{it}$'s, for $i \in I_{k_0} \equiv I_0$, the true set of relevant predictors, of fixed cardinality $k_0$. This means that there is correlation between the variable of interest at time $t + h$ and shocks, or idiosyncratic factors, at time $t$.

Example 1. Suppose that we want to predict the unemployment rate $y_{t+h}$. This can be done by extracting common factors $f_t$ from a large set of macroeconomic variables $x_{it}$, including worker unions, unemployment benefits, and unemployment claims - see Hagedorn et al. (2013). However, the latter three variables may exhibit political regime variation that is not well represented by the common factors. The variation in these $e_{it}$'s is thus idiosyncratic and correlated with the unemployment rate $y_{t+h}$.

Example 2. Suppose one is interested in predicting individual wages $y_j$ (here, $j$ indexes individuals rather than time, and prediction is done in-sample). There are many variables $x_{ij}$, including past education, family background, family education, IQ, and cognitive and non-cognitive personality traits, that may help predict the labor market outcome. However, extracting the common factors $f_j$ and using these common factors for prediction is not sufficient, as argued in Piatek and Pinger (2010). Personality traits are usually measured with error, and this error seems to be related to the locus of control, or how much a person feels in control of their own life at a certain moment. This locus of control may be reflected in a few idiosyncratic factors $e_{ij}$ (that is, for some $i$'s), and these may determine negotiation skills and therefore the wage as well. Piatek and Pinger (2010) propose using the pre-market locus of control to predict wages. Our method would choose the personality trait variables that yield relevant idiosyncratic factors, and correct for this in the wage prediction equation.

We propose estimating the models in (2.1) and (2.2), as usual, in two steps. In accordance with the literature, we use $T$ observations, from $t = 1, \dots, T$, to estimate $f_t$, and $T - h$ observations, from $h + 1$ to $T$, to estimate $\gamma$. The resulting estimates for $f_t$ and $\gamma$ are consistent under certain assumptions.⁶ Then, we predict $y_{t+h}$ based on both $f_t$ and the relevant idiosyncratic factors $e_{it}$, which we estimate from the data. We show in Section 2.4 that our proposed correction is chosen so as to directly minimize the mean-squared error of our prediction of $y_t$ in the sample $h + 1, \dots, T$. Thus, we minimize the average $h$-step-ahead squared prediction error over all observations in the sample.

Two alternative approaches come to mind. First, if variable $i$ is relevant for prediction, it is tempting to use $x_{it}$ directly in the prediction equation, rather than correcting for the correlation of $e_{it}$ with $\varepsilon_{t+h}$. However, this is not desirable, because the $x_{it}$'s are highly correlated, and selecting the relevant ones would result in a much more computationally complex method than the one we propose in Section 2.4. Second, one may think of jointly estimating (2.1) and (2.2). Methods such as principal covariate regression, proposed in Heij et al. (2007), explore this possibility by weighting the residuals from estimating (2.1) and (2.2) in the prediction criterion. This procedure balances the explanatory power of the factors between $x$ and $y$. The weight used in this method is chosen by the user. However, in general, it is not clear that such an approach minimizes the prediction error in $y_{t+h}$, because the objective function used in estimating a large multivariate model is not targeted at minimizing the prediction error of the single equation (2.2).

By contrast, our method directly minimizes the prediction error, and in that sense is superior to both alternatives described above.

⁶ Finite-sample performance of the principal components estimator can be improved if we use the

2.3 Assumptions

We present some notation before we proceed.

Notation: Throughout this paper, $\|\cdot\|$ (without subscript) denotes norms: for vectors, the Euclidean norm; for matrices, the matrix norm induced by the Euclidean norm, that is, a real matrix $A$ has norm $\|A\| = \sqrt{\text{largest eigenvalue of } A'A}$. We use the notation $\|\cdot\|_p$ (with subscript $p$) for the $L_p$ norm of a random variable or vector. $I_k$ denotes an arbitrary index set with $k$ elements, a subset of $\{1, 2, \dots, N\}$. Throughout this paper, $M$ denotes a generic constant that can differ according to the context.

We make the following assumptions:

Assumption 1. The factor process satisfies $\sup_{t\in\mathbb{N}} E\|f_t\|^4 \le M$ for some $M > 0$, and $\frac{1}{T}\sum_{t=1}^{T} f_t f_t' \xrightarrow{p} \Sigma_F > 0$ as $T \to \infty$ for some positive definite matrix $\Sigma_F$.

Assumption 2. The factor loading $\lambda_i$ is either deterministic such that $\sup_{i\in\mathbb{N}} \|\lambda_i\| \le M$ for some $M > 0$, or stochastic such that $\sup_{i\in\mathbb{N}} E\|\lambda_i\|^4 \le M$. In either case, $\frac{1}{N}\Lambda'\Lambda \xrightarrow{p} \Sigma_\Lambda > 0$, $\Sigma_\Lambda$ being an $r \times r$ non-random matrix, as $N \to \infty$.

Assumption 3.

(ii) $E[e_{it}e_{js}] = \sigma_{ij,ts}$ exists, with $|\sigma_{ij,ts}| \le \bar\sigma_{ij}$ for all $(t, s)$ and $|\sigma_{ij,ts}| \le \tau_{ts}$ for all $(i, j)$, such that $\frac{1}{N}\sum_{i,j=1}^{N} \bar\sigma_{ij} \le M$, $\frac{1}{T}\sum_{t,s=1}^{T} \tau_{ts} \le M$, and $\frac{1}{NT}\sum_{i,j,t,s} |\sigma_{ij,ts}| \le M$.

(iii) For every $(t, s)$, $E\left| N^{-1/2} \sum_{i=1}^{N} (e_{is}e_{it} - E[e_{is}e_{it}]) \right|^4 \le M$.

(iv) For each $t$, $N^{-1/2}\sum_{i=1}^{N} \lambda_i e_{it} \xrightarrow{d} N(0, \Gamma_t)$, where $\Gamma_t = \lim_{N\to\infty} N^{-1} \sum_{i=1}^{N} \sum_{j=1}^{N} E[\lambda_i \lambda_j' e_{it} e_{jt}]$.

Assumption 4. The variables $\{\lambda_i\}$, $\{f_t\}$, and $\{e_{it}\}$ form three mutually independent groups. Dependence within each group is allowed.

Assumption 5. $E[\varepsilon_{t+h} \mid y_t, f_t, y_{t-1}, f_{t-1}, \dots] = 0$ for any $h > 0$. $f_t$ is independent of $e_{is}$ for all $i, s$, and $\varepsilon_t$ is independent of $e_{is}$ for $i \notin I_0$ or $t \ne s + h$. Moreover, $\frac{1}{\sqrt{T-h}}\sum_{t=1}^{T-h} f_t \varepsilon_{t+h} \xrightarrow{d} N(0, \Sigma_{ff,\varepsilon})$, where $\Sigma_{ff,\varepsilon} = \operatorname{plim} \frac{1}{T-h}\sum_{t=1}^{T-h} \varepsilon_{t+h}^2 f_t f_t' > 0$; and $\frac{1}{\sqrt{T}}\sum_{t=1}^{T} \varepsilon_{t+h} \xrightarrow{d} N(0, \sigma_\varepsilon^2)$, where $\sigma_\varepsilon^2 = \operatorname{plim} \frac{1}{T}\sum_{t=1}^{T} \varepsilon_{t+h}^2$. We further assume that $E|\varepsilon_{t+h}|^\delta < M$ for some $\delta \ge 2$.

Assumption 6. $\frac{1}{T-h}\sum_{t=1}^{T-h} \varepsilon_{t+h} e_{it} \xrightarrow{p} w_i$ for all $i$. Furthermore, $w_i \ne 0$ for $i \in I_0$, and $w_i = 0$ for $i \notin I_0$, where $I_0$ is a finite index set with $k_0$ elements.

Assumption 7. (i) $\{e_{it}\}_{t=1}^{\infty}$ is independent over $i$. (ii) $E\left| \frac{1}{\sqrt{T}} \sum_{t=1}^{T} (e_{it}^2 - \sigma_i^2) \right|^2 < M$, with $\inf_{i\ge 1} \sigma_i^2 > C$ for some constant $C > 0$.

Assumptions 1-4 are identical to Assumptions A-D in Bai and Ng (2006). Assumptions 1 and 2 are standard in factor models. Assumption 3 characterizes the time-series and cross-sectional dependence allowed in the idiosyncratic factors. Under the first four assumptions, the model is said to have an approximate factor structure, and, on average, the estimated factors have a small deviation from the true factors, up to a rotation (see Lemma A.1 in Bai and Ng (2006), or Lemma 2.A.1(i) in this paper). Assumption 4 can be weakened (see Assumption D in Bai and Ng (2002)). Assumption 5 corresponds to Assumption E in Bai and Ng (2006), but we do not assume full independence between $\varepsilon_t$ and $e_{is}$, so as to allow for relevant idiosyncratic factors. Furthermore, we add to Assumption E in Bai and Ng (2006) the requirement that $E|\varepsilon_{t+h}|^\delta < M$ for some $\delta \ge 2$; as discussed below, this moment condition influences the size of our penalty term.


Assumption 6 is the key assumption of this paper. In this assumption, we generalize the model to allow for dependence between $\varepsilon_{t+h}$ and some of the idiosyncratic factors $e_{it}$. Prediction precision can thus be improved by incorporating these idiosyncratic factors as explanatory variables in the prediction equation. Assumption 6 is natural from both a theoretical and an empirical point of view. Theoretically, if we consider the vector $(x_t', y_{t+h})'$ to have an approximate factor structure, then $\varepsilon_{t+h}$ and $e_{it}$ can be correlated. Empirically, the correlation between some $e_{it}$'s and $\varepsilon_{t+h}$ can be interpreted as co-movement between relevant predictors and $y_{t+h}$. These co-movements concern only a small number of variables; they are not pervasive in the set of predictors. If a large number of variables were all relevant for prediction, they would likely be highly correlated among themselves, which would result in common factors rather than idiosyncratic factors. Thus, in our setting, assuming that the number of relevant idiosyncratic factors $k_0$ is fixed is the most reasonable assumption.

Assumption 7(i) can be viewed as an identification condition for $k_0$ and $I_0$. Under this assumption, at least asymptotically, an idiosyncratic factor can have predictive power if and only if $i \in I_0$, that is, if and only if $e_{it}$ is correlated with $\varepsilon_{t+h}$. Under Assumption 7(i), together with Assumption 3(ii), only dependence over time is allowed for the idiosyncratic factors; thus our model is more restrictive than the approximate factor model in Bai and Ng (2002). However, with this assumption we can derive an efficient algorithm, while violation of this assumption is not very costly for our method, as we show in the simulation section. We postpone further discussion of this assumption to the end of Section 2.4, because the prediction procedure needs to be described before the role of this assumption can be understood.


2.4 Estimation and Prediction

The prediction problem is composed of three steps. First, the factors and the factor loadings are estimated using the principal components method. Second, the model in equation (2.2) is estimated to obtain a consistent estimator of $\gamma$. Third, we use a penalized model selection approach to find the optimal predictor set.

2.4.1 Factor Estimation

For simplicity, in this paper we take the number of factors $r$ as given. However, $r$ can be consistently estimated when $N$ and $T$ are large via direct application of the Bai and Ng (2002) information criterion.

As proposed in Bai and Ng (2002) and Bai and Ng (2006) for approximate factor models, we estimate the factors in (2.1) by minimizing the objective function

$$\min_{\Lambda, F} \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} (x_{it} - f_t' \lambda_i)^2,$$

subject to the normalization $F'F/T = I_r$. The estimated factors, denoted $\hat F = (\hat f_1, \dots, \hat f_T)'$, consist of the $r$ eigenvectors associated with the $r$ largest eigenvalues of the matrix $XX'/(NT)$. This is the least-squares principal components method, and it is the most widely used method to estimate approximate factor models; see also Stock and Watson (2002a). Another way to estimate the factors, different from the method considered in this paper, is maximum likelihood; the resulting factor estimator is more efficient, see Bai and Li (2012).⁷ Let $\hat\Lambda = (\hat\lambda_1, \dots, \hat\lambda_N)'$ be the estimated factor loadings. Then, by construction, $\hat\Lambda = X'\hat F/T$.

⁷ The assumptions imposed in Bai and Li (2012) are close to those in this paper. In their paper, $e_{it}$ is assumed to be independent both cross-sectionally and over time; thus their assumptions on the idiosyncratic factors are stronger than ours. One difference is that in their paper $f_t$ is assumed to be deterministic, but it is stated that the result continues to hold if $f_t$ is random and independent of all other variables. Thus, under our assumptions, the maximum likelihood estimator still has the properties stated in Bai and Li (2012). However, Theorem 5.1 in Bai and Li (2012) might suggest a different rate of convergence of $\hat f_t$, which may affect the final form of our penalty term in Theorem 6 if their estimator were used.
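As an illustration, here is a minimal numpy sketch of the least-squares principal components estimator described above (our own code, not the paper's; it assumes the data matrix is arranged as $T \times N$, and the function name is ours):

```python
import numpy as np

def estimate_factors(X, r):
    """Least-squares principal components estimation of X = F Lambda' + e.

    X : (T, N) array of predictors; r : number of factors, taken as given.
    Returns Fhat (T, r), normalized so that Fhat' Fhat / T = I_r,
    and Lhat (N, r) with Lhat = X' Fhat / T.
    """
    T, N = X.shape
    # Eigen-decomposition of X X' / (N T); eigh returns ascending eigenvalues.
    eigval, eigvec = np.linalg.eigh(X @ X.T / (N * T))
    top = np.argsort(eigval)[::-1][:r]
    # Scale by sqrt(T) so that the normalization F'F/T = I_r holds.
    Fhat = np.sqrt(T) * eigvec[:, top]
    # Loadings by construction: Lambda_hat = X' Fhat / T.
    Lhat = X.T @ Fhat / T
    return Fhat, Lhat
```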


2.4.2 Prediction

In Bai and Ng (2006), and in most empirical applications (see for example Stock and Watson (2002b) and Ludvigson and Ng (2009)), $y_{T+h}$ is predicted by $\hat f_T' \hat\gamma$. When there is correlation between $e_{it}$ and $\varepsilon_{t+h}$, such a prediction is not efficient, because it uses only information that is common to all regressors, and not idiosyncratic information that is also relevant for prediction. To find a more efficient prediction method, in the sense of minimizing the mean-squared prediction error (MSE), we augment the prediction formula to take the potential correlation into account.

For example, if a variable $i$ has a relevant idiosyncratic factor, with $E(e_{it}\varepsilon_{t+h}) = w_i \ne 0$, then we can predict $y_{t+h}$ by regressing it on both $\hat f_t$ and $\hat e_{it} = x_{it} - \hat f_t' \hat\lambda_i$. Our correction has the flavor of a generalized least-squares correction, as used by Goldberger (1962) for best linear unbiased prediction in a generalized linear model framework.⁸

⁸ Alternatively, one could proceed without this correction and use the usual $\hat y_{t+h}$, but with adjusted

The challenge is that neither the size $k_0$ nor the composition of the true set of idiosyncratic factors, $I_0$, is known, and yet the number of potential predictors ($N$) is large. If we search over all possible subsets of potential predictors, the problem quickly becomes computationally intractable, since the number of subsets is of order $2^N$. To solve this problem, we provide an alternative selection method that is much faster, and that borrows elements from forward selection methods for regressors in a linear model.

We predict $y_{T+h}$ with the following steps (a code sketch follows the list):

(1) Regress $y_{t+h}$ on $\hat f_t$ to get $\hat\gamma$, as in Bai and Ng (2006);

(2) Estimate $w_i$ and $\sigma_i^2$ by $\hat w_i = \frac{1}{T-h}\sum_{t=1}^{T-h} \hat\varepsilon_{t+h}\hat e_{it}$ and $\hat\sigma_i^2 = \frac{1}{T}\sum_{t=1}^{T} \hat e_{it}^2$ respectively, where $\hat\varepsilon_{t+h} = y_{t+h} - \hat f_t'\hat\gamma$ and $\hat e_{it} = x_{it} - \hat f_t'\hat\lambda_i$ are the residuals from the previous steps;

(3) For a given $k$, sort the $\hat w_i^2 \hat\sigma_i^{-2}$ over $i$, and let $\hat I_k$ be the collection of indices corresponding to the $k$ largest $\hat w_i^2 \hat\sigma_i^{-2}$. It can be seen that $\hat I_k$ is also the minimizer of the following objective function:

$$\min_{I_k} M_T(I_k) = \min_{I_k} \left( \frac{1}{T-h} \sum_{t=1}^{T-h} \hat\varepsilon_{t+h}^2 - \sum_{i \in I_k} \hat w_i^2 \hat\sigma_i^{-2} \right).$$

A detailed discussion of this choice of objective function (the prediction mean-squared error⁹) can be found below.

(4) A penalty term on $k$ is introduced to avoid over-fitting:

$$\min_k \left[ M_T(\hat I_k) + k\, D_{NT}\, g(N, T) \right],$$

where $D_{NT}$ is an $O_p(1)$ scaling quantity that does not depend on $k$.¹⁰ We call the minimizer of the penalized criterion above $\hat k$. The corresponding estimated index set is denoted $\hat I := \hat I_{\hat k}$.

(5) Finally, we predict $y_{T+h}$ using the predictor set $\{\hat f_t\} \cup \{\hat e_{it}, i \in \hat I\}$ by least squares.
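The following Python sketch strings steps (1)-(5) together (our own illustration, not the paper's code; it reuses the estimate_factors sketch from Section 2.4.1, sets $D_{NT} = M_T(\hat I_{k_{\max}})$ as a stand-in for the $T^{-1}SSR(\hat I_{k_{\max}})$ choice discussed in the simulation section, and, unlike the paper, bounds the search by k_max for simplicity):

```python
import numpy as np

def predict_cf_if(X, y, r, h, g_NT, k_max=10):
    """CF-IF prediction of y_{T+h}, following steps (1)-(5).

    X : (T, N) predictors; y : (T,) series of interest; r : number of
    common factors; h : forecast horizon; g_NT : penalty value g(N, T).
    """
    T, N = X.shape
    Fhat, Lhat = estimate_factors(X, r)
    F0, yh = Fhat[:T - h], y[h:]                 # pairs (f_t, y_{t+h})
    # Step (1): gamma_hat from regressing y_{t+h} on fhat_t.
    gamma = np.linalg.lstsq(F0, yh, rcond=None)[0]
    eps = yh - F0 @ gamma                        # eps_hat_{t+h}
    E = X - Fhat @ Lhat.T                        # e_hat_{it}
    # Step (2): w_hat_i and sigma_hat_i^2.
    w = E[:T - h].T @ eps / (T - h)
    sig2 = (E ** 2).mean(axis=0)
    # Step (3): rank predictors by w_hat_i^2 / sigma_hat_i^2.
    score = w ** 2 / sig2
    order = np.argsort(score)[::-1]
    M0 = (eps ** 2).mean()                       # M_T(empty set)
    D_NT = M0 - score[order[:k_max]].sum()       # ~ M_T(I_hat_{k_max})
    # Step (4): penalized criterion over k = 0, ..., k_max.
    crit = [M0 - score[order[:k]].sum() + k * D_NT * g_NT
            for k in range(k_max + 1)]
    k_hat = int(np.argmin(crit))
    I_hat = order[:k_hat]
    # Step (5): least squares on {fhat_t} union {e_hat_{it}, i in I_hat}.
    Z = np.column_stack([F0, E[:T - h][:, I_hat]])
    beta = np.linalg.lstsq(Z, yh, rcond=None)[0]
    zT = np.concatenate([Fhat[-1], E[-1, I_hat]])
    return float(zT @ beta), k_hat, I_hat
```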

Our method is similar to, but simpler than, forward selection in linear regression: for each $k$ we pick the set of regressors $\hat e_{it}$ that are most correlated with the variable of interest and record it, and we then pick, among all $k$, the recorded set that minimizes the penalized prediction MSE.¹¹ Also, as in forward selection, we order the $\hat e_{it}$'s by their explanatory power for $y_{t+h}$, from high to low, and for each $k$, $\hat I_k$ corresponds to the $\hat e_{it}$'s with the $k$ largest values of $\hat w_i^2/\hat\sigma_i^2$.

Note that because of Assumption 7(i), the $e_{it}$ are independent over $i$, and $\hat w_i/\hat\sigma_i^2$ is in fact the least-squares coefficient from a regression of $y_{t+h} - \hat f_t'\hat\gamma$ on $\hat e_{it}$. This justifies first estimating the model in Steps (1) and (2) without a correction, and then applying the correction for the idiosyncratic factors $\hat e_{it}$.

We now discuss the criterion function in Step (3). Note that $\hat w_i^2 \hat\sigma_i^{-2}$ is the change in the prediction MSE, or the sum of squared residuals (SSR) of the prediction divided by $T - h$, when $\hat e_{it}$ is used as a predictor in addition to $\hat f_t$. If we write $SSR(I_k)$ for the SSR when the predictor set is $\{\hat f_t\} \cup \{\hat e_{it}, i \in I_k\}$, and $SSR(\emptyset)$ when the predictor set is $\{\hat f_t\}$, then it follows from Appendix 2.B.2 that:

$$\frac{1}{T-h}\big(SSR(\emptyset) - SSR(\{i\})\big) = \frac{1}{T-h}\sum_{t=1}^{T-h} \hat\varepsilon_{t+h}^2 - \frac{1}{T-h}\sum_{t=1}^{T-h} \big(\hat\varepsilon_{t+h} - \hat w_i \hat\sigma_i^{-2} \hat e_{it}\big)^2 = \hat w_i^2 \hat\sigma_i^{-2}. \qquad (2.3)$$

Moreover, for $i \ne j$,

$$\frac{1}{T-h}\big(SSR(\{j\}) - SSR(\{i, j\})\big) = \frac{1}{T-h}\sum_{t=1}^{T-h} \big(\hat\varepsilon_{t+h} - \hat w_j \hat\sigma_j^{-2} \hat e_{jt}\big)^2 - \frac{1}{T-h}\sum_{t=1}^{T-h} \big(\hat\varepsilon_{t+h} - \hat w_i \hat\sigma_i^{-2} \hat e_{it} - \hat w_j \hat\sigma_j^{-2} \hat e_{jt}\big)^2 = \hat w_i^2 \hat\sigma_i^{-2} - 2 \hat w_i \hat w_j \hat\sigma_i^{-2} \hat\sigma_j^{-2} \frac{1}{T-h}\sum_{t=1}^{T-h} \hat e_{it} \hat e_{jt}.$$

The term $\frac{1}{T-h}\sum_{t=1}^{T-h} \hat e_{it}\hat e_{jt}$ is negligible as $T \to \infty$. This means that each additional regressor $\hat e_{it}$, asymptotically orthogonal to the previous ones, accounts for variation approximately equal to $\hat w_i^2 \hat\sigma_i^{-2}$, irrespective of the other regressors. Thus, to approximate $\frac{1}{T-h} SSR(I_k)$ we only need to subtract those variations from $\frac{1}{T-h} SSR(\emptyset) = \frac{1}{T-h}\sum_{t=1}^{T-h} \hat\varepsilon_{t+h}^2$:

$$\frac{1}{T-h} SSR(I_k) \approx \frac{1}{T-h} \sum_{t=1}^{T-h} \hat\varepsilon_{t+h}^2 - \sum_{i \in I_k} \hat w_i^2 \hat\sigma_i^{-2} := M_T(I_k).$$

⁹ This is the approximate prediction error, because we do not take into account some cross-products that vanish in the limit. Our approximation greatly simplifies computation.
¹⁰ We discuss its choice in the simulation section; it resembles the scaling quantity in the Mallows criterion.
¹¹ Our method is different from LASSO and least-angle regression (LARS): with our method there is no
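A quick numerical check of the identity in (2.3) (our own illustration with simulated residuals; here both $\hat w_i$ and $\hat\sigma_i^2$ are computed on the same sample, which makes the identity exact):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 500
eps = rng.standard_normal(T)     # stand-in for eps_hat_{t+h}
e_i = rng.standard_normal(T)     # stand-in for e_hat_{it}

w_i = e_i @ eps / T
sig2_i = e_i @ e_i / T
drop = (eps ** 2).mean() - ((eps - (w_i / sig2_i) * e_i) ** 2).mean()
# The drop in MSE equals w_i^2 / sigma_i^2, as in equation (2.3).
assert np.isclose(drop, w_i ** 2 / sig2_i)
```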

The algorithm is easy to implement and requires little computation: it only involves estimating the $w_i$'s and $\sigma_i^2$'s by their consistent sample equivalents.

The intuitive explanation above also demonstrates that, for a given $k$, we do not need to search over all possible subsets $I_k$; rather, we order the quantities $\hat w_i^2 \hat\sigma_i^{-2}$ and pick the $k$ largest. Thus, the minimization problem in Step (3) adds little computational complexity relative to the usual factor-augmented prediction, where the idiosyncratic factors are ignored.

Once we have each set $\hat I_k$, in Step (4) we penalize the sum of squared residuals to avoid over-fitting. The form of the penalty $g(N, T)$ and the scaling quantity $D_{NT}$ in Step (4) are discussed in Section 2.5 and in the simulation section. A penalty is necessary, because the prediction MSE always decreases when additional idiosyncratic factors are used as regressors.

We now discuss the role of Assumption 7. In our method, we essentially use the idiosyncratic factors $e_{it}$ as additional predictors. Under Assumption 7 these predictors are orthogonal, which means that each accounts for the same variation in $y_{t+h}$ irrespective of the other predictors. This leads to the criterion function in Step (3) and to an algorithm that is fast and easy to implement. Meanwhile, violation of Assumption 7(i) is not costly. The procedure in Step (3) computes the explanatory power of an idiosyncratic factor irrespective of its correlation with the other $e_{it}$'s. We thus claim, without proof, that asymptotically all idiosyncratic factors that have direct explanatory power ($w_i \ne 0$) will be selected, whether or not Assumption 7(i) is satisfied. The correlation between idiosyncratic factors, if non-zero, is accounted for only in Step (5), when we regress $y_{t+h}$ on all common factors and selected predictors. We demonstrate this point in our simulations, where the correlation between idiosyncratic factors is non-zero because the number of common factors is underestimated (Table 2.11).

2.5 Asymptotic Theory

In this section we show that, when the penalty term $g(N, T)$ satisfies certain conditions, the number ($k_0$) and the set ($I_0$) of predictive idiosyncratic factors can be correctly estimated with probability tending to 1 as $N, T \to \infty$, under the additional assumption that $N^5/T^4 \to 0$. We first study the asymptotic behavior of the model selection problem for a known $k_0$.

2.5.1 Known $k_0$ Case

We first establish consistency and derive the convergence rate of $\hat\gamma$ to its true value $\gamma$, up to a rotation $H$ defined below. Following Bai and Ng (2002), $C_{NT}$ denotes $\min\{\sqrt{N}, \sqrt{T}\}$.


Proposition 1. Let $\tilde V$ be the $r \times r$ diagonal matrix consisting of the $r$ largest eigenvalues of $XX'/(TN)$, and let $H = \tilde V^{-1}(\hat F'F/T)(\Lambda'\Lambda/N)$. Then, under Assumptions 1-5, $\hat\gamma - (H')^{-1}\gamma = O_p(T^{-1/2}) + O_p(C_{NT}^{-2})$ as $N, T \to \infty$.

Proof. See Appendix 2.A.1.

Bai and Ng (2006) establish the same rate of convergence, and further show asymptotic normality of $\sqrt{T}(\hat\gamma - (H')^{-1}\gamma)$, but their result is derived under the assumption $\sqrt{T}/N \to 0$. For our purposes, $\sqrt{T}/N \to 0$ is not needed; all we require is that both $N, T \to \infty$. Therefore, in the appendix, we re-derive the result stated above under a similar set of assumptions as in Bai and Ng (2006).

Next, we show that both the infeasible estimator of $w_i$, obtained from the true errors, and its feasible equivalent $\hat w_i$, obtained from the estimated errors, converge to their true values uniformly. Moreover, we show that the variance estimator of $e_{it}$, $\hat\sigma_i^2$, converges to its true value uniformly. The estimators $\hat w_i$ and $\hat\sigma_i^2$, together with their rates of convergence, imply that we can use the idiosyncratic factors as if they were known.

Lemma 2. Let $\tilde w_i = \frac{1}{T}\sum_{t=1}^{T} \varepsilon_{t+h} e_{it}$ be the infeasible estimator of $w_i$, and recall that its feasible estimator is $\hat w_i = \frac{1}{T}\sum_{t=1}^{T} \hat\varepsilon_{t+h}\hat e_{it}$. Under Assumptions 1-5, the following holds:

$$\max_{1\le i\le N} |\hat w_i - \tilde w_i| = O_p(C_{NT}^{-1})\left(O_p(N^{1/8}) + O_p(N^{1/2} T^{-1/2})\right),$$

$$\max_{1\le i\le N,\ i \notin I_0} |\hat w_i| = O_p(C_{NT}^{-1})\left(O_p(N^{1/8}) + O_p(N^{1/2} T^{-1/2})\right) + O_p(N^{1/\min\{8,\delta\}} T^{-1/2}).$$

Both terms are $o_p(1)$ when $N, T \to \infty$ and $N/T \to 0$.

Lemma 3. $\max_{1\le i\le N} |\hat\sigma_i^2 - \sigma_i^2| = o_p(1)$ if $N, T \to \infty$ and $N^5/T^4 \to 0$.

For the proofs, see Appendix 2.A.2. The two lemmas give the order of the maximal deviation of $\hat w_i$ and $\hat\sigma_i^2$ from the corresponding true values.

Intuitively, a larger $N$ means that the maxima over the random variables $\hat w_i$ and $\hat\sigma_i^2$ tend to increase with $N$. At the same time, the term $\max_{1\le i\le N,\ i\notin I_0} |\hat w_i|$ characterizes the "spurious relationship" between $\hat\varepsilon_{t+h}$ and $\hat e_{it}$: when the cross-sectional dimension is large, some $\hat e_{it}$'s will eventually seem to explain $\hat\varepsilon_{t+h}$, even if they are unrelated. On the other hand, a larger $T$ counters this effect by pulling each $\hat w_i$ closer to zero: for each $i$, $\hat w_i \xrightarrow{p} w_i$ as $T \to \infty$, with $w_i = 0$ for $i \notin I_0$. The relative rate between $N$ and $T$ is needed to ensure that uniform convergence over $N$ still holds. The requirement is stronger for $\max_{1\le i\le N} |\hat\sigma_i^2 - \sigma_i^2|$, where we need $N^5/T^4 \to 0$; this is a result of keeping our assumptions as comparable to Bai and Ng (2006) as possible.

It is also worth noting that the order of $\max_{1\le i\le N,\ i\notin I_0} |\hat w_i|$ depends on the moments of $\varepsilon_{t+h}$. The reason is as follows: the moment condition tells us how heavy the tail of $\varepsilon_{t+h}$ is. The higher $\delta$, the lighter the tails. As a result, spurious detection of relevant idiosyncratic factors is less likely, because large deviations of $\hat w_i$ are also less likely. This moment condition also influences the size of our penalty - see Theorem 6 and the discussion following it.

For a better understanding of the model selection method we propose, fix $k$ for now. In the following theorem, we show that when $N^5/T^4 \to 0$, so that $N$ is not too big compared to $T$, the criterion function $M_T(\cdot)$ converges uniformly to an asymptotic criterion function, which we denote by $M(\cdot)$.

Theorem 4. Under Assumptions 1-7, for each fixed $k$, $\max_{I_k} |M_T(I_k) - M(I_k)| \xrightarrow{p} 0$ if $N^5/T^4 \to 0$, where $M(I_k) = \sigma_\varepsilon^2 - \sum_{i \in I_k} w_i^2 \sigma_i^{-2}$.

Proof. See Appendix 2.A.3.

In this theorem, the term $\max_{I_k} |M_T(I_k) - M(I_k)|$ depends on $N$ implicitly through the maximum taken over $I_k$. The maximum is taken over all $k$-subsets of $\{1, 2, \dots, N\}$, which entails $\binom{N}{k}$ possibilities. Thus the $N$-$T$ relationship serves the same purpose as in Lemmas 2 and 3: to ensure uniform convergence of the criterion function over $N$.

It should be obvious that $M(I_{k_0})$ is minimized at the true set $I_0$: $w_i^2 \sigma_i^{-2}$ is non-zero only for $i \in I_0$. Thus, to find the minimum of $M(I_{k_0})$, we just need to include all $k_0$ predictors in the set $\{\hat e_{it}, i \in I_0\}$.


all the random variables involved in our model. This, in conjunction with tighter maximal inequalities, may lead to better rates of convergence than in Bai and Ng (2006) and also in our paper. However, writing such assumptions might yield another relationship between the convergence rates of N, T , because tighter maximal inequalities for double index arrays, needed here, will most often require such a relationship.

An immediate consequence of the above theorem is that, when $k_0$ is known, the model selection is consistent in the sense that the index set is chosen correctly with probability tending to 1. This can be shown using the same line of argument as in Theorem 5.7 in van der Vaart (2000); see Appendix 2.A.4.

Corollary 5. For known $k_0$, $P\big(I_0 = \arg\min_{I_{k_0}} M_T(I_{k_0})\big) \to 1$.

2.5.2 Consistent Estimation of $k_0$ and $I_0$

When $k_0$ is unknown, we penalize $k$ as in Step (4) to restrict the complexity of the model. The model can be selected consistently when $g(N, T)$ satisfies the conditions in the following theorem:

Theorem 6. Under Assumptions 1-7, when $N^5/T^4 \to 0$, if $g(N, T)$ satisfies: (i) $g(N, T) \to 0$; (ii) $g(N, T)/\max\{N^{-3/4}, N^{1/\min\{4, \delta/2\}} T^{-1}\} \to \infty$, then $\lim_{N,T\to\infty} P(\hat k = k_0) = 1$. Moreover, the subset selection is also consistent: $\lim_{N,T\to\infty} P(\hat I = I_0) = 1$.

Proof. See Appendix 2.A.5.

This theorem says that we select all the relevant idiosyncratic factors for prediction, and thus the true model, with probability tending to one as $N, T \to \infty$ and $N^5/T^4 \to 0$. Thus, we obtain the all-subsets solution without the need for an all-subsets algorithm.

In this theorem, the first requirement ($g(N, T) \to 0$) is standard in model selection problems. The second requirement is not conventional. Essentially, it imposes two requirements: $g(N, T) \gg N^{-3/4}$ and $g(N, T) \gg N^{1/\min\{4, \delta/2\}} T^{-1}$. The first comes from the order of $\max_{1\le i\le N,\ i\notin I_0} |\hat w_i - \tilde w_i|$ and the second from that of $\max_{1\le i\le N,\ i\notin I_0} |\tilde w_i|$.


maximum over subsets, which grows with $N$ even if $k_0$ is fixed. If we do not penalize for $N$ in a high-dimensional problem, and simply use classical information criteria that penalize for $T$ and $k$ only, we will end up selecting more variables than necessary. This is also pointed out in Broman and Speed (2002) and Casella et al. (2009). Second, we have a factor structure in the model, which brings about non-conventional rates arising from the factor estimation error.

The penalty term also depends on $\delta$, so the moment condition in Assumption 5 plays a role in this result. A bigger $\delta$ loosens the requirement on $g(N, T)$ and allows for smaller penalty functions. As explained previously, when $\delta$ is big, spurious detection of relevant idiosyncratic factors is less likely, and thus a small penalty term is sufficient to consistently select the model. Note that in such a case a large penalty term still satisfies requirements (i) and (ii), and is asymptotically equivalent in terms of consistency of model selection.

2.6 Simulations

In this section we use 50,000 replications for all tables. We generate the model in (2.1) using the following two setups.

Setup 1. For $i = 1, \dots, N$ and $t = 1, \dots, T$, we generate $f_t \sim N(0, I_2)$ independent over $t$; $\lambda_i \sim N(0, I_2)$ independent over $i$; and $e_{it} \sim N(0, 1)$ independent over $i, t$.

Setup 2. We use the same data generating process as in the simulations in Bai and Ng (2006), and generate serially correlated factors and cross-sectionally correlated idiosyncratic errors: $f_{jt} = \rho_j f_{j,t-1} + (1 - \rho_j^2)^{1/2} u_{jt}$, $j = 1, 2$, $\rho_j = 0.8^j$; $\lambda_i \sim N(0, I_2)$ independent over $i = 1, \dots, N$; $e_t = \bar\Omega(0.5)^{1/2} v_t$, $v_t \sim N(0, I_N)$, with $\bar\Omega(0.5)$ being the Toeplitz matrix whose $j$th diagonal is $0.5^j$ if $j \le 10$ and zero otherwise. Note that in the second setup we are considering a case where Assumption 7 is violated ($e_{it}$ is correlated with $e_{jt}$ for $|i - j| \le 10$).

In both setups, the variable of interest $y_{t+1}$ is generated from the prediction equation (2.2) for $t = 1, \dots, T$. We use the following specification for $\varepsilon_{t+1}$:

$$\varepsilon_{t+1} = \sqrt{a/k_0}\, \sum_{i=1}^{k_0} e_{it} + \sqrt{1-a}\, \tilde\varepsilon_{t+1}, \qquad (2.4)$$

where $\tilde\varepsilon_{t+1} \sim N(0, 1)$ is independent over $t$ for $t + 1 = 1, \dots, T$. Under this specification, $a$ is the percentage of the variation of $\varepsilon_{t+1}$ that can be explained by the $e_{it}$'s. In other words, $w_i = E(e_{it}\varepsilon_{t+1}) = \sqrt{a/k_0}$, and $\sum_{i=1}^{k_0} w_i^2 \sigma_i^{-2} = a$. We apply the algorithm described in Section 2.4.2 to obtain the relevant idiosyncratic errors, pertaining to the index set $\hat I$, predict $y_{T+h}$ using $f_T$ and $\{\hat e_{iT}, i \in \hat I\}$, and calculate the out-of-sample prediction MSE one time period ahead as $(y_{T+1} - \hat y_{T+1})^2$, averaged over all simulations. We calculate the standard error of the simulated MSE, reported in brackets. We also report the percentage of correctly estimated $k_0$ and $I_0$ in our tables for Setup 1.
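A compact sketch of the Setup 1 data generating process with the error specification (2.4) (our own illustration; the function name, the initial value convention for $y_1$, and the choice $\gamma = (1\ 1)'$ from the $\hat\gamma$ exercise below are our assumptions):

```python
import numpy as np

def simulate_setup1(N, T, a, k0, rng):
    """Setup 1 with epsilon from (2.4): the first k0 idiosyncratic
    errors are the relevant ones, each with w_i = sqrt(a / k0)."""
    F = rng.standard_normal((T, 2))        # f_t ~ N(0, I_2)
    Lam = rng.standard_normal((N, 2))      # lambda_i ~ N(0, I_2)
    E = rng.standard_normal((T, N))        # e_it ~ N(0, 1)
    X = F @ Lam.T + E                      # equation (2.1)
    eps_tilde = rng.standard_normal(T)
    eps = np.sqrt(a / k0) * E[:, :k0].sum(axis=1) + np.sqrt(1 - a) * eps_tilde
    gamma = np.ones(2)                     # gamma = (1 1)'
    y = np.empty(T)
    y[1:] = F[:-1] @ gamma + eps[1:]       # y_{t+1} = f_t' gamma + eps_{t+1}
    y[0] = F[0] @ gamma                    # initial value, our own convention
    return X, y
```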

Our method uses both the common and the idiosyncratic factors for prediction, and we thus denote it by CF-IF. For Setup 1, we compare our results to several other predictions. First, we compare with the "infeasible prediction", denoted infeasible CF-IF, where the true set of relevant idiosyncratic factors $I_0$ is known, and the prediction MSE is obtained as above, but for the true $I_0$ rather than $\hat I$. Second, we compare with Bai and Ng (2006), denoted CF, where the out-of-sample prediction is based on common factors only and is calculated as $\hat y_{T+1} = \hat f_T' \hat\gamma$. Third, we compare with the PLS method. Similar to the factor model, the PLS method constructs linear combinations of the predictor variables, called PLS components, to forecast the variable of interest. The difference is that PLS maximizes the covariance between the variable of interest and the PLS components. So PLS takes into account the covariance between the variable of interest and the predictors, and can be a good alternative to factor models. In our simulations, we consider different numbers of PLS components, and report those that result in the lowest prediction MSE; for example, in Table 2.4, PLS(3) is reported, which means that in this table the prediction MSE is smallest when the number of PLS components is set to 3. For Setup 2, the comparison is made only with CF and PLS, since for the "infeasible prediction" the true $I_0$ does not contain all the information useful for prediction, due to the violation of Assumption 7.


We impose an upper bound $k_{\max}$ on $k_0$ for computing a practical $D_{NT}$, and select the penalty $g(N, T)$ such that it satisfies the assumptions of Theorem 6. The upper bound $k_{\max}$ is only for finding $D_{NT}$; it is not necessary to restrict the optimization problem to $k \le k_{\max}$, neither from a theoretical nor from a computational point of view. We set $k_{\max} = 10$ and $k_0 = 0, 1, 2$ or $3$. (When $k_0 = 0$, equation (2.4) is not well defined; in this case we simply let $\varepsilon_{t+h} \sim N(0, 1)$ be independent of all $e_{it}$'s.) Since we penalize $T^{-1} SSR(I_k)$ and not its logarithm, we choose a scaling quantity for the penalty that resembles the Mallows criterion penalty, so that the scaling quantity is of comparable size to the objective function. We find that $D_{NT} = T^{-1} SSR(\hat I_{k_{\max}})$ works well in finite samples.

We choose the penalty function that seems to perform best in finite samples, $g(N, T) = N^{0.5}/T^{0.9}$. Note that if $N^5/T^4 \to 0$, then $N^{0.5}/T^{0.9} \to 0$, so $g(N, T) \to 0$. Also, because $\varepsilon_{t+h} \sim N(0, 1)$, all moments of $\varepsilon_{t+h}$ exist, so $\delta = \infty$ in Assumption 5. Thus, $g(N, T)/\max\{N^{-3/4}, N^{1/\min\{4, \delta/2\}} T^{-1}\} = N^{5/4}/T^{0.9}$.

In Table 2.1 we report our simulation results for $\hat\gamma$. Since the estimation of $\gamma$ is done in the first stage, the values of $a$ and $k_0$ do not affect the finite-sample properties of $\hat\gamma$, given that all error terms follow a normal distribution with unit variance. Thus we only report the $\hat\gamma$'s that correspond to our first simulation exercise (Table 2.2), with $a = 0.1$ and $k_0 = 1$. With $\gamma = (1\ 1)'$, we find that almost all reported values are close to the true value 1, although there seems to be a downward bias in all cases. The downward bias is smaller when $T$ or $N$ is bigger, due to the more precise estimation of the common factors when $N$ and $T$ are large.

Tables 2.2-2.4 show the simulated MSE of prediction under Setup 1 for our method, the infeasible method, and the method in Bai and Ng (2006), for $a = 0.1, 0.2, 0.3$ and $k_0 = 1$. We find that across all simulations with $k_0 = 1$, our MSE is lower than that in Bai and Ng (2006).


that Bai and Ng (2006) does better than PLS for $a = 0.1$ or $0.2$ (Tables 2.2 and 2.3) but worse for $a = 0.3$ (Table 2.4), although the difference is small in all three cases. This may imply that PLS tends to incorporate the idiosyncratic factor as it becomes more important, that is, as $a$ becomes bigger.

Figures 2.1-2.2 show how the precision of $\hat k$ changes with respect to $T$ or $N$. In Figure 2.1, $N = 100$ and $T = 100, 150$ or $200$. When the sample size is relatively small ($T = 100$), our information criterion favours a smaller model. It can also be seen that the frequency of $\hat k = k_0$ goes up quickly as $T$ increases. On the other hand, Figure 2.2 shows that when $T$ is fixed and the number of potential predictors becomes larger, there is at most a moderate decrease in the frequency of $\hat k = k_0$. This shows, as in the results of Tables 2.2-2.4, that as long as $T$ is large enough, the performance of our method is not much affected by a large $N$.

In Tables 2.5-2.8 we report the simulated MSE with $k_0 = 2, 3$. Keeping $a$ fixed, we find that the estimation of $k_0$ and $I_0$ is less precise when $k_0$ is bigger. This is because each relevant idiosyncratic factor now accounts for a smaller amount of relevant information in our simulation design, and thus becomes harder to detect. However, our method has a smaller MSE in all cases with $T = 500$ compared to Bai and Ng (2006)'s method and PLS.

Table 2.9 shows the simulated MSE under the first setup when $k_0 = 0$. In this case the true index set is $I_0 = \emptyset$, and in theory no improvement in MSE should occur from using idiosyncratic factors as predictors. We find that the percentage of $\hat k = 0$ is more than 79% across all simulations in Table 2.9, meaning that when no improvement can be made, our model selection method picks the same model as that in Bai and Ng (2006) in most cases. Our MSE is also close to that in Bai and Ng (2006). This means that using idiosyncratic factors when none are relevant does not harm forecast accuracy.

Table 2.10 shows the simulated MSE under the second setup for $a = 0.1$ and $k_0 = 1$. We find that although an important assumption of this paper is violated, our method still performs well. Compared to Bai and Ng (2006), our method leads to better prediction in most cases, and the improvement is bigger when $T$ is larger.


underestimated. Intuitively, when the predictive information from some common factors is not accounted for in the first stage, our method is able to summarize part of this information using idiosyncratic factors, and thus to reduce the loss from specifying too small a number of common factors. We study this feature in Table 2.11. The data are generated using the first setup with $k_0 = 0$; however, in the first stage, we take the number of common factors to be one, while the true number of common factors ($r$) is two. Thus only the first common factor enters the second stage; all other steps in the prediction are the same. We find that in all cases our method leads to better prediction than Bai and Ng (2006)'s method with only one estimated common factor. In our method, on average around 8 or 9 idiosyncratic factors are used as a proxy for the second common factor. We further compare our method with the standard case where both common factors are estimated: our procedure in the second step picks up the information not accounted for in the first stage, but the MSE is larger than when the true number of common factors is selected. Note also that this serves as another example where Assumption 7 is violated: when only one common factor is estimated, the residuals of the first stage are highly correlated.

We also ran the same simulations as in Tables 2.2-2.4 with $\tilde\varepsilon_{t+1} \sim t(4)$, a t-distribution with four degrees of freedom.

[Tables 2.1-2.11 and Figures 2.1-2.2 appear here in the original.]

2.7 Empirical Application

In this section we evaluate the performance of our method in predicting inflation. The data are taken from the website of the Federal Reserve Bank of St. Louis.¹² The original dataset contains 130 macroeconomic time series, mostly monthly. We choose to predict inflation because inflation is well documented by Stock and Watson (1999) to be a challenging series to forecast. Stock and Watson (2002b) and Bai and Ng (2008) further study this prediction problem by comparing classical methods with factor-based predictive methods. We will study how our method performs compared to standard methods in predicting this challenging series. The variable that represents inflation is the second difference of the log of CPIAUCSL (the consumer price index for all urban consumers, all items, which we abbreviate as CPI in this section).
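For concreteness, the transformation just described amounts to one line (our own sketch; the function name is ours, and `cpi` stands for the raw monthly CPIAUCSL series):

```python
import numpy as np

def inflation_from_cpi(cpi):
    """Second difference of log CPI: the inflation variable used in 2.7."""
    return np.diff(np.log(np.asarray(cpi, dtype=float)), n=2)
```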

We use data from January 1960 (denoted 1960:1). We drop a series from the dataset if it is discontinued. To use the factor model, we transform all series to make them stationary; all transformations are as in Stock and Watson (2002b), Appendix B. Series that are not on their list are dropped. We obtain $N = 85$ variables over $T = 656$ time periods.

We consider and compare the following candidate models: (i) the benchmark AR(1) model (abbreviated AR(1)); (ii) the standard Bai and Ng (2006) prediction with only common factors (CF); (iii) the method in this paper, with common factors estimated in the first stage and idiosyncratic factors estimated in the second stage (CF-IF); (iv) the AR(1) model augmented with common factors (AR(1)-CF); (v) the AR(1) model augmented with idiosyncratic factors estimated in the second stage (AR(1)-IF).¹³

For all methods, we construct an $h$-step-ahead forecast. Note that for a different prediction horizon $h$, the selected model and parameter values are usually different. Each forecast is based on ten years of data (i.e., $T = 120$). For example, to compute the 1-step-ahead forecast of CPI at 1970:1, we use only the part of the dataset from 1960:1 to 1969:12 (120 periods). The forecast is then compared with the observed value to calculate the prediction MSE, reported in Table 2.14. We also report heteroscedasticity and autocorrelation consistent (HAC) standard errors in parentheses. Since using shorter estimation windows ($T = 60$ or $30$) results in a larger prediction MSE, possibly because the factors are not estimated precisely with the smaller sample size, we only consider $T = 120$.

¹² http://research.stlouisfed.org/fred2/categories/33488
¹³ We also considered AR(1)-CF-IF. However, the method does not lead to better prediction than the
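The rolling-window evaluation just described can be sketched as follows (our own illustration, reusing the hypothetical predict_cf_if from Section 2.4.2; window=120 matches the ten-year windows above):

```python
import numpy as np

def rolling_mse(X_all, y_all, r, h, g_NT, window=120):
    """Out-of-sample MSE: each forecast of y at time t-1+h uses only
    the `window` observations ending at time t-1."""
    sq_errs = []
    for t in range(window, len(y_all) - h + 1):
        X, y = X_all[t - window:t], y_all[t - window:t]
        y_hat, _, _ = predict_cf_if(X, y, r, h, g_NT)
        sq_errs.append((y_all[t - 1 + h] - y_hat) ** 2)
    return float(np.mean(sq_errs))
```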

We estimate the model using the same penalty as in the simulation section. The only difference in terms of estimation is that, apart from using the estimated number of common factors ($\hat r$) according to Bai and Ng (2002), we also consider a fixed number of common factors, $r = 1, \dots, 4$. This is because we find that for this dataset the estimated number of common factors is too large ($\hat r$ between 16 and 23) to be practically useful.

In Table 2.12 we report the average number of idiosyncratic factors $\hat k$ chosen under the methods CF-IF and AR(1)-IF. It can be observed that for short prediction horizons our method selects more idiosyncratic factors; for longer horizons there are fewer relevant idiosyncratic factors, and accurate prediction seems harder to achieve. Moreover, when the number of common factors chosen is larger, there seems to be less information left in the idiosyncratic factors, and our method accordingly selects fewer of them. In Table 2.13 we report the five most frequently selected idiosyncratic factors for $r = 1$, ordered by the frequency with which each series appears among the top five explanatory idiosyncratic factors. Since hardly any idiosyncratic factors are selected for $h > 6$, we exclude those cases from Table 2.13. We report only the case $r = 1$ because the top-five list hardly changes with the choice of $r$. It can be seen that in the short run, variables representing personal consumption expenditures (DNDGRG3M086SBEA, PCEPI), together with other price indices (CUSR0000SA0L5, etc.), are the most relevant for prediction.

Table 2.12: Average number of idiosyncratic factors chosen (k̂) for predicting CPI

Method      h=1    h=2    h=3    h=6    h=12
CF-IF
  r = 1     6.18   4.27   0.26   0.41   0.35
  r = 2     6.62   3.95   0.25   0.47   0.37
  r = 3     6.17   1.45   0.15   0.32   0.35
  r = 4     5.12   1.47   0.20   0.34   0.39
  r = r̂     3.30   1.49   0.52   0.70   1.03
AR(1)-IF
  r = 0     2.66   4.69   0.06   0.35   0.18


Table 2.13: Five most explanatory idiosyncratic factors for predicting CPI, in descending order

Rank  h=1                h=2                h=3
1     DNDGRG3M086SBEA    CPITRNSL           USGOVT
2     PCEPI              DNDGRG3M086SBEA    CPIMEDSL
3     CUSR0000SA0L5      PPIFCG             CPITRNSL
4     CUSR0000SAS        CUSR0000SAC        NAPMPI
5     CUUR0000SA0L2      PCEPI              PCEPI

h: prediction horizon; r: number of common factors. The specifications for the variables in this table can be found in Appendix 2.B.4.


Table 2.14: Prediction MSE for CPI, in millionths

Model                      h=1      h=2      h=3      h=6      h=12
Bai and Ng (2006) (CF)
  r = 1                    8.21     8.23     8.14     8.22     8.23
                          (1.53)   (1.54)   (1.48)   (1.50)   (1.49)
  r = 2                    8.36     8.27     8.20     8.28     8.29
                          (1.61)   (1.45)   (1.50)   (1.50)   (1.49)
  r = 3                    8.50     7.81     8.29     8.39     8.30
                          (1.63)   (1.38)   (1.49)   (1.51)   (1.46)
  r = 4                    8.46     8.13     8.48     8.81     8.44
                          (1.43)   (1.45)   (1.55)   (1.62)   (1.47)
  r = r̂                    8.69     8.98    10.47    10.85    10.85
                          (1.18)   (1.42)   (1.98)   (1.71)   (1.74)
Our method (CF-IF)
  r = 1                    7.52     7.49     8.20     8.28     8.43
                          (1.13)   (1.23)   (1.48)   (1.50)   (1.49)
  r = 2                    8.15     8.17     8.26     8.35     8.53
                          (1.34)   (1.47)   (1.50)   (1.52)   (1.48)
  r = 3                    7.55     8.04     8.65     8.47     8.55
                          (1.07)   (1.45)   (1.61)   (1.50)   (1.50)
  r = 4                    7.49     8.27     8.87     9.13     8.52
                          (1.16)   (1.47)   (1.69)   (1.64)   (1.47)
  r = r̂                    8.40     9.50    11.55    11.36    11.55
                          (1.17)   (1.53)   (2.15)   (1.74)   (1.73)
AR(1)-CF
  r = 1                    7.66     8.68     8.18     8.26     8.24
                          (1.33)   (1.50)   (1.47)   (1.51)   (1.49)
  r = 2                    7.86     8.73     8.19     8.32     8.31
                          (1.43)   (1.41)   (1.47)   (1.51)   (1.49)
  r = 3                    7.92     8.14     8.21     8.41     8.32
                          (1.43)   (1.35)   (1.49)   (1.52)   (1.48)
  r = 4                    8.12     8.28     8.32     8.78     8.47
                          (1.46)   (1.39)   (1.51)   (1.63)   (1.48)
  r = r̂                    8.36     8.86    10.48    10.29    11.09
                          (1.08)   (1.27)   (1.97)   (1.69)   (1.83)
AR(1)                      7.50     8.47     8.12     8.14     8.17
                          (1.21)   (1.42)   (1.45)   (1.48)   (1.49)
AR(1)-IF                   6.81     7.88     8.13     8.15     8.57
                          (1.23)   (1.21)   (1.46)   (1.48)   (1.56)

h: prediction horizon; r: number of common factors; HAC standard errors in parentheses.

2.8 Conclusions

We extend the standard factor-augmented prediction framework by allowing a set of variables to have idiosyncratic factors that are relevant for prediction, and propose correcting the point prediction by including the idiosyncratic factors as predictors. We also propose a new model selection method to find the set of variables that are relevant for prediction, minimizing the prediction error subject to a penalty on the number of variables used. We prove that our method finds this set with probability tending to one, as long as N, T → ∞ and N⁵/T⁴ → 0.

Our method is also computationally efficient, in the sense that we do not need to search over all possible subsets of the data. We propose an algorithm similar to forward selection of regressors in a linear model with orthogonal regressors; see the sketch below. In practice, the computational complexity is comparable to that of the usual factor-augmented prediction. We also show that our method results in a lower mean-squared prediction error for a large set of (N, T) pairs, compared to the factor-augmented prediction in Bai and Ng (2006).
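A stylized sketch of this selection step is given below. It is not the chapter's exact algorithm: the score and penalty shown (w̃ᵢ²/σ̃ᵢ² and a flat per-variable cost) are simplified placeholders for the statistics defined in Appendix 2.A.2 and the penalty used in the simulations.

```python
import numpy as np

def select_idiosyncratic(eps, E, penalty):
    """
    Greedy selection of idiosyncratic factors, in the spirit of forward
    selection with (approximately) orthogonal regressors: each candidate
    is scored separately, so no search over all subsets is needed.

    eps:     (T,) first-stage prediction residuals, already aligned so
             that eps[t] plays the role of eps_{t+h} paired with E[t].
    E:       (T, N) estimated idiosyncratic components e_it.
    penalty: scalar cost per included variable (simplified placeholder).
    """
    T, _ = E.shape
    w = E.T @ eps / T                   # w_i  = (1/T) sum_t eps_{t+h} e_it
    s2 = np.mean(E ** 2, axis=0)        # s2_i = (1/T) sum_t e_it^2
    score = w ** 2 / s2                 # per-variable reduction in squared error
    return [int(i) for i in np.argsort(score)[::-1] if score[i] > penalty]
```

Because the candidates are scored individually and ranked once, the cost is linear in N, which is what makes the complexity comparable to ordinary factor-augmented prediction.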

Interesting extensions of our methods would include additional predictors (such as lags) in the prediction equation, correlated idiosyncratic factors, multivariate prediction, and common factors that are also selected based on their relevance for prediction.


Appendix

2.A Proof of theorems and lemmas

2.A.1 Proof of Proposition 1

To prove Proposition 1 and the two theorems we need the following lemma. This lemma is due to Bai and Ng (2006), Lemma A.1(i), and its proof can be found in the working paper version of their paper. Let $\tilde V$ be the $r \times r$ diagonal matrix consisting of the $r$ largest eigenvalues of $XX'/(TN)$, and let $H = \tilde V^{-1}(\hat F' F/T)(\Lambda'\Lambda/N)$.
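To make the rotation matrix $H$ concrete, here is a small self-contained simulation (the dimensions and iid Gaussian design are illustrative assumptions, not the chapter's setup) that computes $H$ from this definition and checks numerically that $\hat F \approx F H'$, in line with Lemma 2.A.1(i) below:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, r = 200, 100, 2                           # illustrative dimensions

F = rng.standard_normal((T, r))                 # true factors
Lam = rng.standard_normal((N, r))               # true loadings
X = F @ Lam.T + rng.standard_normal((T, N))     # factor model plus noise

# Principal-components estimates, normalized so F_hat'F_hat/T = I
eigval, eigvec = np.linalg.eigh(X @ X.T / (T * N))
F_hat = np.sqrt(T) * eigvec[:, -r:][:, ::-1]
V_tilde = np.diag(eigval[-r:][::-1])            # r largest eigenvalues

# Rotation matrix H = V^{-1} (F_hat'F/T)(Lam'Lam/N), as defined above
H = np.linalg.inv(V_tilde) @ (F_hat.T @ F / T) @ (Lam.T @ Lam / N)

# (1/T) sum_t ||f_hat_t - H f_t||^2 should be small, shrinking as N, T grow
print(np.mean((F_hat - F @ H.T) ** 2))
```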

Lemma 2.A.1. Under Assumptions 1–5, when $N, T \to \infty$,
(i) $\frac{1}{T}\sum_{t=1}^{T} \|\hat f_t - H f_t\|^2 = O_p(C_{NT}^{-2})$;
(ii) $\frac{1}{T-h}\sum_{t=1}^{T-h} (\hat f_t - H f_t) f_t' = O_p(C_{NT}^{-2})$;
(iii) $\frac{1}{T-h}\sum_{t=1}^{T-h} (\hat f_t - H f_t) \hat f_t' = O_p(C_{NT}^{-2})$;
(iv) $\frac{1}{T-h}\sum_{t=1}^{T-h} (\hat f_t - H f_t) \varepsilon_{t+h} = O_p(C_{NT}^{-2})$.

Proof of Proposition 1

Proof. $\hat\gamma$ is the coefficient from regressing $y_{t+h}$ on $\hat f_t$. Rewrite the model as
$$y_{t+h} = f_t'\gamma + \varepsilon_{t+h} = \hat f_t' H'^{-1}\gamma + (H f_t - \hat f_t)' H'^{-1}\gamma + \varepsilon_{t+h},$$
or in matrix notation as
$$y = \hat F H'^{-1}\gamma + \varepsilon + (F H' - \hat F) H'^{-1}\gamma.$$
So
$$\hat\gamma - H'^{-1}\gamma = \left(\frac{\hat F'\hat F}{T}\right)^{-1}\frac{\hat F'\varepsilon}{T} + \left(\frac{\hat F'\hat F}{T}\right)^{-1}\frac{\hat F'(F H' - \hat F) H'^{-1}\gamma}{T}.$$
$\hat F$ is normalized such that $\hat F'\hat F/T = I$. So the first term equals
$$\frac{1}{T}\sum_{t=1}^{T-h}\hat f_t\,\varepsilon_{t+h} = \frac{1}{T}\sum_{t=1}^{T-h} H f_t\,\varepsilon_{t+h} + \frac{1}{T}\sum_{t=1}^{T-h}(\hat f_t - H f_t)\varepsilon_{t+h} = O_p(T^{-1/2}) + O_p(C_{NT}^{-2})$$
by Assumption 5 and Lemma 2.A.1(iv). For the second term, by Lemma 2.A.1, $\frac{1}{T}\hat F'(F H' - \hat F) = \frac{1}{T}\sum_{t=1}^{T}\hat f_t(H f_t - \hat f_t)' = O_p(C_{NT}^{-2})$. So $\hat\gamma - H'^{-1}\gamma = O_p(T^{-1/2}) + O_p(C_{NT}^{-2})$.

2.A.2 Proof of Lemmas 2 and 3

We need the orders in probability of the following terms before we proceed. As in Lemma 2, let $\tilde w_i = \frac{1}{T}\sum_{t=1}^{T}\varepsilon_{t+h} e_{it}$ and $\tilde\sigma_i^2 = \frac{1}{T}\sum_{t=1}^{T} e_{it}^2$.

Lemma 2.A.2. The orders of the terms are as follows:

(i) Under Assumptions 1, 3 and 4, $\max_{1\le i\le N}\left\|\frac{1}{T}\sum_{t=1}^{T} f_t e_{it}\right\| = O_p(N^{1/2}T^{-1/2})$;

(ii) Under Assumption 3, $\max_{1\le i\le N}|\tilde\sigma_i^2| = O_p(N^{1/4})$;

(iii) Under Assumption 7, $\max_{1\le i\le N}|\tilde\sigma_i^2 - \sigma_i^2| = O_p(N^{1/2}T^{-1/2})$;

(iv) Under Assumptions 3–5, $\max_{1\le i\le N,\, i\notin I_0}|\tilde w_i| = O_p(N^{1/\min\{8,\delta\}}T^{-1/2})$.

Proof. For (i), using Markov's inequality,
$$P\left(\max_{1\le i\le N}\left\|\frac{1}{T}\sum_{t=1}^{T} f_t e_{it}\right\| > \epsilon\right) \le \sum_{i=1}^{N} P\left(\left\|\frac{1}{T}\sum_{t=1}^{T} f_t e_{it}\right\| > \epsilon\right) \le N\cdot\max_{1\le i\le N} P\left(\left\|\frac{1}{T}\sum_{t=1}^{T} f_t e_{it}\right\| > \epsilon\right) \le N\cdot\frac{\max_{1\le i\le N} E\left\|\frac{1}{T}\sum_{t=1}^{T} f_t e_{it}\right\|^2}{\epsilon^2}. \qquad (2.5)$$
Now
$$\max_{1\le i\le N} E\left\|\frac{1}{T}\sum_{t=1}^{T} f_t e_{it}\right\|^2 = \frac{1}{T}\max_{1\le i\le N}\frac{1}{T}\sum_{t=1}^{T}\sum_{s=1}^{T} E[f_t' f_s e_{it} e_{is}] = \frac{1}{T}\max_{1\le i\le N}\frac{1}{T}\sum_{t=1}^{T}\sum_{s=1}^{T} E[f_t' f_s]E[e_{it} e_{is}] \le \frac{M_1}{T}\max_{1\le i\le N}\frac{1}{T}\sum_{t=1}^{T}\sum_{s=1}^{T} E[e_{it} e_{is}] \le M_2/T,$$
where we have used Assumptions 3(ii) and 4, and the fact that $E[f_t' f_s] \le \frac{1}{2}E[f_t' f_t + f_s' f_s]$ is bounded under Assumption 1. Returning to equation (2.5), $P\left(\sqrt{T/N}\max_{1\le i\le N}\left\|\frac{1}{T}\sum_{t=1}^{T} f_t e_{it}\right\| > \epsilon\right) \le M_2/\epsilon^2$ for any $\epsilon > 0$, so $\max_{1\le i\le N}\left\|\frac{1}{T}\sum_{t=1}^{T} f_t e_{it}\right\| = O_p(N^{1/2}T^{-1/2})$.

The second result can be treated similarly. Use $\|\cdot\|_p$ to denote the usual $L_p$-norm for random variables (i.e. $\|\cdot\|_p = (E|\cdot|^p)^{1/p}$). Using the same inequalities as in equation (2.5), we have
$$P\left(\max_{1\le i\le N}|\tilde\sigma_i^2| > \epsilon\right) \le N\cdot\frac{\max_{1\le i\le N} E\left|\frac{1}{T}\sum_{t=1}^{T} e_{it}^2\right|^4}{\epsilon^4} \le N\cdot\frac{\max_{1\le i\le N}\left\|\frac{1}{T}\sum_{t=1}^{T} e_{it}^2\right\|_4^4}{\epsilon^4} \le N\cdot\frac{\max_{1\le i\le N}\left(\frac{1}{T}\sum_{t=1}^{T}\|e_{it}^2\|_4\right)^4}{\epsilon^4} = N\cdot\frac{\max_{1\le i\le N}\left(\frac{1}{T}\sum_{t=1}^{T}(E|e_{it}|^8)^{1/4}\right)^4}{\epsilon^4} \le N\cdot M/\epsilon^4,$$
where the last inequality follows from Assumption 3(i), which states that $E|e_{it}|^8 \le M$. Thus it follows that $\max_{1\le i\le N}\left|\frac{1}{T}\sum_{t=1}^{T} e_{it}^2\right| = O_p(N^{1/4})$.

For (iii), under Assumption 7,
$$P\left(\max_{1\le i\le N}|\tilde\sigma_i^2 - \sigma_i^2| > \epsilon\right) \le \frac{N}{T}\cdot\frac{\max_{1\le i\le N} E\left|\frac{1}{\sqrt T}\sum_{t=1}^{T}(e_{it}^2 - \sigma_i^2)\right|^2}{\epsilon^2} \le \frac{N}{T}\cdot\frac{M}{\epsilon^2}.$$
So $\max_{1\le i\le N}|\tilde\sigma_i^2 - \sigma_i^2| = O_p(N^{1/2}T^{-1/2})$.

For (iv), the maximum is taken over the set $[1, N]\setminus I_0$:
$$P\left(\max_{1\le i\le N,\, i\notin I_0}|\tilde w_i| > \epsilon\right) \le N\cdot\frac{\max_{1\le i\le N,\, i\notin I_0} E\left|\frac{1}{T}\sum_{t=1}^{T}\varepsilon_{t+h} e_{it}\right|^{\min\{8,\delta\}}}{\epsilon^{\min\{8,\delta\}}}. \qquad (2.6)$$
From Appendix 2.B.3 we know that $\varepsilon_{t+h} e_{it}$ forms a martingale difference sequence over $t$ for $i\notin I_0$. Thus we can apply Inequality (1.1) from Dharmadhikari et al. (1968) and get
$$E\left|\frac{1}{\sqrt T}\sum_{t=1}^{T}\varepsilon_{t+h} e_{it}\right|^{\min\{8,\delta\}} \le M_1\,\frac{1}{T}\sum_{t=1}^{T} E|\varepsilon_{t+h} e_{it}|^{\min\{8,\delta\}} \le M_2.$$
Putting this back into equation (2.6), we get
$$P\left(\max_{1\le i\le N,\, i\notin I_0}|\tilde w_i| > \epsilon\right) \le \frac{N}{T^{\min\{8,\delta\}/2}}\cdot\frac{M_2}{\epsilon^{\min\{8,\delta\}}},$$
so $\max_{1\le i\le N,\, i\notin I_0}|\tilde w_i| = O_p(N^{1/\min\{8,\delta\}}T^{-1/2})$.
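As an informal numerical illustration of part (i), the following sketch (iid Gaussian draws and a scalar factor, both illustrative assumptions) shows that the simulated maxima stay below the $N^{1/2}T^{-1/2}$ rate, which the lemma gives as an upper bound rather than an exact order:

```python
import numpy as np

rng = np.random.default_rng(1)

def max_cross_moment(N, T, reps=200):
    """Monte Carlo mean of max_i |(1/T) sum_t f_t e_it| for iid normal draws."""
    vals = np.empty(reps)
    for k in range(reps):
        f = rng.standard_normal(T)          # scalar factor, for simplicity
        e = rng.standard_normal((T, N))     # independent idiosyncratic errors
        vals[k] = np.abs(f @ e / T).max()
    return vals.mean()

# Lemma 2.A.2(i) gives an O_p(sqrt(N/T)) upper bound; for these light-tailed
# draws the simulated maxima remain well below sqrt(N/T).
for N, T in [(50, 100), (200, 100), (200, 400)]:
    print(N, T, round(max_cross_moment(N, T), 4), round(np.sqrt(N / T), 4))
```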

The next lemma shows a result which is itself of theoretical interest. It gives an upper bound on the maximum deviation of the $\hat\lambda_i$'s from the $\lambda_i$'s. Bai (2003) shows a corresponding result for $\max_{1\le t\le T}\|\hat f_t - H' f_t\|$.

Lemma 2.A.3. Under Assumptions 1–5, $\max_{1\le i\le N}\|H'^{-1}\lambda_i - \hat\lambda_i\| = O_p(C_{NT}^{-2})\cdot O_p(N^{1/4}) + O_p(N^{1/2}T^{-1/2})$.

Proof. Let $e_i = (e_{i1}, \dots, e_{iT})'$. Using the relation $\hat\Lambda = X'\hat F/T$ (see Section 2.4), the term can be expanded as
$$\begin{aligned}
\max_{1\le i\le N}\|H'^{-1}\lambda_i - \hat\lambda_i\| &= \max_{1\le i\le N}\left\|H'^{-1}\lambda_i - \frac{\hat F'(F\lambda_i + e_i)}{T}\right\| \\
&\le \max_{1\le i\le N}\left\|\frac{\hat F'(\hat F - F H')}{T} H'^{-1}\lambda_i\right\| + \max_{1\le i\le N}\left\|\frac{\hat F' e_i}{T}\right\| \\
&= \max_{1\le i\le N}\left\|\frac{1}{T}\sum_{t=1}^{T}\hat f_t(\hat f_t - H f_t)'\cdot H'^{-1}\lambda_i\right\| + \max_{1\le i\le N}\left\|\frac{1}{T}\sum_{t=1}^{T} f_t e_{it}\right\| \\
&\le \left\|\frac{1}{T}\sum_{t=1}^{T}\hat f_t(\hat f_t - H f_t)'\right\|\cdot\|H'^{-1}\|\max_{1\le i\le N}\|\lambda_i\| + \max_{1\le i\le N}\left\|\frac{1}{T}\sum_{t=1}^{T} f_t e_{it}\right\|.
\end{aligned}$$
By Lemma 2.A.1, $\frac{1}{T}\sum_{t=1}^{T}\hat f_t(\hat f_t - H f_t)' = O_p(C_{NT}^{-2})$. By Assumption 2, the fourth moment of $\lambda_i$ exists, so $\max_{1\le i\le N}\|\lambda_i\| = O_p(N^{1/4})$.14 By Lemma 2.A.2(i), $\max_{1\le i\le N}\left\|\frac{1}{T}\sum_{t=1}^{T} f_t e_{it}\right\| = O_p(N^{1/2}T^{-1/2})$. As a result,
$$\max_{1\le i\le N}\|H'^{-1}\lambda_i - \hat\lambda_i\| = O_p(C_{NT}^{-2})\cdot O_p(N^{1/4}) + O_p(N^{1/2}T^{-1/2}).$$

Lemma 2.A.4. Under Assumptions 1–5,

(i) $\frac{1}{T-h}\sum_{t=1}^{T-h}(f_t'\gamma - \hat f_t'\hat\gamma)^2 = O_p(C_{NT}^{-2})$;

(ii) $\max_{1\le i\le N}\frac{1}{T-h}\sum_{t=1}^{T-h}(f_t'\lambda_i - \hat f_t'\hat\lambda_i)^2 = O_p(C_{NT}^{-2})O_p(N^{1/2}) + O_p(N T^{-1})$.

Proof. For (i), add and subtract $f_t' H'\hat\gamma$ in the sum. We then have
$$\frac{1}{T-h}\sum_{t=1}^{T-h}(f_t'\gamma - \hat f_t'\hat\gamma)^2 = \frac{1}{T-h}\sum_{t=1}^{T-h}(f_t'\gamma - f_t' H'\hat\gamma + f_t' H'\hat\gamma - \hat f_t'\hat\gamma)^2 \le \frac{2}{T-h}\sum_{t=1}^{T-h}(f_t'\gamma - f_t' H'\hat\gamma)^2 + \frac{2}{T-h}\sum_{t=1}^{T-h}(f_t' H'\hat\gamma - \hat f_t'\hat\gamma)^2. \qquad (2.7)$$

14 This can be derived by the following maximal inequality: $P(\max_{1\le i\le N}\|\lambda_i\| > \epsilon) \le$
