Nuclear Norm Regularized Estimation of Interactive Fixed Effects Panel Regression Models: a Monte Carlo Approach



Lucian B. Franzky

July 2019

Abstract


Contents

1 Introduction
2 Literature Review
3 Estimators
3.1 Pooled Ordinary Least Squares
3.2 Least Squares
3.3 Nuclear Norm Minimization
3.4 Nuclear Norm Penalization
3.5 Post Nuclear Norm Estimation
3.6 Estimators for the Number of Factors
3.6.1 Moon and Weidner
3.6.2 Eigenvalue Ratio
3.6.3 Eigenvalue Growth Rate
3.6.4 Bai and Ng 1
3.6.5 Bai and Ng 3
4 Simulation Setup
4.1 i.i.d. Errors and Factors
4.2 Cross-Sectionally Correlated Errors in the Explanatory Variables
4.3 Autocorrelated Errors
4.4 Heteroskedastic Errors
4.5 Autocorrelated Factor
4.6 Dynamic Factor Model
4.7 Lagged Dependent Variable
4.8 Cross-Sectionally Correlated Errors between the Dependent and Explanatory Variables
4.9 Estimators for the Number of Factors
5 Results
5.1 i.i.d. Errors and Factors
5.2 Cross-Sectionally Correlated Errors in the Explanatory Variables
5.3 Autocorrelated Errors
5.4 Heteroskedastic Errors
5.5 Autocorrelated Factor
5.6 Dynamic Factor Models
5.7 Lagged Dependent Variable
5.8 Cross-Sectionally Correlated Errors between the Dependent and Explanatory Variables
5.9 Estimators for the Number of Factors
6 Empirical Application
6.1 Data Description
7 Conclusion
8 Appendix
8.1 Appendix A: Static Model - Monte Carlo Simulation Results
8.1.1 i.i.d. Errors and Factors
8.1.2 Cross-Sectionally Correlated Errors in the Explanatory Variables
8.1.3 Autocorrelated Errors
8.1.4 Heteroskedastic Errors
8.1.5 Autocorrelated Factor
8.1.6 Dynamic Factor Model
8.1.7 Lagged Dependent Variables
8.1.8 Cross-Sectionally Correlated Errors between the Dependent and Explanatory Variables
8.2 Appendix B: Estimating the Number of Factors
8.3 Appendix C: Empirical Application Results
8.3.1 Moving Window Estimation Results
8.3.2 Graphs of the Estimates


1 Introduction

Panel regression models with interactive fixed effects are an interesting topic at the frontier of a rapidly developing field of econometrics, which has a growing overlap with machine learning. These models assume that the observed panel of some dependent variable is determined not only by observed panels of explanatory variables, but also by general factors per time period and the individuals' responses to these factors, i.e. their respective factor loadings. The interactions of these factors and loadings are also called interactive fixed effects.

Such models can be represented in matrix form as:

$$Y = \sum_{k=1}^{K} \beta_k X_k + \lambda f' + E, \qquad (1)$$

where Y, X_k ∈ R^{N×T} are the observed variables, λ ∈ R^{N×R} are the individual factor loadings, f ∈ R^{T×R} are the factors and E ∈ R^{N×T} contains the error terms. The parameters β_k ∈ R, k = 1, 2, ..., K, are assumed to be constant over time and across individuals, the factor loadings λ_{i,r}, i = 1, 2, ..., N, r = 1, 2, ..., R, are assumed to be constant over time and the factors f_{t,r}, t = 1, 2, ..., T, are assumed to be constant across individuals.

The interactive fixed effects Γ = λf' can be estimated by Principal Components Analysis. PCA is based on the Singular Value Decomposition of the residual matrix of the regression, which factorizes these residuals into the product of three matrices, UΣV', with U ∈ R^{N×R}, Σ ∈ R^{R×R} and V ∈ R^{T×R}, in order to reduce its complexity. The matrix Σ is diagonal with the residual matrix's singular values on the diagonal in decreasing order. In this way the data are summarized into the R components that summarize them best, where the elements in Σ give the summarizing strength of each component, the elements in U give the corresponding individuals' responses to this component and the elements in V give the time periods' responses to the component. The combination of these matrices is unique and any matrix can be written as such a Singular Value Decomposition. Using this method for some choice of the number R thus rewrites the matrix into a form where as much of its variation as possible is captured in the chosen number of (principal) components, making this method ideal for the estimation of the interactive fixed effects.
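To illustrate this step, a minimal R sketch of the rank-R truncation of the SVD is given below; the function and variable names (`extract_ife`, `resid`) are ours and the sketch is only meant to show the mechanics.

```r
# Rank-R approximation of an N x T residual matrix via its Singular Value
# Decomposition; the product of loadings and factors estimates the
# interactive fixed effects Gamma = lambda f'.
extract_ife <- function(resid, R) {
  sv <- svd(resid)                                                   # resid = U diag(d) V'
  lambda <- sv$u[, 1:R, drop = FALSE] %*% diag(sv$d[1:R], nrow = R)  # N x R loadings
  f      <- sv$v[, 1:R, drop = FALSE]                                # T x R factors
  list(lambda = lambda, f = f, Gamma = lambda %*% t(f))
}
```

Here the singular values are absorbed into the loadings; any normalization that leaves the product λf' unchanged would do equally well.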

In a recent paper, H.R. Moon and M. Weidner (2019) derive two new estimators for this kind of model with unobserved interactive fixed effects, using Nuclear Norm Regularization. The Nuclear Norm of a matrix is defined as the sum of all singular values of that matrix, thus relating to the interactive fixed effects structure in a natural way. In the paper, two estimators are proposed. The first minimizes the Nuclear Norm of a matrix objective function for the residuals of the regression of the dependent variable on its regressors. The second adds a Nuclear Norm penalization term for the matrix of interactive fixed effects to the Least Squares objective function, thus taking this structure into account more explicitly. The motivation for doing so is that it circumvents the possibility of ending up at a local extreme point of the objective function rather than the global one, which is a possibility when using the more traditional estimation techniques for these kinds of models. Those techniques are usually based on Least Squares estimation, which does not always have a convex objective function. Figure 1 gives an example where the Least Squares objective function will seldom converge to the global minimum, while the Nuclear Norm Penalized objective function will at all times.


Figure 1: Example from Moon and Weidner (2019) where the values of the objective functions for LS estimation L_R(β) and the NN Penalized estimation Q_ψ(β) are both plotted using the same scaling. The true parameter is β0 = 2.

non-convex constraint. The convex relaxation of it that is used for the new estimators states that instead the Nuclear Norm of Γ should be below some number C. The penalization parameter ψ_C that is used for the estimation is a transformation of this number, thus relating the constraint to the objective function. This technique, though reducing the speed of convergence, makes the objective function strictly convex under appropriate regularity conditions on the regressors, which means that it has a unique local minimum that is also the global minimum. In addition, Moon and Weidner's (2019) approach gets rid of two assumptions that have to be made for successful Least Squares estimation of the underlying model. First, LS requires that the number of factors is known and minimizes the objective function taking into account the presence of the corresponding matrix of interactive fixed effects. Even though this number R can be estimated from the data, it is more convenient when the estimator does not require such additional input. This is the main significance of the Nuclear Norm Minimizing estimator, as it already produces a residual matrix with the lowest possible Nuclear Norm, i.e. it estimates the parameters in such a way that the regression residuals contain as few and as weak factors as possible. It thus needs neither the number of factors nor the penalization parameter to be specified in order to consistently estimate the parameters β. As was proven in Moon and Weidner (2019), lim_{min(N,T)→∞} β̂_NNmin = β.

Based on this estimate, a second inconvenience of the Least Squares estimator is tackled, namely that the regressors X_k should be of high rank. Using the Nuclear Norm Minimizing results, an upper bound for the number of factors is chosen, and with this the penalization parameter ψ̂ and the coefficients are estimated. It can then be shown that lim_{min(N,T)→∞} β̂_NNpen = β without any restrictions on the rank of the regressors, implying it is applicable to both low-rank and high-rank regressors. Both results are linked by the property that lim_{ψ→0} β̂_NNpen = β̂_NNmin.


To reach asymptotic equivalence to the LS estimator in smaller panels as well, a routine is proposed that reaches statistical indistinguishability of the estimates after at least two iterations. For the determination of the number of factors that should be extracted, which is needed for this so-called Post Estimation routine, a new method is proposed that determines this number based on the estimate ψ̂.

The paper provides all the proofs and theory regarding the new estimators, but lacks a broad study of their behavior or an application evaluating their performance on real data. This study's primary contribution is the extension of the setup in Moon and Weidner (2019) to an extensive Monte Carlo study simulating various circumstances. These circumstances involve the sensitivity to increases in the number of variables K and the number of factors R, as well as to differences in the relative effect sizes of the independent variables (the parameters) and to different levels of unbalancedness in the data (the panel dimensions). We also test the performance under various error structures, such as cross-sectional correlation, autocorrelation and heteroskedasticity. Moon and Weidner (2019) assume i.i.d. error terms in all explanatory and dependent variables, linking them only through their common dependence on the factors. In practice, however, it is seldom the case that one finds variables having such a relation, which justifies an analysis of the sensitivity to such error structures. Next, as the authors only look at static factor models, we create dynamic factor models and test the estimators under several scenarios regarding the factor structure. In order to conduct this analysis, we implement the new routines in R.

We proceed to evaluate the proposed method for estimating the number of factors to be extracted, comparing its performance to the methods derived by Ahn and Horenstein (2013) and Bai and Ng (2002), thereby testing the justification for introducing a new estimator in this field. Finally, the estimators are applied to a finance data set to evaluate their performance in empirical research.

In all scenarios, the five estimators (Nuclear Norm Minimization, Nuclear Norm Penalization and Post Estimation in 1, 2 and 3 steps respectively) are compared to the simple Pooled OLS and the Least Squares estimator that under certain conditions fails to converge to the global minimum. Because of this, the latter estimator requires the estimation routine to start from a number of different initial guesses for the parameter in order to find the global minimum.

Our simulation results indicate that the new Nuclear Norm Regularized estimators generally perform well, and under certain scenarios the Post Estimation even provides better convergence to the true value than the optimal Least Squares estimators do, e.g. in the case of considerable levels of heteroskedasticity. We however also find that the Post Estimation based on the Nuclear Norm Regularized estimates does not always converge to the true global minimum, e.g. for high levels of autocorrelation in the variables or for highly skewed panel dimensions, thus seemingly contradicting the theory, as the routine lets the estimates converge to local extreme points instead. Overall the newly proposed estimators nonetheless yield an efficiency gain, especially in scenarios where "one-shot" Least Squares estimation (i.e. letting the routine converge just once) tends to converge to widely varying estimates. The Nuclear Norm Regularized estimators provide a method that purposefully finds the global minimum in almost all situations, so that the researcher does not have to compare multiple estimation results before progressing. Our results further indicate that the new estimators can safely be applied to dynamic panels as well.


balanced panels of sufficiently large dimensions are available. As an innovation, we increase the effectiveness by combining the estimator with the internal upper bound derived by Ahn and Horenstein (2013) instead of an arbitrary number given a priori by the researcher, as Moon and Weidner (2019) do.


2 Literature Review

Nuclear Norm Penalized estimation is used extensively in machine learning and statistical learning. In these fields of research, the parameter of interest is the matrix of interactive fixed effects from equation (1). The method is mainly used for reduced rank regressions (e.g. Rohde and Tsybakov (2011)) or for matrix completion (e.g. Recht, Fazel and Parrilo (2010)). It is also discussed in the literature on efficient estimators, which is important in machine learning. For example, Jaggi and Sulovsky (2010) find a relatively fast general algorithm for Nuclear Norm problems which works especially well for very large matrices. While being free of tuning parameters, this method comes with strong convergence guarantees and is easy to parallelize. More concretely designed for matrix completion is the algorithm proposed by Hastie et al. (2015), which builds upon two earlier approaches to solve the matrix-completion problem, combining them into an efficient matrix factorization and completion method.

In recent years, Nuclear Norm Penalized estimation has found its way into econometric research, where e.g. Athey et al. (2018) apply it to treatment effect estimation and develop a class of Matrix Completion estimators that can be used to derive counterfactual entries in panel data where the untreated outcomes for the treated subgroup are missing. The approach uses the control outcomes for the untreated subgroup to impute these values and creates a matrix that approximates the original one well but has a significantly lower complexity according to the Nuclear Norm. The proposed estimator minimizes the squared Frobenius norm of the matrix difference, just as in Least Squares estimation, with the product of a penalization term and the Nuclear Norm of the imputed matrix added to this minimization, thus forcing its complexity to decrease in a way similar to the method we evaluate in this paper.

For all the methods concerning factor models, the Principal Components technique is applied in order to extract the factors from the data. This method reduces the dimensionality of large data sets while preserving a portion of the variance-covariance matrix that is as large as possible for the number of factors that is used. This is achieved by summarizing the original data (interrelated variables) in fewer unrelated variables: the Principal Components. The Principal Components method was first combined with the linear regression of Y on X by Coakley et al. (2002), who propose a method that extracts the factors by Principal Components from initial OLS residuals and then obtains the final estimates in a second step, where the extracted factors are treated as observables in an augmented regression of Y on X and these factors. However, this method turned out to be very error-prone, so building on it, Bai (2009) proposed a better estimator that repeats these steps until convergence. He also shows that the estimator obtained by applying this method is consistent even if the explanatory variables and the extracted factors are correlated, when both N and T tend to infinity, irrespective of their ratio, which is a strong result.


of the common components by shrinking its singular values. They show that the extracted factors, though biased, can be more efficient than the LS estimates in terms of Mean Squared Error. They also derive two new methods to estimate the number of factors. Moon and Weidner (2017) show that the Least Squares estimator has an asymptotic bias and provide a bias-corrected LS estimator. Practical applications of such models are given by e.g. Gobillon and Magnac (2016), who apply the LS estimator for regional policy estimation.

These estimators however all require the researcher to determine a priori the number of factors to be extracted from the data. As an (educated) guess from the researcher is usually a bad initialization for such routines, a data-driven estimator for the number of factors is preferred. In order to estimate this number, several methods have been proposed over the years. Bai and Ng (2002) pioneered with two data-driven methods that both minimize a certain selection criterion. They developed the PC and IC estimators, which both use eigenvalues of the second-moment matrix of Y to determine the most appropriate number of factors to be extracted. Especially their PC estimator has been hugely popular. This estimator simply counts the number of factors as the number of eigenvalues that exceed some threshold penalty function. The most important finding of Bai and Ng is that, since the convergence depends on min(N,T), so should the threshold function. There are however some issues: first, threshold functions are never uniquely specified, and second, these methods still need a given maximum number of factors. Ideally, the estimators should not be too sensitive to misspecification of this upper bound. Onatski (2010) introduced the so-called ED estimator, which uses the differences of adjacent eigenvalues ordered in descending order.

Ahn and Horenstein (2013) propose two new data-driven methods to determine the number of factors: the eigenvalue ratio and the growth ratio. The eigenvalue ratio estimator maximizes the ratio of adjacent eigenvalues in descending order, while the growth ratio maximizes the ratio of the growth rates of residual variances when one fewer principal component is extracted. Both estimators are easy to compute. In the paper that presents these estimators, their performance is tested under various scenarios for the simulated data sets. As additional estimators to compare the performance of their new methods to, the authors use the BIC3 estimator of Bai and Ng (2002), the ED estimator of Onatski (2010) and the ABC estimator by Alessi (2010). They find that for cross-sectionally correlated errors, their estimators perform at least as well as the others in cases where N = T ≥ 75. When the factors are weak, the estimators are no longer perfectly accurate for large N and T, but still outperform the other estimators. Also, a very erroneous specification of the maximum number of factors can lead to extreme errors in the computation of the number of factors from the data, especially under the BIC3 method of Bai and Ng (2002). The newly proposed estimators still show some sensitivity to this kind of misspecification, but even for large maximum values this error stays below 4. Ahn and Horenstein write that under all circumstances it is important to avoid choosing an excessively large maximum value and therefore propose an upper bound for it. In their final simulation, they let one factor have a dominantly strong explanatory power compared to the others. In this scenario the ER estimator performs weakly and is outperformed by both the ABC and the ED estimator. The GR estimator however performs quite well when the data sets are sufficiently large (N = T ≥ 150).


3 Estimators

We introduce some notation that will be used in the remainder of the text. First, the operator μ̃_r(·) gives the r-th largest eigenvalue of a matrix and the operator s_q(·) gives the q-th largest singular value of a matrix. The rank of a matrix, i.e. the number of its non-zero singular values, is given by the operator ‖·‖_0, and the Nuclear Norm, which is the sum of a matrix's singular values, is given by the operator ‖·‖_1. Finally, the Frobenius Norm, the square root of the sum of all squared entries of a matrix, i.e. (Σ_{i=1}^N Σ_{t=1}^T (·)_{it}^2)^{1/2}, is denoted by ‖·‖_2. The projection matrix operator P_(·), when applied to some matrix or vector A, gives the orthogonal projection matrix P_A = A(A'A)^{-1}A', and the operator M_(·) gives the difference between the identity matrix and the orthogonal projection matrix of that vector or matrix, i.e. M_A = I − P_A. All the following estimators are applied to estimate models of the structure defined in equation (1), while varying in the extent to which they take the Interactive Fixed Effects into account. The panels Y and X_k, k = 1, 2, ..., K, are given and based on them the parameters β_k are estimated.

3.1 Pooled Ordinary Least Squares

This is the most straightforward estimator, where the columns of the matrices Y and X_k are stacked on top of each other in order to create the (NT × 1)-vectors Y_• and X_•^{(k)}. The K vectors X_•^{(k)} are then combined as the K columns of X_• ∈ R^{NT×K} and the pooled estimator is given by

$$\hat{\beta}_{POLS} = (X_\bullet' X_\bullet)^{-1} X_\bullet' Y_\bullet. \qquad (2)$$

We consider this estimator, rather than the Fixed Effects estimator, as it assumes that the intercepts are homogeneous, i.e. β_{0,i} = β_0 for all i. This is the case for all the data generating processes we consider, so treating the intercepts as free parameters would add unnecessary complication to the estimation; the FE estimator is therefore not considered. The POLS estimator however ignores the possible factor structure in the data. It is often used as an educated guess for the initialization of the other routines, rather than as a serious estimate of the underlying parameters in factor models. These routines then do take into account the factor structure in the data, iterating from the pooled OLS estimate to an estimate that accounts for the interactive fixed effects. The Pooled OLS estimator is unbiased and consistent if the variables x_{it}^{(k)} are strictly exogenous and the intercepts are homogeneous. The estimates remain consistent under heteroskedastic errors or temporal dependence, for T fixed and N large. All relevant proofs can be found in standard statistics textbooks, e.g. Pesaran (2015).
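As an illustration, a minimal R sketch of this stacking and of the estimator in equation (2) could look as follows; the names (`pols`, `Y`, `X`) are ours, with `Y` an N × T matrix and `X` a list of K such matrices.

```r
# Pooled OLS: stack the columns of the N x T panels into (NT x 1) vectors
# and apply the usual OLS formula from equation (2).
pols <- function(Y, X) {
  y_vec <- as.vector(Y)                        # Y_bullet: columns of Y stacked
  X_mat <- sapply(X, as.vector)                # X_bullet: NT x K regressor matrix
  solve(crossprod(X_mat), crossprod(X_mat, y_vec))   # (X'X)^{-1} X'Y
}
```

Including a matrix of ones as one of the elements of `X` yields an estimate of the constant, in line with the matrix X_0 used in the simulation setup.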

3.2 Least Squares

The Least Squares method treats the factors and factor loadings of the interactive fixed effects as parameters. The squared Frobenius norm of the difference between the terms on both sides of equation (1) is then minimized by means of an iterative routine. Given the definition of the Frobenius Norm, minimization of its square is equivalent to Least Squares estimation. The objective function is correspondingly given by:

$$\hat{\beta}_{LS} = \arg\min_{\beta} \min_{\lambda, f} \left\| Y - \sum_{k=1}^{K} \beta_k X_k - \lambda f' \right\|_2^2. \qquad (3)$$

Under regularity conditions, especially the exogeneity of X_k with respect to E, it is known that this estimator is √(NT)-consistent and asymptotically normal as N, T → ∞ at the same rate, as was proven by Bai (2009). It has a bias in the limiting distribution that can be corrected for. The routine requires an initial starting value for the beta parameters, for example the pooled OLS estimate described in Section 3.1. This value is then used to compute the first-step residuals Y − Σ_{k=1}^K β_k X_k, from which the estimates λ̂ and f̂ are subsequently extracted as the first R principal components. This number R is either given by the researcher or determined using a data-driven method; more on this in Section 3.6. The estimates are then used to project the factors and loadings out of the matrices Y and X_k by rescaling these matrices. The rescaled matrices are then used to obtain a pooled OLS estimate for the beta parameters, applying the method from the section above, which is used to update the initial starting value. The procedure is repeated until convergence of the parameter estimates.

This algorithm however does not necessarily reduce the objective function in each step. As an increasing objective function implies that we are far away from a minimum, it is better to rerun the iteration with a different initialization, to prevent it from going in the wrong direction. Therefore, next to convergence of the parameter estimates, a decreasing objective function is the second criterion that has to be met for an optimal solution. But even in case of convergence, the routine is restarted with other parameter initializations a couple of times, which results in numerous candidate solutions, from which the one with the lowest value for the objective function in equation (3) is selected. This makes it very likely that the minimum that was found is the global one. It is necessary to do this as the objective function is non-convex, which means that it can have multiple local extreme points, as described in Section 1. All things considered, the algorithm works as follows:

1. Some parameter initialization is given; we let β̂_k^{(0)} = β̂_{k,POLS} + N(0, 1), k = 1, 2, ..., K.

2. Using these initial values, the first R factor loadings and factors are obtained from the residuals by the method of Principal Components, using the most recent estimate for β:
$$(\hat{\lambda}^{(r+1)}, \hat{f}^{(r+1)}) = \arg\min_{\lambda, f} \left\| Y - \sum_{k=1}^{K} \hat{\beta}_k^{(r)} X_k - \lambda f' \right\|_2^2.$$

3. The value of the objective function is computed from the residual matrix E* = Y − Σ_{k=1}^K β̂_k^{(r)} X_k − λ̂^{(r+1)} f̂'^{(r+1)} as the trace of E*'E*, i.e. the sum of its diagonal elements, which equals the squared Frobenius norm of E*.

4. We check whether the objective function decreased. If not, the algorithm is restarted with a different initialization.

5. The estimated factors and factor loadings are projected out of the matrices Y and X_k, k = 1, 2, ..., K, in the following way: Y* = M_{λ̂^{(r+1)}} Y M_{f̂^{(r+1)}} and X_k* = M_{λ̂^{(r+1)}} X_k M_{f̂^{(r+1)}}.

6. The columns of these matrices are then stacked, as in Section 3.1, and the stacked matrices are used to obtain an update for the parameter estimates: β̂^{(r+1)} = (X_•*' X_•*)^{-1} X_•*' Y_•*.


To get a reasonable coverage of the parameter space and to increase the chance of having the global minimum among the estimates, the whole procedure is repeated 50 times. The estimate corresponding to the smallest value of the objective function is reported as β̂_LS. We also report the "one-shot" LS estimate, which is the estimate obtained when the routine described above is run just once. This estimate β̂_OSLS is interesting as it can give a local extreme point, while the Nuclear Norm Regularized estimators always give the global minimum at the first shot.

3.3 Nuclear Norm Minimization

The Nuclear Norm Minimization estimator is given by the arguments that minimize the Nuclear Norm of the residual matrix Y − Σ_{k=1}^K β_k X_k. The algorithm used to find these optimal parameter values starts at the pooled OLS estimates. As Moon and Weidner (2019) show, this objective function is strictly convex, hence eliminating the possibility of ending up at a local extreme point rather than the desired global minimum. Therefore, using the POLS estimate as an initialization causes no problems. After the POLS estimates are derived, the standard deviation of the corresponding residuals is used to rescale both the variables X and Y, and the objective function is defined as the sum of the singular values of the residual matrix, i.e. its Nuclear Norm. Application of the Broyden-Fletcher-Goldfarb-Shanno algorithm to optimize this objective function then provides the researcher with the parameter estimates β̂_{k,NNmin}, k = 1, 2, ..., K. The BFGS algorithm is an iterative quasi-Newton method for solving unconstrained nonlinear numerical optimization problems. Summarizing, the method works as follows:

1. Obtain the estimates β̂_POLS in the same way as in equation (2).

2. Obtain the residuals Ê* = Y − Σ_{k=1}^K β̂_{POLS,k} X_k and compute their standard deviation σ_{Ê*}.

3. Define Y* = Y/(σ_{Ê*} √(NT)) and X_k* = X_k/(σ_{Ê*} √(NT)), k = 1, 2, ..., K.

4. Define the objective function as
$$\hat{\beta}_{NNmin} = \arg\min_{\beta} \left\| Y - \sum_{k=1}^{K} \beta_k X_k \right\|_1.$$

5. Optimize using the BFGS algorithm with the initialization β_k = β̂_{POLS,k}, Y = Y* and X_k = X_k* for k = 1, 2, ..., K.

This estimator thus provides consistent estimates β̂_NNmin without the need to prespecify a parameter such as R_max.
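A minimal R sketch of this routine, under our own naming and with `pols` as sketched above, is the following; the rescaling factor mirrors steps 2 and 3.

```r
# Nuclear Norm Minimization: minimize the sum of the singular values of the
# residual matrix, starting from the pooled OLS estimate and using BFGS.
nn_min <- function(Y, X) {
  beta_pols <- as.vector(pols(Y, X))
  E_hat <- Y - Reduce(`+`, Map(function(b, Xk) b * Xk, beta_pols, X))
  scale <- sd(as.vector(E_hat)) * sqrt(nrow(Y) * ncol(Y))   # sigma_E * sqrt(NT)
  Y_s <- Y / scale
  X_s <- lapply(X, function(Xk) Xk / scale)
  obj <- function(beta) {                     # Nuclear Norm of the residual matrix
    E <- Y_s - Reduce(`+`, Map(function(b, Xk) b * Xk, beta, X_s))
    sum(svd(E)$d)
  }
  optim(beta_pols, obj, method = "BFGS")$par
}
```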

3.4 Nuclear Norm Penalization

The Nuclear Norm Penalization estimator has an objective function consisting of the squared Frobenius norm of the difference between the terms on both sides of equation (1), i.e. the objective function for the Least Squares minimization from equation (3), with a penalization term added to it, consisting of the Nuclear Norm of Γ = λf' multiplied by the penalization parameter ψ. The objective function is thus given by:

$$\hat{\beta}_{NNpen} = \arg\min_{\beta} \min_{\Gamma} \left[ \frac{1}{2} \left\| Y - \sum_{k=1}^{K} \beta_k X_k - \Gamma \right\|_2^2 + \psi \left\| \Gamma \right\|_1 \right]. \qquad (4)$$

This function is always convex and the penalization parameter ψ > 0 can be chosen in the following data-dependent manner, using the Nuclear Norm Minimizing estimates from Section 3.3, for a given choice of the upper bound for the number of factors R_max:

$$\hat{E}^\star = Y - \sum_{k=1}^{K} \hat{\beta}_{k,NNmin} X_k, \qquad (5)$$
$$\hat{\psi} = \frac{2\, s_{(R_{max}+1)}(\hat{E}^\star)}{\sqrt{NT}}. \qquad (6)$$

The routine used to obtain the Nuclear Norm Penalized estimates starts with an initialization for the beta parameters: the Pooled OLS estimates. This estimate is then used to obtain the residuals, whose standard deviation is thereupon, just as in the algorithm for the Nuclear Norm Minimization, used to rescale all the panels in order to improve the algorithm's speed. Consequently the penalization parameter ψ̂ has to be rescaled as well. Finally, following the theory provided in Moon and Weidner (2019), the objective function is rewritten as a sum of transformations of this residual matrix's rescaled singular values. The objective function is then minimized using the Broyden-Fletcher-Goldfarb-Shanno algorithm. Summarizing, the procedure is as follows:

1. Obtain the estimates β̂_POLS in the same way as in equation (2).

2. Obtain the residuals Ê* = Y − Σ_{k=1}^K β̂_{POLS,k} X_k and compute their standard deviation σ_{Ê*}.

3. Define Y* = Y/(σ_{Ê*} √(NT)), X_k* = X_k/(σ_{Ê*} √(NT)), k = 1, 2, ..., K, and ψ̂* = ψ̂/σ_{Ê*}.

4. Define the objective function as
$$\hat{\beta}_{NNpen} = \arg\min_{\beta} \sum_{q=1}^{\min(N,T)} \zeta_q, \qquad (7)$$
$$\zeta_q = \begin{cases} \psi s_q(E) - \tfrac{\psi^2}{2} & \text{if } s_q(E) \ge \psi \\ \tfrac{1}{2} s_q(E)^2 & \text{if } s_q(E) < \psi \end{cases} \qquad \text{for } q = 1, 2, \ldots, \min(N,T), \qquad (8)$$
$$E = Y - \sum_{k=1}^{K} \beta_k X_k. \qquad (9)$$

5. Optimize using the BFGS algorithm with the initialization β = β̂_POLS, ψ = ψ̂*, Y = Y* and X_k = X_k* for k = 1, 2, ..., K.

The resulting estimates β̂_NNpen converge to the global minimum without any restrictions on the ranks of the regressors.
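A minimal R sketch of the profiled objective in equations (7)-(9), with our own naming; `Y_s` and `X_s` denote the rescaled panels and `psi` the rescaled penalization parameter ψ̂*.

```r
# Nuclear Norm Penalized objective: thresholded singular values of the
# residual matrix, as in equations (7)-(9).
nn_pen_obj <- function(beta, Y_s, X_s, psi) {
  E <- Y_s - Reduce(`+`, Map(function(b, Xk) b * Xk, beta, X_s))
  s <- svd(E)$d
  sum(ifelse(s >= psi, psi * s - psi^2 / 2, s^2 / 2))   # the zeta_q terms
}

# Minimization with BFGS, started from the pooled OLS estimate:
# beta_nnpen <- optim(beta_pols, nn_pen_obj, Y_s = Y_s, X_s = X_s, psi = psi_s,
#                     method = "BFGS")$par
```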

3.5 Post Nuclear Norm Estimation


Moon and Weidner (2019) show that after three iterations the Post Estimation estimates are statistically indistinguishable from the Least Squares estimates for all scenarios they consider. The algorithm starts with either the Nuclear Norm Minimization or the Nuclear Norm Penalization estimate and estimates the R factor loadings and factors of the estimated residuals Ê = Y − Σ_{k=1}^K β̂_k X_k as the first R Principal Components. The number of factors that should be extracted is either given or estimated based on the data; more on such estimators in Section 3.6. Next, the parameter estimate is updated by computing the Pooled OLS estimator where the estimated factors and factor loadings are projected out of the matrix X_• and the vector Y_•, which are defined as in Section 3.1. The procedure is thus as follows:

1. Use β̂_NNmin or β̂_NNpen as the preliminary estimate β̂^{(0)}. In this study we use β̂_NNpen.

2. Estimate, using Principal Component Analysis, the R factors and the corresponding factor loadings from the residuals obtained with the most recent estimate for β:
$$(\hat{\lambda}^{(r+1)}, \hat{f}^{(r+1)}) = \arg\min_{\lambda, f} \left\| Y - \sum_{k=1}^{K} \hat{\beta}_k^{(r)} X_k - \lambda f' \right\|_2^2.$$

3. Update the estimate:
$$\hat{\beta}^{(r+1)} = \left( X_\bullet' \left( M_{\hat{f}^{(r+1)}} \otimes M_{\hat{\lambda}^{(r+1)}} \right) X_\bullet \right)^{-1} X_\bullet' \left( M_{\hat{f}^{(r+1)}} \otimes M_{\hat{\lambda}^{(r+1)}} \right) Y_\bullet. \qquad (10)$$

4. Repeat the above for a fixed number of repetitions p, p ≥ 2.

This algorithm is thus a finite number of loops of the LS-routine described in section 3.2, with a Nuclear Norm Regularization estimate as initialization. Because this estimate is used, one does not need to check whether the objective function is decreasing or rerun from different starting points, as the iteration is guaranteed to start in the close neighborhood of the optimum.
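A compact R sketch of this loop under our own naming; `extract_ife` and `pols` are the helpers sketched earlier, and projecting the estimated factors out of the panels before applying pooled OLS is used as an equivalent form of the update in equation (10).

```r
# Post Nuclear Norm estimation: a fixed number p of least-squares updates,
# started from a Nuclear Norm Regularized estimate beta_init.
post_nn <- function(Y, X, R, beta_init, p = 3) {
  beta <- beta_init
  for (step in seq_len(p)) {
    resid <- Y - Reduce(`+`, Map(function(b, Xk) b * Xk, beta, X))
    ife <- extract_ife(resid, R)              # first R principal components
    M_l <- diag(nrow(Y)) - ife$lambda %*% solve(crossprod(ife$lambda)) %*% t(ife$lambda)
    M_f <- diag(ncol(Y)) - ife$f %*% solve(crossprod(ife$f)) %*% t(ife$f)
    # Equivalent to equation (10): M_f (x) M_l is symmetric and idempotent,
    # so the Kronecker form reduces to pooled OLS on the projected panels.
    beta <- as.vector(pols(M_l %*% Y %*% M_f,
                           lapply(X, function(Xk) M_l %*% Xk %*% M_f)))
  }
  beta
}
```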

3.6 Estimators for the Number of Factors

Both the Least Squares and the Post Nuclear Norm estimator require the number of factors R as an input parameter, and the data-driven choice for the penalization parameter ψ also requires the specification of a maximum number of factors R_max. Both numbers can be given by the researcher or can be estimated based on the data. For the latter, Moon and Weidner (2019) give a new estimator for the number R, given below in equation (11). It is joined by four other estimators for the number of factors that have been derived by Ahn and Horenstein (2013) and Bai and Ng (2002).

3.6.1 Moon and Weidner

The newly proposed estimator thresholds the singular values of the residual matrix from equation (5), using the estimate ψ̂ from equation (6).


Their motivation is that this estimator uses the information that singular values of Ê* that are significantly larger than √(NT) ψ̂ should correspond to factors, while singular values close to it or smaller likely originate from noise.

3.6.2 Eigenvalue Ratio

The first estimator proposed by Ahn and Horenstein (2013) is the Eigenvalue Ratio estimator:

$$\hat{R}_{ER} = \arg\max_{r} \frac{\tilde{\mu}_r(\hat{E}^\star)}{\tilde{\mu}_{r+1}(\hat{E}^\star)}, \qquad r = 1, 2, \ldots, R_{max}. \qquad (12)$$

The estimator thus maximizes the ratio of two adjacent eigenvalues of the squared residual matrix.

3.6.3 Eigenvalue Growth Rate

In addition to the Eigenvalue Ratio estimator, Ahn and Horenstein (2013) propose the Eigenvalue Growth Rate estimator:

$$\hat{R}_{GR} = \arg\max_{r} \frac{\ln\left[V(r-1)/V(r)\right]}{\ln\left[V(r)/V(r+1)\right]}, \qquad r = 1, 2, \ldots, R_{max}, \qquad (13)$$

where

$$V(r) = \sum_{j=r+1}^{\min(N,T)} \tilde{\mu}_j(\hat{E}^\star). \qquad (14)$$

The estimator thus maximizes the ratio of two adjacent growth rates of the residual variance when one more principal component is used.

3.6.4 Bai and Ng 1

Bai and Ng (2002) were the first to consider data-driven estimators to determine the number of factors to be extracted. Of the three estimators they propose, two perform exceptionally well. The first one is:

$$\hat{R}_{BIC1} = \arg\min_{r} \left[ \ln\left(W(r)\right) + r \frac{N+T}{NT} \ln\left( \frac{NT}{N+T} \right) \right], \qquad r = 1, 2, \ldots, R_{max}, \qquad (15)$$

where

$$W(r) = \frac{1}{NT} \sum_{i=1}^{N} \sum_{t=1}^{T} \left( \hat{E}^\star_{i,t} - \hat{\lambda}_i(r) \hat{f}_t'(r) \right)^2. \qquad (16)$$


3.6.5 Bai and Ng 3

The final estimator that we will consider is the third estimator proposed in Bai and Ng (2002):

$$\hat{R}_{BIC3} = \arg\min_{r} \left[ \ln\left(W(r)\right) + r \frac{\ln\left(\min(N,T)\right)}{\min(N,T)} \right], \qquad r = 1, 2, \ldots, R_{max}. \qquad (17)$$
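The following R sketch illustrates how the ER, GR, BIC1 and BIC3 criteria can be computed from the singular values of the residual matrix. The naming is ours, and we take the eigenvalues μ̃_r to be those of Ê*Ê*'/(NT), i.e. the scaled squared singular values of Ê*; this normalization is our assumption, although it does not affect the ER and GR criteria.

```r
# Data-driven estimators for the number of factors from the residual matrix
# E_hat (N x T). Requires R_max + 2 <= min(N, T).
n_factors <- function(E_hat, R_max) {
  N <- nrow(E_hat); T <- ncol(E_hat)
  mu <- svd(E_hat)$d^2 / (N * T)              # assumed eigenvalues mu_r
  V  <- rev(cumsum(rev(mu)))                  # V[j] = mu_j + mu_{j+1} + ...
  r  <- 1:R_max
  # By the SVD, the PCA residual variance W(r) in eq. (16) equals the sum of
  # the omitted eigenvalues, i.e. V(r).
  W  <- V[r + 1]
  c(ER   = which.max(mu[r] / mu[r + 1]),                                     # eq. (12)
    GR   = which.max(log(V[r] / V[r + 1]) / log(V[r + 1] / V[r + 2])),       # eq. (13)
    BIC1 = which.min(log(W) + r * (N + T) / (N * T) * log(N * T / (N + T))), # eq. (15)
    BIC3 = which.min(log(W) + r * log(min(N, T)) / min(N, T)))               # eq. (17)
}
```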


4 Simulation Setup

Just as in Moon and Weidner (2019), we simulate the data sets following a linear panel regression model with interactive fixed effects that allows for any number of regressors and factors. The basic data generating process is given by the following equations:

$$Y_{i,t} = \beta_0 + \sum_{k=1}^{K-1} \beta_k X_{i,t}^{(k)} + \sum_{r=1}^{R} \lambda_{i,r} f_{t,r} + E_{i,t}, \qquad (18)$$
$$X_{i,t}^{(k)} = 1 + \sum_{r=1}^{R} \left(\lambda_{i,r} + \lambda_{i,r}^{(X)}\right)\left(f_{t,r} + f_{t-1,r}\right) + E_{i,t}^{(X^{(k)})}, \qquad (19)$$

where λ_{i,r}, λ_{i,r}^{(X)} ~ i.i.d. N(1, 1) for i = 1, 2, ..., N and r = 1, 2, ..., R. For each of the scenarios we consider, 1000 data sets are simulated, from which the estimates are derived. For convenience we add a matrix X_0 to the estimation, containing only ones. This allows us to estimate the constant β_0 as well, without having to alter the setup of the estimators.
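A minimal R sketch of one simulated data set from this data generating process is given below; the function name and defaults are ours, and the i.i.d. baseline of Section 4.1 is assumed for the factors and errors.

```r
# Simulate one data set from the interactive fixed effects DGP in (18)-(19),
# with Kx explanatory variables (besides the constant) and R factors.
simulate_panel <- function(N = 100, T = 100, Kx = 2, R = 2,
                           beta0 = 1, beta = rep(1, Kx)) {
  lambda   <- matrix(rnorm(N * R, mean = 1), N, R)     # lambda_{i,r}       ~ N(1, 1)
  lambda_X <- matrix(rnorm(N * R, mean = 1), N, R)     # lambda^{(X)}_{i,r} ~ N(1, 1)
  f <- matrix(rnorm((T + 1) * R), T + 1, R)            # f_{t,r} ~ N(0, 1), t = 0, ..., T
  f_cur <- f[-1, , drop = FALSE]                       # f_t
  f_lag <- f[-(T + 1), , drop = FALSE]                 # f_{t-1}
  X <- replicate(Kx, 1 + (lambda + lambda_X) %*% t(f_cur + f_lag) +
                     matrix(rnorm(N * T), N, T), simplify = FALSE)
  Y <- beta0 + Reduce(`+`, Map(`*`, beta, X)) +
       lambda %*% t(f_cur) + matrix(rnorm(N * T), N, T)
  list(Y = Y, X = X)
}
```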

4.1 i.i.d. Errors and Factors

Initially, just as in Moon and Weidner (2019), we let E_{i,t}, E_{i,t}^{(X^{(k)})} ~ i.i.d. N(0, 1) and f_{t,r} ~ i.i.d. N(0, 1) for t = 1, 2, ..., T, r = 1, 2, ..., R and k = 1, 2, ..., K. In this setup of independent and identically distributed factors and error terms, all estimators are consistent and we evaluate their sensitivity to increases in the numbers of factors and variables, to different values of the parameters β_k, k = 1, 2, ..., K, and to different numbers of individuals N and time periods T.

4.2 Cross-Sectionally Correlated Errors in the Explanatory Variables

For the following scenarios, we fix T and N at 100 and the parameter values β_k at 1, k = 1, 2, ..., K. We drop the assumption of independently and identically distributed error terms and simulate some particular error structures. First, we simulate a scenario very similar to the one above, except that there now is cross-sectional correlation between the explanatory variables of the same time period. Such a scenario arises when e.g. parents' income and parents' education are both used to explain an individual's income. In order to see the impact of the cross-sectional correlation we set the number of explanatory variables to four (because the first variable is always the constant), while still assuming two factors. Following the same setup as above, this means that all the error terms now follow the multivariate distribution:

$$\begin{pmatrix} E_{i,t} \\ E_{i,t}^{(X^{(1)})} \\ E_{i,t}^{(X^{(2)})} \\ \vdots \\ E_{i,t}^{(X^{(K-1)})} \end{pmatrix} \overset{i.i.d.}{\sim} \mathcal{N}\left( 0, \begin{pmatrix} 1 & 0 & 0 & 0 & \cdots \\ 0 & 1 & \rho & \rho & \cdots \\ 0 & \rho & 1 & \rho & \cdots \\ 0 & \rho & \rho & 1 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix} \right), \quad i = 1, 2, \ldots, N,\ t = 1, 2, \ldots, T. \qquad (20)$$
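A sketch of how such correlated error draws can be generated in R, assuming the MASS package for multivariate normal draws (the function and argument names are ours):

```r
# Draw the error vector (E, E^{X(1)}, ..., E^{X(K-1)}) for each (i, t) with
# pairwise correlation rho between the errors of the explanatory variables,
# as in equation (20).
draw_corr_errors <- function(n_draws, Kx, rho) {
  Sigma <- diag(Kx + 1)
  Sigma[2:(Kx + 1), 2:(Kx + 1)] <- rho      # correlate the X errors with each other
  diag(Sigma) <- 1                          # unit variances on the diagonal
  MASS::mvrnorm(n_draws, mu = rep(0, Kx + 1), Sigma = Sigma)
}
# errors <- draw_corr_errors(N * T, Kx = 3, rho = 0.4)   # one row per (i, t)
```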


4.3 Autocorrelated Errors

Under the scenario of autocorrelated error terms, the errors are correlated with lagged values of themselves. This is the case when external shocks to the environment not only affect the current time period, but through the serial correlation have an impact on future errors as well. This impact however dies out over time. We assume that the errors in both the dependent and the explanatory variables have autocorrelation of the same magnitude. We define E_i and E_i^{(X^{(k)})} for i = 1, 2, ..., N and k = 1, 2, ..., K as the (T × 1)-vectors containing the elements E_{i,t} and E_{i,t}^{(X^{(k)})} respectively. Autocorrelation means that these error terms follow a multivariate distribution of the following form:

$$E_i,\ E_i^{(X^{(k)})} \overset{i.i.d.}{\sim} \mathcal{N}\left( 0, \begin{pmatrix} 1 & \rho & \rho^2 & \rho^3 & \cdots \\ \rho & 1 & \rho & \rho^2 & \cdots \\ \rho^2 & \rho & 1 & \rho & \cdots \\ \rho^3 & \rho^2 & \rho & 1 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix} \right), \quad i = 1, 2, \ldots, N,\ k = 1, 2, \ldots, K. \qquad (21)$$

Again, the estimators should remain consistent, but high autocorrelation could affect the rank of the regressors, causing the Least Squares estimates to be biased. We fix the number of variables and factors to two and let the autocovariance vary between 0.2 and 0.8.

4.4 Heteroskedastic Errors

In the next scenario, we create heteroskedasticity in the dependent variable. This is done by multiplying its error term by the exponent of one of the factors, thus making it dependent on the realization of this random variable. Just as in the i.i.d. case, we consider two explanatory variables and two factors and let E_{i,t}^{(X^{(k)})} ~ i.i.d. N(0, 1) and f_{t,r} ~ i.i.d. N(0, 1) for t = 1, 2, ..., T and k = 1, 2, ..., K. In order to induce the heteroskedasticity, we let

$$E_{i,t} = \left( 0.25 + \alpha \exp(f_{t,2}) \right) \epsilon_{i,t}, \qquad \epsilon_{i,t} \overset{i.i.d.}{\sim} \mathcal{N}(0, 1). \qquad (22)$$

This way, different levels of the factor cause different levels of heteroskedasticity, as is illustrated in Figure 2. The constant 0.25 is arbitrary and ensures that even for very negative values of the factor there still is some (even if very little) error in the dependent variable; without it there would be none, which is unrealistic.
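A short R sketch of this error structure (our naming; `f2` is the T-vector of realizations of the second factor):

```r
# Heteroskedastic errors for the dependent variable, as in equation (22):
# the scale of the error depends on the realization of the second factor.
het_errors <- function(N, f2, alpha) {
  T <- length(f2)
  scale <- matrix(0.25 + alpha * exp(f2), N, T, byrow = TRUE)  # varies over t only
  scale * matrix(rnorm(N * T), N, T)
}
```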


Figure 2: Scatter plots of the error term against the value of the factor f_2 ("Distribution of the Error") for heteroskedasticity parameter (a) α = 0.1, (b) α = 0.25 and (c) α = 0.4. The plots show the increasing magnitude of the heteroskedasticity; note the different scales on the y-axis.

4.5 Autocorrelated Factor

In the scenarios described above, only the error structures were altered. We now turn to the factor structure and alter the data generating process in such a way that one of the factors is related to its own first lag, making the structure partially dynamic. We again assume two variables and two factors, but now only one of the factors has elements f_{t,1} ~ i.i.d. N(0, 1) for t = 1, 2, ..., T, while the other is defined as

$$f_{t,2} = \alpha f_{t-1,2} + \sqrt{1 - \alpha^2}\, \epsilon_t, \qquad \epsilon_t \overset{i.i.d.}{\sim} \mathcal{N}(0, 1). \qquad (23)$$

We let the error terms again be E_{i,t}, E_{i,t}^{(X^{(k)})} ~ i.i.d. N(0, 1) and vary the level of factor autocorrelation α between 0.2 and 0.8. The Least Squares estimator should remain consistent under this scenario, but the increased strength of the factors will likely affect the performance of the Nuclear Norm Regularized estimators.
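A small R sketch of this factor process (our naming), drawing the first observation from the stationary N(0, 1) distribution, which is an assumption on our part:

```r
# Generate the second factor as the AR(1) process in equation (23),
# scaled so that its stationary variance equals one.
ar1_factor <- function(T, alpha) {
  f <- numeric(T)
  f[1] <- rnorm(1)                          # assumed start from the stationary law
  for (t in 2:T) f[t] <- alpha * f[t - 1] + sqrt(1 - alpha^2) * rnorm(1)
  f
}
```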

4.6 Dynamic Factor Model

We then move on to fully dynamic factor models. For these, the factors f_{t,r}, t = 1, 2, ..., T and r = 1, 2, ..., R, are no longer assumed to be independently and identically generated, but instead follow a time-series process of the following vector autoregressive form:

$$f_t = \Psi(L) f_{t-1} + \epsilon_t, \qquad \epsilon_t \overset{i.i.d.}{\sim} \mathcal{N}(0, 1), \qquad (24)$$

where f_t contains the elements f_{t,r}. We again fix the number of variables and factors to two, and include only the first lag in the autoregressive process generating the factors. The matrix Ψ(L) then looks as follows:

$$\Psi(L) = \begin{pmatrix} \chi & \phi \\ \phi & \chi \end{pmatrix}.$$
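A short R sketch of this VAR(1) factor process (our naming). The values of χ and φ below are purely illustrative placeholders, chosen so that the process is stationary, and the burn-in period is our own practical choice.

```r
# Generate R = 2 factors from the VAR(1) process in equation (24),
# with Psi = [chi phi; phi chi].
var1_factors <- function(T, chi = 0.5, phi = 0.2, burn = 100) {
  Psi <- matrix(c(chi, phi, phi, chi), 2, 2)
  f <- matrix(0, T + burn, 2)
  for (t in 2:(T + burn)) f[t, ] <- Psi %*% f[t - 1, ] + rnorm(2)
  f[(burn + 1):(T + burn), ]                # drop the burn-in draws
}
```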


4.7 Lagged Dependent Variable

We then let the data generating process include lags of the dependent variable. The number of variables and factors is again set to two and we assume i.i.d. errors and factors. For this model, equation (18) becomes:

$$Y_{i,t} = \sum_{l=1}^{L} \xi_l Y_{i,t-l} + \beta_0 + \sum_{k=1}^{K-1} \beta_k X_{i,t}^{(k)} + \sum_{r=1}^{R} \lambda_{i,r} f_{t,r} + E_{i,t}. \qquad (25)$$

We set L = 1 and let ξ vary from 0.4 to 0.8. All estimators should remain unbiased.
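A short R sketch of building Y recursively for this case (our naming; `base` holds the static part β_0 + Σ_k β_k X_{i,t}^{(k)} + Σ_r λ_{i,r} f_{t,r} + E_{i,t} as an N × T matrix, and the treatment of the initial period is our own simplification):

```r
# Add one lag of the dependent variable, as in equation (25) with L = 1.
add_lagged_y <- function(base, xi) {
  Y <- base
  for (t in 2:ncol(base)) Y[, t] <- xi * Y[, t - 1] + base[, t]
  Y                                         # the first period is left without a lag term
}
```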

4.8 Cross-Sectionally Correlated Errors between the Dependent and Explanatory Variables

For this last scenario, we keep the number of factors and explanatory variables fixed at two, but now allow the error terms in X to be cross-sectionally correlated with the error term in Y of the same period and individual. Note that this is highly similar to the scenario from Section 4.2. It means that the error terms follow the multivariate distribution:

$$\begin{pmatrix} E_{i,t} \\ E_{i,t}^{(X^{(1)})} \\ E_{i,t}^{(X^{(2)})} \\ \vdots \\ E_{i,t}^{(X^{(K-1)})} \end{pmatrix} \overset{i.i.d.}{\sim} \mathcal{N}\left( 0, \begin{pmatrix} 1 & \rho & \rho & \cdots \\ \rho & 1 & 0 & \cdots \\ \rho & 0 & 1 & \cdots \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix} \right), \quad i = 1, 2, \ldots, N,\ t = 1, 2, \ldots, T. \qquad (26)$$

Note that all the estimators will now become inconsistent, because the exogeneity of the variables X_k, k = 1, 2, ..., K, with respect to the error in Y is no longer satisfied, but we check whether some react more strongly to this setup than others. We vary the covariance parameter between 0.2 and 0.8.

4.9 Estimators for the Number of Factors

After evaluating the performance of the estimators, we continue our Monte Carlo study by evaluating the proposed estimator for the number of factors present in the data. Note that all estimators defined in equations (11) - (17) require a specified maximum number of factors, R_max. Moon and Weidner (2019) arbitrarily set this number equal to 10, as this is more than enough for all the cases they consider. We however follow Ahn and Horenstein (2013) and use their suggested choice for R_max, as it performed best in their Monte Carlo study. First, we find

$$R_{temp} = \sum_{k=1}^{\min(N,T)} \mathbf{1}\left\{ \tilde{\mu}_k(\hat{E}^\star) \ge V(0)/\min(N,T) \right\},$$

where V(k) is defined as in equation (14). Then, in a second step, we let R_max = min(R_temp, 0.1 min(N,T)). This is used as the upper boundary for the number of factors considered.
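In R, this choice of R_max can be computed as follows (our naming; the eigenvalues are again taken as the squared singular values of Ê*, and rounding the cap down to an integer is our own addition):

```r
# Data-driven upper bound R_max following Ahn and Horenstein (2013):
# count the eigenvalues that are at least as large as their average,
# and cap the result at 0.1 * min(N, T).
r_max_ah <- function(E_hat) {
  N <- nrow(E_hat); T <- ncol(E_hat)
  mu <- svd(E_hat)$d^2                       # eigenvalues of E_hat E_hat'
  R_temp <- sum(mu >= sum(mu) / min(N, T))   # mu_k >= V(0) / min(N, T)
  min(R_temp, floor(0.1 * min(N, T)))
}
```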


5 Results

5.1 i.i.d. Errors and Factors

We plot the root mean squared errors in the parameters β_k, k = 0, ..., K, so including the constant β_0, which (in the case of POLS, NNmin and NNpen) receives such a weight that the expected error term is zero. This results in it being biased in the opposite direction and at the magnitude of the sum of the biases in the variable parameters, as throughout the entire study the true constant is kept at unity. The plotted RMSEs do not show the direction of the bias, but the precise estimates are reported in tables that can be found in the appendix.

Figure 3: These plots show the behavior of the RMSE in the different estimates (POLS, LS, NNmin, NNpen, POST1, POST2, POST3 and OSLS) for an increasing number of factors; panels (a), (b) and (c) correspond to 2, 4 and 6 variables respectively.

Figure 3 shows that increasing the number of factors while keeping the number of variables constant causes the error in the Nuclear Norm Regularized estimates to increase. While the LS and POLS estimators remain unaffected in their precision, the RMSE in these parameter estimates thus increases with the number of factors. This is logical, as putting a high weight on the factors is penalized, thus forcing the algorithm to put an excess of weight on the parameter estimates instead in explaining the data. Interestingly, for high numbers of factors and variables, the RMSE in the Nuclear Norm Regularized estimates will eventually overtake the error in the POLS results. This apparent sensitivity of the newly proposed estimators to the number of variables and factors in the data is an inconvenient property that does not affect the Least Squares estimators. It is also notable that the POLS, NNmin and NNpen estimators always overestimate the variable parameters, while the LS estimates have very small errors in either direction. As Figure 3(b) shows, the "one-shot" LS estimate diverges from the LS estimate, indicating that the routine converges to multiple local extreme points. The standard deviations of all estimators decrease with the number of factors present in the data. As expected, the Post Estimation steps converge to the LS estimates after three steps.


Figure 4: These plots show the behavior of the RMSE in the different estimates for an increasing number of variables; panels (a), (b) and (c) correspond to 2, 4 and 6 factors respectively.

remains roughly constant as well when increasing the number of variables. Figure 4(c) again shows the convergence to local extreme points when the Least Squares routine is run just once.

The above results show that, while the NNmin and NNpen estimators take a much longer time than LS, even when one takes the safe route and runs that routine 50 times, they always provide estimates that converge to the true global minimum after three Post Estimation steps. When aiming to reach this minimum in one shot, these estimators combined with the Post Estimation steps are therefore an attractive option, as the Least Squares method requires multiple initializations in order to be sure that the global minimum is reached. The detailed results in the appendix show that, while the error in the "one-shot" LS estimate is considerably smaller than the error in the Nuclear Norm Regularized estimators, this method on average converges to a different extreme point, i.e. not the global minimum we are aiming to reach.

The results described so far solely considered scenarios where all variables had the same relative effect size on the dependent variable. In real data this is however highly unlikely; rather, one variable typically has a much stronger relative effect size than another. Therefore, we simulate data for a couple of ratios between the two parameters: 1:16 (β_1 = 0.125, β_2 = 2), 1:8 (β_1 = 0.125, β_2 = 1), 1:4 (β_1 = 0.25, β_2 = 1) and 1:1 (β_1 = β_2 = 1). In all scenarios the constant remains of magnitude 1. Figure 5 gives the error in the corresponding estimates using our set of estimators. As we see, the mean error in the estimates is not affected by a difference in relative effect size of the variables, and when looking at the variance there is no difference either. There is one slight increase in the RMSE of the LS estimates for β_1 = 0.125, β_2 = 1 that we have no explanation for. For the rest, all errors remain constant across all scenarios, indicating insensitivity of the (new) estimators to these variations.


Figure 5: Plot of the behavior of the RMSE in the different estimates for different ratios of the two variable parameters. The scenarios are plotted on the x-axis: under scenario 1, β_2 equals 16 times β_1, under scenario 2 eight times, under scenario 3 four times and under scenario 4 they are equal.

Figure 6: Plot of the behavior of the RMSE in the different estimates for different ratios of the panel dimensions. The scenarios are plotted on the x-axis: under scenario 1 the (T/N)-ratio is (10/1000), under scenario 2 (20/500), under scenario 3 (100/100), under scenario 4 (500/20) and under scenario 5 (1000/10).


(T/N)-ratio lets the error during the Post Estimation steps decrease faster than a low ratio does. This indicates that for the proposed steps based on the Nuclear Norm Regularized estimates it is more worthwhile to have long panels of few individuals than short panels of many individuals. The Pooled OLS and the LS estimates are not affected by changes in the panel dimensions. We also note that in the case of highly unbalanced panels, three steps of the Post Estimation are not enough to reach the global minimum. Rather, the steps converge to a different extreme point, as there is no notable difference between steps 2 and 3. This seemingly contradicts Moon and Weidner's theory about the convergence of their estimators to the one and only global minimum for these extreme cases. Rather, the Nuclear Norm Regularized estimates appear to be so far off the global minimum that even the Post Estimation steps fail to bring them back, and instead they converge to local extreme points. These steps even increase the standard deviation in each step, as the results in the appendix show. The convergence of the Post Estimation thus appears to depend on the quality of the Nuclear Norm Regularized estimate it starts from. Table 1 shows the huge range of these estimates in the case of highly unbalanced panels at the third step. We consequently discourage the use of the Nuclear Norm Regularized estimators for panel data where the numbers of individuals and time periods differ strongly. However, as the figure indicates, the "one-shot" Least Squares results are highly erroneous for such cases as well, so running multiple LS iterations is the only way to purposefully find the global minimum for such panels.

T/N        Mean     Min      Median   Max
10/1000    0.8720   0.2741   0.8839   1.3525
100/100    0.9994   0.9475   0.9994   1.0537
1000/10    0.8760   0.7008   0.8780   1.0392

Table 1: Estimation results for β0 of the POST(3) estimation.
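The dependence of the Post Estimation on its starting value can be illustrated with a sketch of a single iteration. The following is a minimal sketch, assuming that R factors and loadings are extracted by a truncated SVD (principal components) of the residual panel Y − Σk β̂k Xk and that β is then re-estimated by least squares after subtracting the estimated low-rank component; the function name and the exact update rule are illustrative and may differ from the routine described in Section 3.5.

```python
import numpy as np

def post_estimation_step(Y, X, beta_hat, R):
    """One illustrative post-estimation update starting from beta_hat.

    1. Form the residual panel W = Y - sum_k beta_k * X_k.
    2. Extract R factors/loadings from W via a truncated SVD (principal components).
    3. Re-estimate beta by least squares after removing the estimated factor structure.

    A constant term can be handled by including a regressor of ones in X.
    """
    K, N, T = X.shape
    W = Y - np.tensordot(beta_hat, X, axes=1)   # N x T residual panel
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    low_rank = U[:, :R] * s[:R] @ Vt[:R, :]     # estimate of lambda f'
    y_tilde = (Y - low_rank).ravel()
    X_mat = X.reshape(K, -1).T                  # (N*T) x K regressor matrix
    beta_new, *_ = np.linalg.lstsq(X_mat, y_tilde, rcond=None)
    return beta_new

# Starting from a Nuclear Norm Regularized estimate beta_nn, e.g. three steps:
# for _ in range(3):
#     beta_nn = post_estimation_step(Y, X, beta_nn, R=2)
```

Each such step can only lower the least squares objective given the current low-rank component, which is why a poor starting value can leave the iteration stuck near a local extreme point rather than the global minimum.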

5.2 Cross-Sectionally Correlated Errors in the Explanatory Variables


[Figure 7 plot — x-axis: Covariance rho; y-axis: RMSE in Estimate; series: POLS, LS, NNmin, NNpen, POST1, POST2, POST3, OSLS]

Figure 7: Plot of the behavior of the RMSE in the different estimates for an increasing covariance between the explanatory variables.

5.3 Autocorrelated Errors

[Figure 8 plot — x-axis: Covariance rho; y-axis: RMSE in Estimate; series: POLS, LS, NNmin, NNpen, POST1, POST2, POST3, OSLS]

Figure 8: This plot shows the behavior of the RMSE in the different estimates for an increasing level of autocorrelation.


The level of autocorrelation has little effect on the Nuclear Norm Regularized estimators, while the Least Squares estimates display an error that increases with the amount of autocovariance. This increase is driven by the standard deviation of the estimates for the constant term, as the estimate of the variable parameter remains very precise. The latter also has a low standard deviation, as is the case for all estimates obtained with the other methods. The Post Estimation converges to the LS level of error within two steps for autocovariance levels up to 0.6. For higher autocovariance these steps seem to converge to a different extreme point: for high values of ρ the third step increases the RMSE compared to the second step. Again this is caused by the standard deviation of the constant term, while the estimate of the variable parameter moves closer to the LS estimate in this third step as well. The detailed results can be found in the appendix.
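For reference, below is a minimal sketch of how an autocorrelated error panel of this kind can be generated, assuming an AR(1) process with coefficient ρ for each individual and standard normal innovations; the function name is illustrative and the exact error design used in the simulations is the one of Section 4.3.

```python
import numpy as np

def ar1_errors(N, T, rho, seed=0):
    """Generate an N x T panel of AR(1) errors e_{it} = rho * e_{i,t-1} + u_{it}.

    Innovations u_{it} are i.i.d. standard normal; the first period is drawn
    from the stationary distribution so the process starts in steady state.
    """
    rng = np.random.default_rng(seed)
    E = np.empty((N, T))
    E[:, 0] = rng.standard_normal(N) / np.sqrt(1.0 - rho**2)
    for t in range(1, T):
        E[:, t] = rho * E[:, t - 1] + rng.standard_normal(N)
    return E

E = ar1_errors(N=100, T=100, rho=0.6)
```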

5.4 Heteroskedastic Errors

[Figure 9 plot — x-axis: Heteroskedasticity parameter alpha; y-axis: RMSE in Estimate; series: POLS, LS, NNmin, NNpen, POST1, POST2, POST3, OSLS]

Figure 9: This plot shows the behavior of the RMSE in the different estimates for an increasing level of heteroskedasticity.


            LS                     POST(3)
α           β̂0        β̂1          β̂0        β̂1
0.1        -0.0021     0.0002     -0.0016     0.0001
           (0.0085)   (0.0025)    (0.0082)   (0.0024)
0.4        -0.0304     0.0107     -0.0083     0.0001
           (0.0396)   (0.0261)    (0.0192)   (0.0047)

Table 2: Estimation results for different levels of heteroskedasticity.

5.5 Autorelated Factor

[Figure 10 plot — x-axis: Autorelation parameter alpha; y-axis: RMSE in Estimate; series: POLS, LS, NNmin, NNpen, POST1, POST2, POST3, OSLS]

Figure 10: This plot shows the behavior of the RMSE in the different estimates for an increasing level of the lag parameter in one factor.

For low levels of the lag parameter α, the error in the Nuclear Norm Regularized estimates is slightly higher than in the baseline case of i.i.d. errors and factors. As α increases, the error in these estimates gradually increases as well, while the error level of the Least Squares and Pooled OLS estimates remains constant. However, the Post Estimation keeps converging to the Least Squares global minimum even for high levels of α. Standard deviations increase slightly for all estimators, but remain very similar across initializations.

5.6 Dynamic Factor Models


[Figure 11 plots — y-axis: RMSE in Estimate; series: POLS, LS, NNmin, NNpen, POST1, POST2, POST3, OSLS]
(a) Increasing χ while φ = 0 (x-axis: parameter chi).
(b) Increasing φ while χ = 0.5 (x-axis: parameter phi).

Figure 11: These plots show the behavior of the RMSE in the different estimates for increasing parameters φ and χ.

The error in the POLS estimates grows at a significantly higher rate than that of the other estimators. At a parameter value of around 0.6 the POLS becomes more erroneous than its static counterpart, while the Nuclear Norm Regularized estimates keep outperforming their static-model counterparts up to very high levels of χ. The Least Squares estimator performs best, and its performance even improves slightly in the parameter. Most notable, however, is the increase in the range of the estimation results for the constant term of the Nuclear Norm Regularized estimators. While their average performance is much better than in the case of static factor models (error < 0.01 compared to error > 0.10), their range widens enormously as the value of χ is increased. The values are summarized in Table 3. The improvement in the estimate thus comes at the cost of an increase in the estimates' volatility. Detailed results including standard deviations are available in the appendix.

                  Mean     Min      Median   Max
NNmin   χ = 0     0.8972   0.8307   0.8975   0.9569
        χ = 0.4   1.0004   0.7678   1.0009   1.2255
        χ = 0.8   1.0021   0.2249   0.9992   1.9319
NNpen   χ = 0     0.8381   0.7761   0.8376   0.8861
        χ = 0.4   1.0016   0.6002   0.9994   1.4528
        χ = 0.8   1.0014   0.0596   0.9990   2.4827

Table 3: Estimation results for the constant term of the Nuclear Norm Regularized estimators for different values of χ.


When the parameter φ is increased while χ is kept constant at 0.5, the same pattern appears, only at a stronger magnitude, as can be seen in Figure 11b and in the results given in the appendix. The standard deviations of the variable parameter estimates β̂1 do not vary across the different parameter initializations and are of low magnitude. Even though the Nuclear Norm Regularized estimates for the constant term display high volatility, for any level of the parameters the estimates converge to the true global minimum within two Post Estimation steps. The results thus indicate that the new estimators can safely be applied to both static and dynamic factor models when combined with the Post Estimation routine.
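To make the role of the two lag parameters concrete, the sketch below generates a persistent factor governed by χ and φ. This is purely illustrative: it assumes an AR(2)-type recursion f_t = χ f_{t-1} + φ f_{t-2} + u_t with standard normal innovations, which may differ from the exact dynamic factor specification of Section 4.6. As χ + φ approaches one the factor becomes highly persistent, which is consistent with the increased volatility of the constant-term estimates reported above.

```python
import numpy as np

def dynamic_factor(T, chi, phi, seed=0):
    """Illustrative persistent factor f_t = chi*f_{t-1} + phi*f_{t-2} + u_t.

    Assumed AR(2)-type recursion for illustration only; the thesis' exact
    dynamic factor design is given in Section 4.6.
    """
    rng = np.random.default_rng(seed)
    f = np.zeros(T)
    for t in range(T):
        f[t] = ((chi * f[t - 1] if t >= 1 else 0.0)
                + (phi * f[t - 2] if t >= 2 else 0.0)
                + rng.standard_normal())
    return f

# Parameter ranges matching Figure 11: chi up to 0.8 with phi = 0,
# and phi up to 0.4 with chi = 0.5.
f_a = dynamic_factor(T=100, chi=0.8, phi=0.0)
f_b = dynamic_factor(T=100, chi=0.5, phi=0.4)
```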

5.7 Lagged Dependent Variable

[Figure 12 plot — x-axis: Value xi; y-axis: RMSE in Estimate; series: POLS, LS, NNmin, NNpen, POST1, POST2, POST3, OSLS]

Figure 12: Plot of the behavior of the RMSE in the different estimates for an increasing level of the parameter ξ.
