The balanced limited information maximum likelihood estimator

Pawel Wilk

Master's Thesis Econometrics, Operations Research & Actuarial Studies
Specialization: Econometrics
Faculty of Economics and Business, University of Groningen

August 10, 2012

Abstract


Contents

Introduction

1 Various ways of obtaining a projection matrix with a balanced diagonal
  1.1 The notation, definitions and characteristics of the algebraic problem
    1.1.1 The notation
    1.1.2 Problem definition
    1.1.3 Derivative of a projection matrix
    1.1.4 The existence of a unique solution of the problem
  1.2 Problem solution
    1.2.1 Algorithms solving the optimization problem
  1.3 Simulations of the solution of the problem
    1.3.1 Consistency of solutions
    1.3.2 The speed of convergence of algorithms

2 Consistent IV estimators under many-instruments asymptotics with heteroskedasticity
  2.1 The model
  2.2 The LIML estimator and some other estimators
    2.2.1 The LIML estimator
    2.2.2 The consistency of the LIML estimator
    2.2.3 HLIM, HFUL and SJEF estimators
    2.2.4 BLIM and FBLIM estimators
    2.2.5 Consistency of the BLIM estimator
  2.3 Monte Carlo Simulations

Introduction

The presence of heteroskedasticity of unknown form is a frequent setting in microeconometric research, especially when many instruments are used to improve efficiency. It is therefore common practice to correct for this heteroskedasticity, and a number of estimators that deal with it can be found in the literature. Among the most important are the continuously updated GMM estimator of Hansen, Heaton, and Yaron (1996), the grouping estimator of Bekker and van der Ploeg (2005), the HLIM and HFUL estimators of Hausman, Newey, Woutersen, Chao, and Swanson (2011) and the SJIVE estimator of Bekker and Crudu (2012). The HLIM, HFUL and SJIVE estimators will be discussed in this thesis.

However, a considerable part of this thesis will be devoted to the LIML estimator, which is commonly called the limited information maximum likelihood or LIML estimator. The term limited information refers to the context of simultaneous equations, where only prior restrictions on one equation are used, i.e. no use is made of overidentifying or cross-equation restrictions that may apply to the parameter matrix. As shown by Bekker and van der Ploeg (2005), LIML is generally inconsistent under many-instruments asymptotics with heteroskedasticity. However, Bekker and van der Ploeg (2005) pointed out that in a special case LIML can be consistent. This special case occurs when the projection matrix used in the LIML formulation, which projects the endogenous observations on the space spanned by the instruments, has equal diagonal elements.


Hausman et al. (2011) and Bekker and Crudu (2012).


1 Various ways of obtaining a projection matrix with a balanced diagonal

In this section I will describe and solve an algebraic problem, which will be applied in the next section to an instrumental variable estimator under heteroskedasticity. This algebraic problem amounts to obtaining a projection matrix with equal diagonal elements. Such a projection matrix will be called a projection matrix with a balanced diagonal. In order to define this problem, I will introduce some assumptions about its structure, which are necessary to solve it. As will be shown, this algebraic problem can be reformulated as an optimization problem. This reformulation will then be used to design a few algorithms that solve the problem. To check whether these algorithms give the desired results, they will be implemented and some simulations of the solutions will be provided.


1.1 The notation, definitions and characteristics of the algebraic problem

1.1.1 The notation

Let me start by defining the notation and a few basic operations which will be used throughout this section. Consider a nonzero, n × n, full column rank (f.c.r.) matrix D whose diagonal elements are given by an n-vector d with positive entries. Then the operation of extracting the diagonal elements of a matrix D into a vector d is diag(D) = d, and the operation of creating the diagonal matrix D with the elements of an n-vector d on its diagonal is Diag(d) = D. Also denote the matrix direct sum by ⊕_{i=1}^{n} A_i, which is a block diagonal matrix with the blocks A_1, ..., A_n on the diagonal, and let ◦ denote the elementwise (Hadamard) product. Furthermore, let ı_n be the n × 1 vector of ones, I_n the n × n identity matrix and E_ii the n × n matrix whose i-th diagonal entry equals 1 and whose other entries equal 0.

1.1.2 Problem definition

In order to present the problem, I will first introduce the definition of a projection matrix with a balanced diagonal, where a balanced diagonal means a diagonal with equal elements. Therefore, let Z be a nonrandom, n × k, f.c.r. matrix and let P_DZ = DZ(Z'D²Z)^{-1}Z'D be the matrix of the projection onto the space spanned by the columns of the matrix DZ. For the sake of convenience, let p(d) = diag(P_DZ).

Definition 1. A projection matrix P_DZ which satisfies p(d) = (k/n) ı_n will be called a projection matrix with a balanced diagonal.

Notice that Definition 1 states that, for any projection matrix with a balanced diagonal, the diagonal elements are equal to k/n. This is because tr(P_DZ) = k, and since all elements of the diagonal of P_DZ have to be equal, each of them must equal k/n.


Having this definition, the algebraic problem that is the subject of this section can be introduced. Let Z be defined as above; the task is then to find a matrix D such that the projection matrix P_DZ is a projection matrix with a balanced diagonal. In other words, one is looking for a matrix D which scales the rows of Z in such a way that the projection matrix P_DZ has equal diagonal elements.
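The quantities defined above are easy to compute numerically. The following sketch (my own illustration in Python/NumPy, not part of the thesis, whose implementation was in Matlab; the helper name p_of_d is mine) evaluates p(d) = diag(P_DZ) for a given scaling vector d without forming the full n × n projection matrix.

```python
# A minimal sketch of the quantities defined above:
# p(d) = diag(P_DZ) with P_DZ = D Z (Z'D^2 Z)^{-1} Z' D, where D = Diag(d).
import numpy as np

def p_of_d(Z, d):
    """Diagonal of the projection matrix P_DZ for a given scaling vector d."""
    DZ = d[:, None] * Z                           # D Z, with D = Diag(d)
    G_inv = np.linalg.inv(DZ.T @ DZ)              # (Z' D^2 Z)^{-1}
    # diag(P_DZ) without forming the full n x n matrix
    return np.einsum('ij,jk,ik->i', DZ, G_inv, DZ)

# Example (sizes and seed are arbitrary choices): for d = vector of ones,
# p(d) is simply the leverage of each row of Z; the goal of this section is
# to find d such that p(d) = (k/n) * ones.
rng = np.random.default_rng(0)
n, k = 30, 10
Z = rng.normal(100, 15, size=(n, k))
d = np.ones(n)
print(p_of_d(Z, d).sum())                          # equals k = tr(P_DZ)
```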

Notice that the problem has no solution if the matrix Z is a block diagonal matrix and the ratio of the number of columns to the number of rows differs between the blocks, as illustrated in the sketch after this paragraph. To observe this, assume that the matrix Z is a block diagonal matrix, which can be described in the following way

Z = Z_{n1,k1} ⊕ Z_{n2,k2} ⊕ ... ⊕ Z_{nl,kl} = ⊕_{i=1}^{l} Z_{ni,ki},

where n_i and k_i denote the number of rows and columns of each block Z_{ni,ki}, and l is the number of blocks on the diagonal of the matrix Z, such that Σ_{i=1}^{l} n_i = n and Σ_{i=1}^{l} k_i = k. Then the projection P_DZ is also a block diagonal matrix and can also be expressed as a matrix direct sum

P_DZ = P_{DZ_{n1,k1}} ⊕ P_{DZ_{n2,k2}} ⊕ ... ⊕ P_{DZ_{nl,kl}} = ⊕_{i=1}^{l} P_{DZ_{ni,ki}},

and therefore, assuming that all P_{DZ_{ni,ki}} are already diagonally balanced, the diagonal of P_DZ is

p(d) = diag(P_DZ) = diag( ⊕_{i=1}^{l} P_{DZ_{ni,ki}} ) = ( diag(P_{DZ_{n1,k1}})', ..., diag(P_{DZ_{nl,kl}})' )' = ( (k1/n1) ı_{n1}', ..., (kl/nl) ı_{nl}' )'.

Then the matrix P_DZ can be diagonally balanced only if k1/n1 = k2/n2 = ... = kl/nl = k/n. Therefore, the problem cannot be solved when the ratio of the number of columns to the number of rows of the i-th block, i.e. k_i/n_i, differs from this ratio for the whole block diagonal matrix, i.e. k/n.
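A small numerical illustration of this argument (my own sketch, not from the thesis; the block sizes and seed are arbitrary): for a block diagonal Z the projection P_DZ is block diagonal as well, so the diagonal elements within block i always sum to k_i whatever d is, and a balanced diagonal is therefore impossible unless every k_i/n_i equals k/n.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(0)
Z1 = rng.normal(100, 15, size=(20, 5))     # k1/n1 = 1/4
Z2 = rng.normal(100, 15, size=(30, 5))     # k2/n2 = 1/6 (different ratio)
Z = block_diag(Z1, Z2)                     # 50 x 10 block diagonal Z, k/n = 1/5

d = rng.uniform(0.5, 1.5, size=50)         # any positive scaling vector
DZ = d[:, None] * Z
P = DZ @ np.linalg.solve(DZ.T @ DZ, DZ.T)  # P_DZ
p = np.diag(P)
# Each block keeps its own trace k_i = 5, so the 20 elements of the first
# block average 1/4 and the 30 elements of the second block average 1/6;
# they can never all equal k/n = 1/5.
print(p[:20].sum(), p[20:].sum())
```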

1.1.3 Derivative of a projection matrix

To formulate algorithms for the problem, it will be useful to derive the derivative of P_DZ with respect to an element d_i of d. It should be clear that, if P_DZ is considered as a function of d_i, this function is continuous, since d_i is assumed to be positive and (Z'D²Z) is nonsingular. (In general this function need not be defined everywhere: if the matrix (Z'D²Z) is close to singular, a change in the value of d_i can make (Z'D²Z) singular, in which case P_DZ is not defined. Such a situation is, however, very rare.) The derivative is as follows

∂P_DZ/∂d_i = ∂[ DZ(Z'D²Z)^{-1}Z'D ]/∂d_i
           = E_ii Z(Z'D²Z)^{-1}Z'D + DZ ∂[ (Z'D²Z)^{-1}Z'D ]/∂d_i
           = E_ii Z(Z'D²Z)^{-1}Z'D + DZ(Z'D²Z)^{-1}Z'E_ii + DZ [ ∂(Z'D²Z)^{-1}/∂d_i ] Z'D
           = E_ii Z(Z'D²Z)^{-1}Z'D + DZ(Z'D²Z)^{-1}Z'E_ii − 2DZ(Z'D²Z)^{-1}Z'D E_ii Z(Z'D²Z)^{-1}Z'D
           = DZ(Z'D²Z)^{-1}Z'E_ii + (I_n − 2DZ(Z'D²Z)^{-1}Z'D) E_ii Z(Z'D²Z)^{-1}Z'D
           = DZ(Z'D²Z)^{-1}Z'E_ii + (I_n − 2P_DZ) E_ii Z(Z'D²Z)^{-1}Z'D.   (1)

From this equation it is obvious that ∂P_DZ/∂d_i exists for every d_i > 0, as long as (Z'D²Z) is nonsingular. Also notice that, since the i-th diagonal element of P_DZ is smaller than or equal to one, i.e. {P_DZ}_ii ≤ 1, it holds that {∂P_DZ/∂d_i}_ii ≥ 0. That is to say,

{E_ii Z(Z'D²Z)^{-1}Z'D + DZ(Z'D²Z)^{-1}Z'E_ii}_ii = 2{E_ii Z(Z'D²Z)^{-1}Z'D}_ii ≥ 2{P_DZ E_ii Z(Z'D²Z)^{-1}Z'D}_ii,

and therefore

{∂P_DZ/∂d_i}_ii = {DZ(Z'D²Z)^{-1}Z'E_ii + (I_n − 2P_DZ)E_ii Z(Z'D²Z)^{-1}Z'D}_ii
               = {2DZ(Z'D²Z)^{-1}Z'E_ii − 2P_DZ E_ii Z(Z'D²Z)^{-1}Z'D}_ii ≥ 0.   (2)

Result (2) is quite relevant and will be used to formulate solutions of the problem, mostly because it gives information about the change of the i-th element of


p(d) when the value of d_i is changed, so that the direction of the change is known without the need to compute the value of the derivative. More precisely, it tells us that when the element d_i is increased, the value of the i-th element of p(d) also increases. This knowledge is quite important, since the computation of many derivatives, which for large n is time consuming, can be avoided in cases when one is only interested in the sign of the derivative.

Additionally, there is one more helpful feature that can be observed. Notice that since tr(P_DZ) = k and {∂P_DZ/∂d_i}_ii > 0, there exists at least one j ∈ {1, ..., n}, j ≠ i, such that {∂P_DZ/∂d_i}_jj < 0. This means that by increasing the i-th element of d, there is at least one element of the diagonal of the matrix P_DZ, say p(d)_j, whose value will be decreased.
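Results (1) and (2) can be checked numerically. The sketch below (my own, not from the thesis; it uses a finite-difference stand-in for the analytic expression (1)) verifies that increasing d_i increases {P_DZ}_ii while the trace stays equal to k, so at least one other diagonal element must fall.

```python
# Finite-difference sanity check of result (2): the own diagonal element
# increases with d_i, and the trace of P_DZ is constant, so the sum of all
# diagonal derivatives is (numerically) zero.
import numpy as np

def p_of_d(Z, d):
    DZ = d[:, None] * Z
    return np.einsum('ij,jk,ik->i', DZ, np.linalg.inv(DZ.T @ DZ), DZ)

rng = np.random.default_rng(1)
n, k, i, h = 30, 10, 4, 1e-6
Z = rng.normal(100, 15, size=(n, k))
d = rng.uniform(0.5, 1.5, size=n)

d_plus = d.copy(); d_plus[i] += h
grad = (p_of_d(Z, d_plus) - p_of_d(Z, d)) / h   # j-th entry ~ {dP_DZ/dd_i}_jj
print(grad[i] >= 0)                             # own diagonal element increases
print(np.isclose(grad.sum(), 0, atol=1e-4))     # trace constant, so others fall
```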

1.1.4 The existence of a unique solution of the problem

An important issue related to this problem, which can be helpful in solving it, is the question whether or not the problem has a unique solution. In general, it might be easier to find a solution to a problem when there are many equivalent solutions than when only one unique solution exists. Therefore, it would also be easier to solve our problem if information about the existence of many solutions could be used, and in particular the knowledge of the relationship between these solutions. As I will show, in the case of our problem infinitely many equivalent solutions exist. To obtain this result, I will show that for the projection matrix P_DZ there exist infinitely many D for which

p(d) = (k/n) ı_n.

To show this, assume that for D_α the projection matrix P_{D_αZ} is a projection matrix with a balanced diagonal, i.e. p(d_α) = diag(P_{D_αZ}) = (k/n) ı_n. Then any matrix D_γ = γD_α, where γ ∈ R₊, also gives a projection matrix P_{D_γZ} with a balanced diagonal, since the scalar γ cancels, P_{D_γZ} = γD_αZ(γ²Z'D_α²Z)^{-1}Z'D_αγ = P_{D_αZ}.


Therefore, it can easily be observed that the value of the i-th element of diag(D_γ) is not important in itself, as long as the ratio of the i-th element of diag(D_γ) to every other element of diag(D_γ) is the same as the corresponding ratio in the solution D_α. Consequently, any scalar multiple of a vector d which is a solution is also a solution. This observation will be used later, in one of the solutions (Algorithm 4).

1.2 Problem solution

So far I have presented the definition of the problem and some of its basic features. In this section I will describe five methods of obtaining a matrix D for which P_DZ is a projection matrix with a balanced diagonal. All of these methods give the solution, however, as will be shown in Section 1.3, the speed of convergence differs strongly between them.

The first thing one should notice, in order to solve this problem, is that it can be written as a system of nonlinear equations whose solution should satisfy p(d) = (k/n)ı_n. Therefore, let me describe this system of nonlinear equations in the following way

d = argsolve_d { p(d) = (k/n) ı_n },

and let us assume that the solution of this problem exists. Then, in order to solve this system, I will minimize the difference between the elements of p(d) and the desired value of a diagonal element, k/n, which can be described in terms of an optimization problem. Therefore, to find d, it is sufficient to solve the corresponding minimization problem

d = argmin_d ( p(d) − (k/n)ı_n )' ( p(d) − (k/n)ı_n )   (3)

or equivalently

d = argmin_d ( p(d) )' ( p(d) ).   (4)

Both (3) and (4) are minimized for p(d) = (k/n)ı_n. In case of (3) the minimum value is 0 and for (4) the minimum value is equal to k²/n.


Using these formulations, it is possible to look for a feasible solution which minimizes these objective functions and is thus a solution of the minimization problems. Therefore, let S1(d) denote the objective function corresponding to (3),

S1(d) = ( p(d) − (k/n)ı_n )' ( p(d) − (k/n)ı_n ),   (5)

and S2(d) the objective function which corresponds to (4),

S2(d) = ( p(d) )' ( p(d) ).   (6)

It is straightforward to notice that the minimal values of S1(d) and S2(d) are obtained for p(d) = (k/n)ı_n. This is because S1(d) is a sum of squares of differences between the elements of p(d) and k/n, and S2(d) is a sum of squares of the elements of p(d).

Both of these objective functions can equivalently be used in algorithms solving the problem, and either the minimization of S1(d) or of S2(d) will give the same results. However, to make the presentation of the algorithms consistent, I will use only the function S1(d).

1.2.1 Algorithms solving the optimization problem

Most of the algorithms below share a common structure; in the case of Algorithm 4, this framework will be slightly modified.

General framework of algorithms

In this subsection I will present the general framework of the algorithms, which gives the structure of Algorithms 1, 2, 3 and 5; a sketch of the resulting loop follows this description. Consider the following procedure, which will be referred to as the General Algorithm.

Let G(x) be a differentiable function from R^n to R^n for which there exists a solution of the following problem

x_0 = argsolve_x { G(x) = y },   (7)

where x is an n × 1 vector of variables and y is an n × 1 vector of values. Then let me define the objective function F(x), given by

F(x) = (G(x) − y)'(G(x) − y),   (8)

which will be minimized. Notice that the minimum of (8) is equal to 0 and that the vector x for which this minimal value is obtained is a solution of (7). Additionally, notice that the function F(x) is just a sum of squares of the differences between the value of the function G(x) and the desired value y. Therefore, the objective function F(x) is differentiable with respect to x, since G(x) is differentiable.

To minimize the sum of squares given by (8), the vector x is updated in each iteration to the new vector x + δ, where the n × 1 vector δ is the updating vector. This updating vector δ gives both the direction of minimization and the magnitude of the update, so that in each iteration the value of the objective function F(x) decreases after updating the vector x. This updating procedure is repeated until the value of the objective function is small enough, i.e. until

F(x) = (y − G(x))'(y − G(x)) < ε,

where ε is a chosen small positive number.
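As a summary of the framework, the following sketch (my own reading of the General Algorithm; the names general_algorithm and make_delta are mine) shows the generic loop. Algorithms 1, 2, 3 and 5 below only differ in how the updating vector δ is produced.

```python
# Generic driver for the General Algorithm: iterate x <- x + delta until
# F(x) = (G(x) - y)'(G(x) - y) < eps; make_delta(x) supplies the update rule.
import numpy as np

def general_algorithm(G, y, x0, make_delta, eps=1e-10, max_iter=10_000):
    """Iterate x <- x + make_delta(x) until the sum of squares is below eps."""
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        r = G(x) - y                  # residual G(x) - y
        if r @ r < eps:               # F(x) < eps: stop
            break
        x = x + make_delta(x)         # update in the direction given by delta
    return x
```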


Algorithm 1

The first algorithm is based on the well-known gradient descent method (Snyman, 2005), where one is minimizing or looking for a specific value of an objective function. The procedure of this algorithm is as in the General Algorithm, with a specific way of determining the vector δ by means of the gradient descent method. For the sake of convenience, let me denote the number of the current iteration of an algorithm by l; for example, x^l denotes the vector x in the l-th iteration of an algorithm.

As already mentioned, let the whole procedure of Algorithm 1 be as in the General Algorithm. Then the only remaining task is to determine the updating vector δ^l, which updates the vector x^l in each iteration. To do so, first observe that the objective function F(x^l) is differentiable. Thus, from any point x^l, the greatest decrease of the value of F(x^l) is obtained when x^l is updated in the direction opposite to the gradient of F(x^l). Therefore, let the gradient of the function F(x^l) be denoted by ∂F(x^l)/∂x, which is an n × 1 vector. Then the updating vector from the point x^l to x^{l+1}, δ^l, can be given as follows

δ^l = −γ^l ◦ ∂F(x^l)/∂x,   (9)

where ◦ denotes the Hadamard product, γ^l is an n × 1 vector of parameters responsible for the size of the update, and the gradient ∂F(x^l)/∂x gives the direction of updating. One should note that the i-th element of δ^l is just the product of the i-th element of γ^l, which scales the size of the update, and the i-th element of −∂F(x^l)/∂x, which gives its direction. It is reasonable to let γ^l depend on the difference between the actual value of G(x^l) and the vector y, since this adjusts the size of the update to the distance between the actual solution and the desired one. Thus, let γ^l depend linearly on the difference between G(x^l) and y and let it be given by

γ^l = α (G(x^l) − y),   (10)

where α > 0 is an arbitrarily chosen scalar, constant between iterations. Then, following the General Algorithm, the updating procedure is given by

x^{l+1} = x^l + δ^l.   (11)

After updating the vector x^l to the vector x^{l+1}, the new value of F(x^{l+1}) is computed. Then, if the condition

F(x^{l+1}) < ε,   (12)

is not fulfilled, the updating procedure is repeated with the new values δ^{l+1} and x^{l+1}. Of course, the new updating vector δ^{l+1} is computed as before, i.e. as given in (9). The final result is obtained when condition (12) is fulfilled. However, in general this procedure does not guarantee that the global minimum of F(x) will be obtained, since the algorithm can get stuck in a local minimum. This can happen when the value of ∂F(x^l)/∂x is zero at a point different from the global minimum.

Application of Algorithm 1

In order to solve the main problem I will apply the above algorithm to the minimization problem given by (3). Therefore, as the objective function F(x) I use the function S1(d), i.e. x = d and F(x) = S1(d). As pointed out before, S1(d) is minimized when p(d) = (k/n)ı_n and its minimum is 0, so the minimization of S1(d) gives the desired solution of the problem. Also, let G(x) = p(d) and y = (k/n)ı_n. Then, by applying these values, the general minimization problem given by (7) can be specified as

d_0 = argsolve_d { p(d) = (k/n) ı_n }.

To solve it, let me use the following updating procedure, which is based on (11),

d^{l+1} = d^l + δ^l.   (13)

The updating vector δ^l is constructed as in (9),

δ^l = −γ^l ◦ ∂S1(d^l)/∂d,

where ∂S1(d^l)/∂d is the derivative of the objective function with respect to d, i.e.

∂S1(d^l)/∂d = 2 J^{l'} ( p(d^l) − (k/n)ı_n ),

and where J^l is the n × n Jacobian matrix of the function p(d^l), whose ji-th element is the j-th diagonal element of the matrix ∂P_DZ/∂d_i defined in (1), i.e. the matrix J is as follows

J = ∂p(d)/∂d =
[ {∂P_DZ/∂d_1}_11   {∂P_DZ/∂d_2}_11   ...   {∂P_DZ/∂d_n}_11 ]
[       ...                ...                    ...        ]
[ {∂P_DZ/∂d_1}_nn   {∂P_DZ/∂d_2}_nn   ...   {∂P_DZ/∂d_n}_nn ].

The vector γ^l, which scales the size of the update, corresponds to (10) and is given by

γ^l = α ( p(d^l) − (k/n)ı_n ),

where α ∈ R₊ is some arbitrarily chosen constant. Notice that the i-th element of γ^l depends linearly on the difference between the i-th element of p(d^l) and the desired value k/n. Thus γ^l scales the influence of the gradient of S1(d) in the updating of d. When this difference is big, the value of d changes significantly in the direction opposite to the gradient, while for a small difference the changes of d are small. Therefore, when the actual d is far from the point where it minimizes S1(d), the change of its value is large, which increases the speed of convergence; when d is close to this minimizing point, the update is small, which makes the step more precise and which is good in the final iterations of the algorithm.

Finally, the updating procedure given by (13) can be rewritten as

d^{l+1} = d^l − γ^l ◦ ∂S1(d^l)/∂d = d^l − α ( p(d^l) − (k/n)ı_n ) ◦ ∂S1(d^l)/∂d.

The iterations are repeated until the value of the objective function (5) satisfies

S1(d^l) = ( p(d^l) − (k/n)ı_n )' ( p(d^l) − (k/n)ı_n ) < ε,

where ε ∈ R₊ is chosen as a small number. This algorithm stops in two situations: when the gradient ∂S1(d^l)/∂d or the vector p(d^l) − (k/n)ı_n converges to the vector of zeros, or when at least one of them is close to a vector of zeros. When the gradient ∂S1(d^l)/∂d becomes the vector of zeros, it means that there is no direction from the point d in which the function S1(d) could be further minimized. Therefore, if simultaneously the vector p(d^l) − (k/n)ı_n is also close to a vector of zeros, the global minimum of the function S1(d) has been obtained. Otherwise, when p(d^l) − (k/n)ı_n is significantly different from a vector of zeros, i.e. S1(d^l) > ε, the algorithm has found a local minimum of S1(d) different from the global minimum. In that case the computations could be repeated with a different starting value. In practice, however, I did not encounter such a case.
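A sketch of Algorithm 1 applied to the problem is given below (my own Python illustration, not the thesis's Matlab code). The Jacobian of p(d) is approximated by finite differences instead of the analytic expression (1), the scaling vector uses |p(d) − (k/n)ı_n| so that the step always opposes the gradient, and the values of α, ε and the iteration cap are my choices.

```python
# Gradient descent (Algorithm 1) with step size gamma proportional to the
# distance |p(d) - k/n|; as the thesis reports, this method converges slowly.
import numpy as np

def p_of_d(Z, d):
    DZ = d[:, None] * Z
    return np.einsum('ij,jk,ik->i', DZ, np.linalg.inv(DZ.T @ DZ), DZ)

def jacobian(Z, d, h=1e-6):
    """Numerical stand-in for (1): J[j, i] ~ d{P_DZ}_jj / d d_i."""
    base = p_of_d(Z, d)
    J = np.empty((d.size, d.size))
    for i in range(d.size):
        dp = d.copy(); dp[i] += h
        J[:, i] = (p_of_d(Z, dp) - base) / h
    return J

def algorithm1(Z, alpha=1.0, eps=1e-6, max_iter=20_000):
    n, k = Z.shape
    d = np.full(n, np.sqrt(k / n))           # common starting value
    target = np.full(n, k / n)
    for _ in range(max_iter):
        r = p_of_d(Z, d) - target
        if r @ r < eps:                       # stopping rule S1(d) < eps
            break
        grad = 2.0 * jacobian(Z, d).T @ r     # dS1/dd
        d = d - alpha * np.abs(r) * grad      # delta = -gamma o grad
    return d

rng = np.random.default_rng(2)
Z = rng.normal(100, 15, size=(30, 10))
d = algorithm1(Z)
print(f"final S1(d) = {((p_of_d(Z, d) - 1/3) ** 2).sum():.2e}")
```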

Algorithm 2

This algorithm is based on the Levenberg-Marquardt algorithm (Madsen and Nielsen, 2004), which is an iterative minimization procedure combining the Gauss-Newton method and the gradient descent method already used in Algorithm 1. As in the previous algorithm, I will use the General Algorithm as a framework, and the notation will also be as in the General Algorithm. The updating procedure in the General Algorithm is based on changing the value of x^l by adding the vector δ^l, so let me show how δ^l is determined in every step of Algorithm 2. To calculate δ^l, first I use a linear approximation of the function G(x^l) at the argument x^l + δ^l,

G(x^l + δ^l) ≈ G(x^l) + J^l δ^l,   (14)

where J^l = ∂G(x^l)/∂x is the n × n Jacobian matrix of the function G(x). By applying this approximation to (8), one obtains an approximated sum of squares for the argument x^l + δ^l, namely the approximation of the objective function F(x^l + δ^l),

F(x^l + δ^l) ≈ (y − G(x^l) − J^l δ^l)'(y − G(x^l) − J^l δ^l).   (15)

As before, the goal of this algorithm is to minimize the objective function F(x^l), and it is therefore convenient to use the approximation of F(x^l) given by (15) to find this minimum. First, one should notice that if F(x^l) is minimized, then the gradient ∂F(x^l)/∂δ = 0·ı_n, since there is no direction in which the function F(x^l) can be further minimized. The same logic can be applied to the approximation of F(x^l). Therefore, taking the derivative of (15), which is

∂F(x^l)/∂δ = −2 J^{l'} (y − G(x^l) − J^l δ^l),


and setting it to the vector of zeros, the following result is obtained

J^{l'} (y − G(x^l)) = J^{l'} J^l δ^l.   (16)

Expression (16) is just a system of linear equations in δ^l, from which the solution can easily be obtained. This vector δ^l gives the size and the direction of the update, and (15) would then be at its minimum.

However, instead of using (16), the Levenberg-Marquardt algorithm uses

J^{l'} (y − G(x^l)) = (J^{l'} J^l + λ^l I_n) δ^l,

or equivalently

δ^l = (J^{l'} J^l + λ^l I_n)^{-1} J^{l'} (y − G(x^l)),   (17)

where λ^l is a 'damping factor' that is adjusted in each iteration. By introducing this damping factor λ, the algorithm can be brought close to the Gauss-Newton algorithm or to the gradient descent algorithm, depending on the value of λ. First notice that for every λ^l > 0 the matrix (J^{l'} J^l + λ^l I_n) is positive definite, so δ^l gives a descent direction. The value of the parameter λ^l is updated in each iteration; it depends on the size of the decrease of the function F(x^l) in the last step,

λ^l = ω(F(x^{l−1}) − F(x^l)) = ω(F(x^{l−1}) − F(x^{l−1} + δ^{l−1})),   (18)

where ω ∈ R₊ is some chosen parameter, held constant between iterations of the algorithm. If the difference F(x^{l−1}) − F(x^l) is large, then (18) gives a large λ^l and the matrix (J^{l'} J^l + λ^l I_n) is nearly diagonal. Thus the resulting δ^l is close to the steepest descent direction, which is good if the solution is far from the minimum. For small values of F(x^{l−1}) − F(x^l), λ^l is small, which gives a δ^l similar to that obtained by the Gauss-Newton method and thus a good step in the final iterations of the algorithm.

As in the General Algorithm, the updating procedure

x^{l+1} = x^l + δ^l

is repeated until the value of the objective function is small enough, i.e. F(x^l) < ε.

Application of Algorithm 2

In order to apply the General Algorithm with the updating procedure for δ^l as given in (17), one sets F(x) = S1(d) and G(x) = p(d), where x = d and y = (k/n)ı_n. The final formulation of Algorithm 2 applied to the problem is then as follows. To find a solution of

d = argsolve_d { p(d) = (k/n) ı_n },

d is updated in every iteration. First, let λ^l be given by

λ^l = ω(S1(d^{l−1}) − S1(d^l)),

where the value ω > 0 is arbitrarily chosen. For the l-th step, let the new value δ^l be given as in (17),

δ^l = (J^{l'} J^l + λ^l I_n)^{-1} J^{l'} ( (k/n)ı_n − p(d^l) ),

where the Jacobian J^l is as follows

J = ∂p(d)/∂d =
[ {∂P_DZ/∂d_1}_11   {∂P_DZ/∂d_2}_11   ...   {∂P_DZ/∂d_n}_11 ]
[       ...                ...                    ...        ]
[ {∂P_DZ/∂d_1}_nn   {∂P_DZ/∂d_2}_nn   ...   {∂P_DZ/∂d_n}_nn ].

The ji-th entry of the Jacobian, {∂P_DZ/∂d_i}_jj, is the j-th diagonal element of the matrix ∂P_DZ/∂d_i given by (1), i.e.

{∂P_DZ/∂d_i}_jj = {DZ(Z'D²Z)^{-1}Z'E_ii + (I_n − 2P_DZ)E_ii Z(Z'D²Z)^{-1}Z'D}_jj.

Then the updating procedure of the vector d^l is given by

d^{l+1} = d^l + δ^l.
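The following sketch (my own, not the thesis's code) implements the Levenberg-Marquardt update (17) with the damping rule (18) for this problem. The finite-difference Jacobian, the initial damping value, the lower clamp on λ and the choices of ω and ε are mine.

```python
# Levenberg-Marquardt (Algorithm 2) for p(d) = (k/n) * ones.
import numpy as np

def p_of_d(Z, d):
    DZ = d[:, None] * Z
    return np.einsum('ij,jk,ik->i', DZ, np.linalg.inv(DZ.T @ DZ), DZ)

def jacobian(Z, d, h=1e-6):
    base = p_of_d(Z, d)
    J = np.empty((d.size, d.size))
    for i in range(d.size):
        dp = d.copy(); dp[i] += h
        J[:, i] = (p_of_d(Z, dp) - base) / h
    return J

def algorithm2(Z, omega=1.0, eps=1e-6, max_iter=500):
    n, k = Z.shape
    d = np.full(n, np.sqrt(k / n))
    y = np.full(n, k / n)
    lam, S_prev = 1.0, np.inf                       # initial damping (my choice)
    for _ in range(max_iter):
        r = y - p_of_d(Z, d)                        # y - G(x)
        S = r @ r
        if S < eps:
            break
        if np.isfinite(S_prev):                     # rule (18), clamped below
            lam = max(omega * (S_prev - S), 1e-12)
        S_prev = S
        J = jacobian(Z, d)
        delta = np.linalg.solve(J.T @ J + lam * np.eye(n), J.T @ r)  # (17)
        d = d + delta
    return d

rng = np.random.default_rng(3)
Z = rng.normal(100, 15, size=(30, 10))
d = algorithm2(Z)
print(np.max(np.abs(p_of_d(Z, d) - 10 / 30)))       # max deviation from k/n
```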


Algorithm 3

In Algorithm 2 I used (14) to obtain a linear approximation of the function G(x^l) at the argument x^l + δ^l. However, since the J^l used in (14) is an n × n matrix, calculating it in every iteration of an algorithm can be very time consuming. Therefore, it could be useful to use an approximation that is faster to compute, although such an approximation will probably be less accurate.

Let the notation be as in the General Algorithm and Algorithm 2. Recall F(x^l) as given by (8) and, as before, notice that F(x^l) is minimized when y = G(x^l).

In order to construct an alternative approximation of G(x^l), let me first introduce an n × 1 vector c^l, given by

c^l = diag(J^l),

where J^l is the Jacobian matrix of the function G(x^l). Notice that c^l is much faster to calculate, since it has n(n − 1) fewer elements than the matrix J^l. However, it also contains much less information than the matrix J^l, since the entries which are not on the diagonal of J^l are omitted.

Now, let the approximation of the function G(x^l) at the argument x^l + δ^l, which uses the vector c^l, be given in the following way

G(x^l + δ^l) ≈ G(x^l) + c^l ◦ δ^l.   (19)

The difference between this approximation and the one given by (14) is the use of c^l ◦ δ^l instead of J^l δ^l. As mentioned before, the vector c^l contains no values other than those on the diagonal of J^l. Therefore, if the elements in the i-th row and j-th column of J^l, i.e. {J^l}_ij with i ≠ j, are significantly different from zero, then the approximation of the value of G(x^l + δ^l) given by (19) can be distinctly different from the value given by (14). However, if {J^l}_ij is small, this approximation should be close to the one obtained by (14). In particular, when {J^l}_ij = 0 for every i ≠ j, the two approximations coincide.

Having the approximation given by (19), I will use a similar procedure as in Algorithm 2 to derive an adequate vector δ^l, which updates the vector x^l in each iteration. Therefore, let the algorithm solve the following problem in each iteration

δ^l = argsolve_δ { G(x^l) + c^l ◦ δ^l = y }.   (20)

The solution of (20) is straightforward and given by

δ^l = ( y − G(x^l) ) ◦ h^l,

where h^l is an n-vector whose i-th element is the inverse of the i-th element of c^l, i.e. h^l_i = (c^l_i)^{-1}. Using this δ^l, the updating of x^l is

x^{l+1} = x^l + δ^l,

and this updating should be iterated until the desired level of accuracy is reached, i.e. F(x^l) < ε.

Application of Algorithm 3

In order to apply this algorithm to the problem, set F(x^l) = S1(d^l) given by (5), G(x^l) = p(d^l) and y = (k/n)ı_n. First, let me denote c^l as

c^l = diag(J^l) = ( {∂P_DZ/∂d_1}_11, ..., {∂P_DZ/∂d_n}_nn )',

where, as before, {∂P_DZ/∂d_i}_ii is the i-th element of the diagonal of the matrix ∂P_DZ/∂d_i.

I already noticed that the vector c^l is much faster to calculate than the Jacobian of p(d). However, it includes much less information, since the values of {∂P_DZ/∂d_i}_jj for j ≠ i are omitted. Therefore, c^l does not give any information on how the i-th diagonal element of P_DZ changes when the j-th element of d^l is changed.

Having these values, the procedure of Algorithm 3 is as follows. In each iteration the algorithm solves the problem

δ^l = argsolve_δ { p(d^l) + c^l ◦ δ^l = (k/n)ı_n },

for which the solution is

δ^l = ( (k/n)ı_n − p(d^l) ) ◦ h^l,

where the i-th element of h^l is equal to the inverse of the i-th element of c^l, i.e. h^l_i = ({∂P_DZ/∂d_i}_ii)^{-1}. Then the value of the vector d^l is updated in the following way

d^{l+1} = d^l + δ^l.

As in the previous algorithms, the procedure of updating the vector d^l should be repeated until S1(d^l) < ε.
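A sketch of Algorithm 3 (my own, not the thesis's code): only the diagonal of the Jacobian, c_i = {∂P_DZ/∂d_i}_ii, is computed (here by finite differences rather than expression (1)), and the update is δ = ((k/n)ı_n − p(d)) ◦ h with h_i = 1/c_i. The tolerance and iteration cap are my choices.

```python
# Algorithm 3: diagonal-Jacobian (Jacobi-style) update.
import numpy as np

def p_of_d(Z, d):
    DZ = d[:, None] * Z
    return np.einsum('ij,jk,ik->i', DZ, np.linalg.inv(DZ.T @ DZ), DZ)

def jac_diag(Z, d, h=1e-6):
    """c_i = {dP_DZ/dd_i}_ii, approximated one coordinate at a time."""
    base = p_of_d(Z, d)
    c = np.empty(d.size)
    for i in range(d.size):
        dp = d.copy(); dp[i] += h
        c[i] = (p_of_d(Z, dp)[i] - base[i]) / h
    return c

def algorithm3(Z, eps=1e-6, max_iter=1000):
    n, k = Z.shape
    d = np.full(n, np.sqrt(k / n))
    y = np.full(n, k / n)
    for _ in range(max_iter):
        r = y - p_of_d(Z, d)
        if r @ r < eps:
            break
        d = d + r / jac_diag(Z, d)     # delta = (y - p(d)) o h, h_i = 1/c_i
    return d

rng = np.random.default_rng(4)
Z = rng.normal(100, 15, size=(30, 10))
print(np.round(p_of_d(Z, algorithm3(Z)), 3))
```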

Algorithm 4

In contrast to the previous algorithms, the procedure of updating the vector d^l in Algorithm 4 differs slightly from the procedure given by the General Algorithm. In this case, instead of adding a vector δ^l to the vector x^l, where δ^l is the direction of minimization of F(x^l), Algorithm 4 increases the values of x^l until the function G(x^l) ≈ y. Since this algorithm has been created specifically for this problem, I will not present a general approach as previously, but will directly provide its application.

This algorithm is based on the observation that for the projection matrix P_DZ = DZ(Z'D²Z)^{-1}Z'D there exist infinitely many D for which the diagonal of P_DZ, p(d), fulfills

p(d) = (k/n) ı_n.

This fact was already proven in Section 1.1.4, where I showed that for any solution the value of the i-th element of d^l is not important, as long as the ratio of the i-th element to every other element is the same as in any correct solution. In other words, a scalar multiple of any correct solution d^l is also a solution.

In Subsection 1.1.3 it was shown that {∂p(d)/∂d_i}_ii > 0 and tr(P_DZ) = k, so there is at least one j ≠ i for which {∂p(d)/∂d_i}_jj < 0. Therefore, if the element d_i is increased, the ratio of {p(d)}_i to at least one element {p(d)}_j will increase; in practice this usually holds for most j. Thus, by increasing the element d_i, the ratio of {p(d)}_i to most of the remaining elements of p(d) will be increased. Therefore, let me construct an algorithm which in each iteration increases elements of d^l until the ratios of those elements give the correct solution.

Such an algorithm is given by the following procedure. Let the objective function be given by (5), i.e.

S1(d^l) = ( p(d^l) − (k/n)ı_n )' ( p(d^l) − (k/n)ı_n ).

Then the procedure of updating the vector d^l in the l-th iteration is the following

d^{l+1} = d^l ◦ β^l,   (21)

where β^l is an n × 1 vector which scales the elements of d^l in each iteration. This scaling is based on the observation made above, that by increasing d^l_i the ratio of p(d^l)_i to most of the remaining elements of p(d^l) will also be increased. Therefore, let me set the vector β^l so that it increases the values of the elements of d^l until the vector p(d^l) has equal elements, or at least p(d^l) ≈ (k/n)ı_n. Such a vector can be given by

β^l = ( (p(d^l)_1)^λ, ..., (p(d^l)_n)^λ )',   (22)

where λ ∈ (−1, 0). The notation in (22) can be a little confusing; the vector β^l is simply the vector p(d^l) with every element raised to the power λ. Notice that, since λ ∈ (−1, 0) and the elements of p(d^l) are at most one, the elements of β^l are greater than or equal to one. Then, by applying β^l to the updating procedure given by (21), one obtains a value of every element of d^{l+1} which fulfills d^{l+1}_i ≥ d^l_i, since β^l_i ≥ 1.

The updating procedure of the vector d^l given by (21) converges (as will be argued below), and it is repeated until the stopping criterion S1(d^l) < ε is fulfilled.

Now, let me show that the updating procedure (21), which uses β^l given by (22), converges to the desired solution. First, notice that if p(d^l)_i ≥ p(d^l)_j for i ≠ j, then the value of d^l_j grows faster than d^l_i, i.e.

(d^{l+1}_j − d^l_j)/d^l_j ≥ (d^{l+1}_i − d^l_i)/d^l_i,   (23)

since β^l_j ≥ β^l_i. This is always true, since it follows directly from the updating procedure given by (21). Then there are two possibilities for the change of the values p(d^l)_i and p(d^l)_j between iterations l and l + 1. First, if

{∂p(d^l)/∂d^l_j}_jj ≥ {∂p(d^l)/∂d^l_i}_ii ≥ 0,   (24)

then it is obvious that p(d^l)_j increases faster with respect to d^l_j than p(d^l)_i increases with respect to d^l_i. Therefore, since the increase of d^l_j is greater than the increase of d^l_i (inequality (23)), the increase of p(d^l)_j is surely bigger than the increase of p(d^l)_i, i.e.

(p(d^{l+1})_j − p(d^l)_j)/p(d^l)_j ≥ (p(d^{l+1})_i − p(d^l)_i)/p(d^l)_i.

In this case the difference between p(d^l)_i and p(d^l)_j will be smaller in the next iteration, which is the desired result. If the property given by (24) were always true, the difference between p(d)_i and p(d)_j would vanish after some number of iterations, and the solution p(d)_i ≈ p(d)_j for every i and j would be obtained relatively quickly. However, for p(d^l)_i ≥ p(d^l)_j the property (24) does not always have to be true.

Another situation is possible, when for p(d^l)_i ≥ p(d^l)_j it follows that

0 ≤ {∂p(d^l)/∂d^l_j}_jj < {∂p(d^l)/∂d^l_i}_ii.

Thus, the increase of p(d)_j with respect to d^l_j is no longer faster than the increase of p(d)_i with respect to d^l_i. Then, even though d^l_j grows faster than d^l_i, as given by (23), the increase of the value of p(d)_i may be bigger than the increase of p(d)_j, i.e.

(p(d^{l+1})_j − p(d^l)_j)/p(d^l)_j < (p(d^{l+1})_i − p(d^l)_i)/p(d^l)_i.

In that case the difference between the values of p(d)_i and p(d)_j will increase in the following iteration. It would seem that this is an undesired outcome; notice, however, that if p(d^{l+1})_i > p(d^l)_i, then β^{l+1}_i < β^l_i, which means that the increase of d^l_i gets smaller after iteration l + 1. This is just a consequence of the updating procedure given by (21), which sets d^{l+2}_i = β^{l+1}_i d^{l+1}_i. Similarly, in the case of p(d)_j, we have p(d^{l+1})_j < p(d^l)_j and then β^{l+1}_j > β^l_j, which is to say that the increase of d^l_j gets bigger after iteration l + 1. Therefore, if this second situation persists in subsequent iterations, the increase of p(d^l)_i gets smaller through the iterations, while the increase of p(d^l)_j gets bigger. Then, after some number of iterations, β_i should be small enough and β_j big enough to make the increase of p(d)_j bigger than the increase of p(d)_i, which decreases the difference between the values of p(d)_i and p(d)_j. Therefore, this difference converges to 0, or at least to a very small value, and thus the values of the elements d_j and d_i also converge to the desired result.
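A sketch of Algorithm 4 (my own, not the thesis's code): the purely multiplicative update d ← d ◦ β with β_i = (p(d)_i)^λ and λ ∈ (−1, 0), which exploits the scale invariance of the solution. The value λ = −0.5, the tolerance and the iteration cap are my choices.

```python
# Algorithm 4: multiplicative scaling of d until p(d) is (approximately) balanced.
import numpy as np

def p_of_d(Z, d):
    DZ = d[:, None] * Z
    return np.einsum('ij,jk,ik->i', DZ, np.linalg.inv(DZ.T @ DZ), DZ)

def algorithm4(Z, lam=-0.5, eps=1e-6, max_iter=10_000):
    n, k = Z.shape
    d = np.full(n, np.sqrt(k / n))
    target = np.full(n, k / n)
    for _ in range(max_iter):
        p = p_of_d(Z, d)
        if np.sum((p - target) ** 2) < eps:     # stopping rule S1(d) < eps
            break
        d = d * p ** lam        # beta_i = p(d)_i^lambda >= 1 since p(d)_i <= 1
    return d

rng = np.random.default_rng(5)
Z = rng.normal(100, 15, size=(30, 10))
print(np.round(p_of_d(Z, algorithm4(Z)), 3))
```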

Algorithm 5

The last algorithm is a combination of the procedures used in the General Algorithm and Algorithm 4. In this algorithm, similarly to Algorithm 4, I use the knowledge that {∂p(d)/∂d_i}_ii > 0. However, instead of increasing each d_i until the algorithm converges, I use the information about the value of p(d^l)_i − k/n to decide on the direction of the change of d_i. It is therefore convenient to construct an algorithm which increases the value of d_i when p(d^l)_i − k/n < 0 and decreases it otherwise. The updating procedure will thus be as in the General Algorithm, i.e. the actual vector d^l is updated by adding a vector δ^l to obtain the new vector d^{l+1}.

Algorithm 5 is given by the following procedure. As in the case of Algorithm 4, the algorithm minimizes the objective function given by (5), i.e.

S1(d^l) = ( p(d^l) − (k/n)ı_n )' ( p(d^l) − (k/n)ı_n ).

Knowing from Subsection 1.1.3 that {∂p(d^l)/∂d_i}_ii > 0, it is obvious that by increasing the value of the element d_i the value of p(d)_i will also increase. Then, by using the information about the actual value of p(d^l)_i in the l-th iteration and by computing the difference between this value and the desired one, i.e. k/n, it can easily be checked whether the value p(d)_i should be increased or decreased. Thus, if p(d^l)_i − k/n > 0, the actual value of p(d^l)_i is bigger than desired and the value of d_i should be decreased in order to decrease the value of p(d^l)_i, whereas when p(d^l)_i − k/n < 0, the value of d_i should be increased.

Now, let me introduce the updating procedure based on the above logic. Let the vector δ^l give the direction of minimization and the size of the update in the l-th iteration of the algorithm. Then, as in the General Algorithm, the updating procedure of the vector d^l is the following

d^{l+1} = d^l + δ^l,   (25)

where δ^l is given by

δ^l = −d^l ◦ ( p(d^l) − (k/n)ı_n ).   (26)

Of course, as before, the updating step given by (25) should be repeated until the value of the objective function is small enough, i.e. S1(d^l) < ε.

Let me show that the updating procedure which uses δ^l given by (26) converges to the correct solution. First, notice that every element of δ^l is a product of two elements, d^l and (p(d^l) − (k/n)ı_n). The first element, d^l, gives the magnitude of the update, i.e. if d^l_i is large then the change of its value will also be large, and if d^l_i is small then the change will be small. This ensures that the change of each element of the vector d^l is adjusted to the actual value of d^l. The second element, (p(d^l) − (k/n)ı_n), influences the update in two ways.

Firstly, the direction of the change depends on the sign of (p(d^l) − (k/n)ı_n). For example, if the i-th element of (p(d^l) − (k/n)ı_n) is positive, then the i-th element of δ^l is negative and the value of d_i is decreased. Then, by the updating procedure given by (25), the i-th element of p(d^{l+1}) gets smaller, since {∂p(d)/∂d_i}_ii > 0, and this is a desired result, since initially the i-th element of p(d^l) was too large. Secondly, the magnitude of the change depends on the absolute value of the i-th element of (p(d^l) − (k/n)ı_n). If this absolute value is large, then the actual value of d^l_i is far from the desired solution and the value of d^l_i should also be changed significantly. This is achieved because the value of d^l_i in (26) is scaled by the i-th element of (p(d^l) − (k/n)ı_n), which has a large absolute value, so the i-th element of δ^l is also large. Additionally, notice that if the value of d^l_i comes close to the solution, the value of the i-th element of (p(d^l) − (k/n)ı_n) gets smaller; finally, when the solution is obtained, the i-th element of (p(d^l) − (k/n)ı_n) is zero and therefore δ^l_i is also zero (or at least, when the actual solution is close to the desired solution, the values of (p(d^l) − (k/n)ı_n)_i and δ^l_i are close to 0). Then the procedure given by (25) does not update d_i any more, i.e. the updating procedure has converged for the i-th element of d.
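A sketch of Algorithm 5 (my own, not the thesis's code), i.e. the update d ← d + δ with δ = −d ◦ (p(d) − (k/n)ı_n); the tolerance and iteration cap are my choices.

```python
# Algorithm 5: increase d_i when p(d)_i < k/n and decrease it otherwise.
import numpy as np

def p_of_d(Z, d):
    DZ = d[:, None] * Z
    return np.einsum('ij,jk,ik->i', DZ, np.linalg.inv(DZ.T @ DZ), DZ)

def algorithm5(Z, eps=1e-6, max_iter=10_000):
    n, k = Z.shape
    d = np.full(n, np.sqrt(k / n))
    target = np.full(n, k / n)
    for _ in range(max_iter):
        r = p_of_d(Z, d) - target
        if r @ r < eps:                # stopping rule S1(d) < eps
            break
        d = d - d * r                  # delta = -d o (p(d) - k/n)
    return d

rng = np.random.default_rng(6)
Z = rng.normal(100, 15, size=(30, 10))
print(np.round(p_of_d(Z, algorithm5(Z)), 3))
```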


1.3 Simulations of the solution of the problem

In this section I present the results of simulations made in order to compare the algorithms presented in Section 1.2. These simulations give examples of solving the problem with each of the algorithms, for a few different specifications of an n × k, f.c.r. matrix Z. The main attribute to be compared is the time needed to obtain a solution, assuming some level of accuracy of the results. Before comparing this, however, it is checked whether the results of these algorithms are the same, i.e. whether all algorithms give the same vector d as a solution.³

Every algorithm requires an initial value of the vector d. Moreover, to make the algorithms comparable, in particular with respect to the time of convergence, this initial value of the vector d should be the same for all algorithms in every simulation. Therefore, let the initial value of the vector d, for every simulation, be equal to √(k/n) ı_n. This value appeared to be the best (i.e. the convergence of the algorithms was fastest) among the values that were examined.

1.3.1 Consistency of solutions

In the first simulation I check whether the algorithms give the same solution. Therefore, I perform a simulation in which all algorithms solve the problem for a single-block n × k matrix Z.

In order to make the presentation of the results clear, the size of the matrix Z is set small: let n = 30 and k = 10. The entries of the matrix Z are i.i.d. draws from N(100, 15). To ensure that the results are precise, the stopping value ε is set to ε = 10⁻¹⁰. The outcomes of each of the algorithms, i.e. the values of d for which the projection matrix P_DZ has a balanced diagonal, can be found in Table 1.

³ All algorithms were implemented using Matlab R2011b for the Linux platform.

No. of d_i   Algorithm 1   Algorithm 2   Algorithm 3   Algorithm 4   Algorithm 5
 1           0.0357        0.0357        0.0358        0.0357        0.0357
 2           0.0396        0.0396        0.0397        0.0396        0.0396
 3           0.0385        0.0385        0.0385        0.0385        0.0385
 4           0.0149        0.0149        0.0149        0.0149        0.0149
 5           0.0328        0.0328        0.0329        0.0328        0.0328
 6           0.0417        0.0417        0.0417        0.0417        0.0417
 7           0.0390        0.0390        0.0391        0.0390        0.0390
 8           0.0272        0.0272        0.0272        0.0272        0.0272
 9           0.0423        0.0423        0.0424        0.0423        0.0423
 10          0.0202        0.0202        0.0202        0.0202        0.0201
 11          0.0288        0.0288        0.0288        0.0288        0.0288
 12          0.0475        0.0475        0.0476        0.0475        0.0475
 13          0.0338        0.0338        0.0339        0.0338        0.0338
 14          0.0293        0.0293        0.0293        0.0293        0.0293
 15          0.0464        0.0464        0.0464        0.0464        0.0464
 16          0.0260        0.0260        0.0261        0.0260        0.0260
 17          0.0223        0.0223        0.0223        0.0223        0.0223
 18          0.0212        0.0212        0.0212        0.0212        0.0212
 19          0.0329        0.0329        0.0329        0.0329        0.0329
 20          0.0409        0.0409        0.0409        0.0409        0.0409
 21          0.0512        0.0512        0.0512        0.0512        0.0512
 22          0.0184        0.0184        0.0184        0.0184        0.0184
 23          0.0201        0.0201        0.0201        0.0201        0.0201
 24          0.0341        0.0341        0.0341        0.0341        0.0341
 25          0.0431        0.0431        0.0432        0.0431        0.0431
 26          0.0383        0.0383        0.0384        0.0383        0.0383
 27          0.0468        0.0469        0.0469        0.0468        0.0469
 28          0.0292        0.0292        0.0292        0.0292        0.0292
 29          0.0250        0.0250        0.0249        0.0250        0.0250
 30          0.0326        0.0326        0.0326        0.0326        0.0326

Table 1: The values of d for which the projection matrix P_DZ has a balanced diagonal, obtained by Algorithms 1-5 for the single-block 30 × 10 matrix Z.


In order to compare the results given by the algorithms, the solution vectors d presented in the columns of Table 1 were normalized. From Table 1 it can be noted that the results differ slightly. However, these differences are so small that they can be neglected; in other words, after normalization the results given by each algorithm are identical. This statement seems reasonable, since the difference between the results is probably an effect of approximations and rounding during the computations.

1.3.2 The speed of convergence of algorithms

Now, knowing that the results given by the algorithms are the same, I compare their speed of convergence. To do so, simulations are performed for a few different specifications of the matrix Z. Firstly, I check the speed of convergence for a rather standard specification of the matrix Z, i.e. Z is set as a block diagonal matrix with l blocks of the same size, where the entries of each block are i.i.d. normally distributed. Then I consider the case of a single-block matrix Z whose entries can be equal to 0 or 1; for such entries, I also investigate the convergence of the algorithms when the matrix Z is sparse or dense. Finally, I perform some simulations to check whether it is possible to obtain satisfying results when the matrix Z is a block diagonal matrix with blocks of unequal size, where instead of the matrix Z an approximation of it is used.

The case of a standard specification of the matrix Z

The first simulation is conducted for a matrix Z consisting of l diagonal blocks Z_{i,i} of the same size and zeros elsewhere. Thus, the matrix Z is a matrix direct sum, given by

Z = Z_{1,1} ⊕ Z_{2,2} ⊕ ... ⊕ Z_{l,l} = ⊕_{i=1}^{l} Z_{i,i}.

Let us construct four matrices Z, each of them consisting of l = 10 blocks Z_{i,i}.


Let these matrices be denoted by Z_a, Z_b, Z_c and Z_d. The size of each of these matrices is given in Table 2.

Matrix   Size of the block Z_{i,i}   Size of the whole Z
Z_a      10 × 3                      100 × 30
Z_b      50 × 5                      500 × 50
Z_c      100 × 10                    1000 × 100
Z_d      300 × 20                    3000 × 200

Table 2: The size of the matrices Z_a, Z_b, Z_c and Z_d. The second column gives the size of the corresponding block Z_{i,i} and the third column gives the size of the whole matrix.

Also, let the value of every entry of the matrices Z_{i,i} be i.i.d. drawn from N(100, 15). To perform the simulation, I set the value of the stopping criterion to ε = 10⁻⁶ and the starting value of d to √(k/n) ı_n. The results of the computations are given in Table 3.

              Z_a      Z_b      Z_c     Z_d
Algorithm 1   11.788   455.77   1832    -
Algorithm 2   1.6043   7.0081   493.9   2282
Algorithm 3   0.0752   1.0001   5.02    130
Algorithm 4   0.1782   0.2822   3.65    30
Algorithm 5   0.0682   0.0932   1.65    7

Table 3: Computation time. The time of convergence of Algorithms 1-5 for the matrices Z_a, Z_b, Z_c and Z_d, when l = 10.

As can be seen from the table above, in all cases the algorithm that converges fastest is Algorithm 5. Notice that for the matrix Z_d Algorithm 1 did not converge at all; it is possible that Algorithm 1 would eventually converge, but the execution of the algorithm was interrupted after a reasonable amount of time.


The next simulation concerns the case of a single-block matrix Z, i.e. l = 1. Checking this particular case can be very important, since in most practical cases the matrix Z has such a structure. Therefore, let us again construct four matrices Z, i.e. Z_a, Z_b, Z_c and Z_d. Let the total sizes of these matrices be as in the first simulation, i.e. let Z_a be 100 × 30, Z_b be 500 × 50, Z_c be 1000 × 100 and Z_d be 3000 × 200. Also, let every entry of these matrices be i.i.d. drawn from N(100, 15). As before, the starting value of the vector d is set to √(k/n) ı_n and the value of the stopping criterion to ε = 10⁻⁶. The results of the computations are the following.

              Z_a      Z_b      Z_c     Z_d
Algorithm 1   2.4112   62.582   1504    -
Algorithm 2   0.5949   6.64     100.9   2156
Algorithm 3   0.0249   0.5322   2.85    80
Algorithm 4   0.0248   0.2832   2.65    34
Algorithm 5   0.0148   0.0232   1.65    28

Table 4: Computation time. The time of convergence of Algorithms 1-5 for the matrices Z_a, Z_b, Z_c and Z_d, when l = 1.

Similar to the previous simulation, Algorithm 5 was the fastest and Algorithm 1 failed to converge for the matrix Z_d.

The case of binary entries of the matrix Z

An interesting case, also from a practical point of view, is when the entries of Z are realizations of a Bernoulli distribution with values 1 and 0.⁴ Therefore, I examine whether the algorithms converge to the correct solution and, in the cases where they do, I compare the time of convergence. Additionally, it is checked whether a solution can be obtained for sparse and dense matrices Z, i.e. when most of the entries of the matrix Z are zeros (a sparse matrix) or ones (a dense matrix).

⁴ As will be presented in Section 2, in the application of these algorithms the matrix Z is the matrix of instruments.

Thus, let Z be a one-block matrix of size 100 × 10, with entries that are realizations of Bernoulli(p), where p = 0.01, 0.5, 0.99. In the case p = 0.01 the matrix Z is a sparse matrix; therefore, to ensure that Z has full column rank, in each row one entry was randomly set to 1. A similar adjustment was made for p = 0.99, where one entry in each row was randomly set equal to 0. In practice, when p = 0.5 the matrix Z is almost always f.c.r., so it was not adjusted as in the cases p = 0.01 and p = 0.99. Then, as before, the starting value of d is set equal to √(k/n) ı_n and the stopping criterion to ε = 10⁻⁶. The computation times are given in Table 5.

              p = 0.01   p = 0.5   p = 0.99
Algorithm 1   -          42.03     0.0461
Algorithm 2   4.793      0.444     0.0761
Algorithm 3   -          0.040     0.0612
Algorithm 4   -          0.043     0.0511
Algorithm 5   -          0.033     0.0452

Table 5: Computation time. The time of convergence of Algorithms 1-5 for a matrix Z of size 100 × 10 and p = 0.01, 0.5, 0.99.

As can be seen from Table 5, when the probability was set to p = 0.01 the correct result was obtained only by Algorithm 2; the other algorithms did not converge. In the cases p = 0.5 and p = 0.99 all algorithms gave correct results, and in both of these cases Algorithm 5 had the shortest time of convergence.

The case of a matrix Z for which P_DZ is difficult to diagonally balance

Recalling the results given in Section 1.1.2, the projection matrix P_DZ can be transformed into a projection matrix with a balanced diagonal only if the matrix Z is a direct matrix sum of blocks with equal ratios of the number of columns to the number of rows. However, in the case when Z is defined as a direct matrix sum of blocks of different size, one can try to construct a matrix Z^α ≈ Z for which diagonal balancing is possible.

Consider a matrix Z which consists of l diagonal blocks Z_{i,i} of different sizes and ratios, so that, by the results of Section 1.1.2, the projection matrix P_DZ cannot have a balanced diagonal. One could then construct a matrix Z^α consisting of the same diagonal blocks as Z, but with the remaining entries, instead of 0-s, set equal to realizations of N(0, α), where α is relatively small compared to the elements in Z_{i,i}. The matrix Z^α established in this way is a one-block matrix, and therefore a balanced diagonal can be obtained. Thus, the task is to check whether the matrix D^α, for which P_{D^αZ^α} has a balanced diagonal, also gives a matrix P_{D^αZ} with a balanced diagonal, or at least a diagonal close to balanced.

with a balanced diagonal or at least the diagonal will be close to balanced. To do this, let the Z1,1be a 20×5 matrix, Z2,2be 30×5 and Z3,3be 50×5, where the

values of all entries of these matrices are i.i.d drawn from N (100, 15). Then, the matrix Zαhas blocks Z

1,1, Z2,2, Z3,3on the diagonal and the remaining elements are

equal to the realizations of N (0, α), where α = 10−10, 10−5, 10−1, 10. Having such

a matrix Zα, I run every algorithm to obtain Dα. As usual, the starting value is set

asqk

nın and the stopping value as ε = 10

−6. The results obtained by Algorithms

2, 3, 4, 5 were correct i.e. the matrix PDαZα is diagonally balanced, for every value

of α. Algorithm 1 did not converge to the nal result for α = 10−10, 10−5. Then,

by using these results, let me check whether the obtained Dα is also the solution

for PDαZ. To do so, I have calculated the value of the objective function given by

(5) for Dα obtained from algorithms, with the exception of Algorithm 1 for values

α = 10−10, 10−5. The results are given in Table 6.

α        Algorithm 1   Algorithm 2   Algorithm 3   Algorithm 4   Algorithm 5
10⁻¹⁰    -             0.2501        0.3412        0.3405        0.2403
10⁻⁵     -             0.2499        0.3206        0.3201        0.2301
10⁻¹     1.4331        0.3013        0.3013        0.3012        0.3009
10       1.7784        0.8003        0.8001        0.8003        0.8001

Table 6: The value of the objective function S1(d) for the matrices D^α, where α = 10⁻¹⁰, 10⁻⁵, 10⁻¹, 10, obtained by Algorithms 1-5.

As can be seen from the table above, S1(d) is much bigger than the usually assumed tolerance level ε, i.e. 10⁻⁶. It can therefore be concluded that when the blocks of Z have different ratios of columns to rows, the matrix D^α obtained from the approximation Z^α does not give a (nearly) balanced diagonal for P_{D^αZ}.


2 Consistent IV estimators under many-instruments asymptotics with heteroskedasticity

In this section I consider instrumental variable estimation, and in particular consistent estimation under many-instruments asymptotics. Most of the attention will be devoted to the LIML estimator, i.e. limited information maximum likelihood estimation. As shown by Bekker and van der Ploeg (2005), LIML is generally inconsistent under many-instruments asymptotics with heteroskedasticity. Therefore, I will present estimators like HLIM and HFUL, introduced in Hausman et al. (2011), and SJEF, given in Bekker and Crudu (2012), which have been shown to be consistent under many-instruments asymptotics with heteroskedasticity.

However, as will be shown, it is possible to obtain a modified version of LIML which is consistent under heteroskedasticity. In order to construct such an estimator, the results derived in Section 1 will be used. In particular, I will use the matrix D for which the projection matrix P_DZ has a balanced diagonal, and which will scale the observations of the IV model.

The procedure of obtaining such an estimator will be introduced, and this estimator will be called the balanced limited information maximum likelihood estimator or BLIM. Additionally, I will also introduce the Fuller (1977) modification of BLIM, which will be called FBLIM. It will then be shown that the BLIM estimator is consistent under many-instruments asymptotics.

Finally, by means of Monte Carlo simulations, the HLIM, HFUL and SJEF estimators will be compared with the BLIM and FBLIM estimators.

2.1 The model

Consider observations in the n-vector y and the n × l matrix X which satisfy

y = Xβ0 + u,
X = ZΠ + V,   (27)

where β0 is an l-vector, Π is the k × l matrix of unknown parameters, Z is the n × k matrix of instruments, and u and V are respectively the n-vector and the n × l matrix of disturbances. Then, similar to Bekker and Crudu (2012), let me assume that Z is nonrandom and that the disturbances (u_i, V_i) are independent, with zero mean and covariance matrices given by

Σ_i = [ σ_i²    σ_12i' ]
      [ σ_21i   Σ_22i  ],   (28)

where Σ_22i is an l × l covariance matrix, σ_21i and σ_12i are l-vectors, and σ_i² is a scalar. The model given by (27) can be rewritten in a reduced form, which will be more convenient for further use. Let this reduced form be given as follows

(y, X) = ZΠ(β0, I_l) + (v, V),   (29)

where

(v, V) = (u, V) [ 1    0'  ]
                [ β0   I_l ].   (30)

Using the reduced form given by (29), it is easy to observe that the covariance matrices of the rows (y_i, X_i) are given by

Ω_i = [ 1   β0' ] Σ_i [ 1    0'  ]
      [ 0   I_l ]     [ β0   I_l ].

In order to identify the parameters of model (29), and therefore to be able to estimate them, some general assumptions concerning the matrices Z and X have to be made. Davidson and MacKinnon (2004) state that for identification of (29) it is required that

plim_{n→∞} (1/n) Z'X = S_{Z'X}   (31)

is deterministic and has full column rank. It is easy to notice that this assumption implies that l ≤ k, which means that the number of instruments has to be at least as large as the number of explanatory variables.

Additionally, it is also assumed that Z'Z is nonsingular and

plim_{n→∞} (1/n) Z'Z = S_{Z'Z} > 0.   (32)

2.2 The LIML estimator and some other estimators

In this section I present a few estimators of the parameters of model (29). Most attention will be devoted to the limited information maximum likelihood estimator; the consistency of this estimator will be checked both under large sample asymptotics and under many-instruments asymptotics. As will be seen, and as shown by Bekker and van der Ploeg (2005), the LIML estimator is inconsistent under many-instruments asymptotics with heteroskedasticity. Therefore, I will present a procedure of obtaining BLIM and FBLIM, which are consistent in this case.

2.2.1 The LIML estimator

According to Davidson and MacKinnon (2004), the LIML estimator for the model can be written as the solution of the following minimization problem

β̂_LIML = argmin_β [ (1, −β')(y, X)'P(y, X)(1, −β')' ] / [ (1, −β')(y, X)'(I_n − P)(y, X)(1, −β')' ],   (33)

where P = Z(Z'Z)^{-1}Z' is the matrix of the projection onto the space spanned by the instruments. Obviously, (33) can be solved by minimizing the following objective function

Q_LIML((1, −β')') = [ (1, −β')(y, X)'P(y, X)(1, −β')' ] / [ (1, −β')(y, X)'(I_n − P)(y, X)(1, −β')' ].   (34)

The objective function (34) is minimized by β̂_LIML, which can be found by the procedure given below. First, let me introduce the following notation

A = (y, X)'P(y, X),   B = (y, X)'(I_n − P)(y, X),   x = (1, −β')',   (35)

and notice that

β = −(0, I_l)x / (e_1'x).   (36)

By applying this notation to the objective function given in (34), the following reformulation of the objective function is obtained

Q_LIML(x) = (x'Ax) / (x'Bx).   (37)


Then, in order to minimize (37), the first-order conditions can be used. By differentiating (37) with respect to x and setting the result to zero, the following is obtained

2Ax(x'Bx) − 2Bx(x'Ax) = 0.   (38)

Dividing (38) by 2x'Bx gives

Ax − (x'Ax / x'Bx) Bx = 0,

and, by substituting (37),

Ax − Q_LIML(x) Bx = 0.   (39)

Davidson and MacKinnon (2004) show that the set of first-order conditions equivalent to (39) can be written as

{ B^{-1/2} A B^{-1/2} − Q_LIML(x) I_{l+1} } B^{1/2} x = 0.

They also note that this set of first-order conditions is equivalent to an eigenvalue problem for a real symmetric matrix. Therefore, the smallest value of Q_LIML(x) is the smallest eigenvalue of the matrix B^{-1/2}AB^{-1/2}. The argument for which the value of Q_LIML(x) is smallest, i.e. x̂, is then given by

x̂ = B^{-1/2} z,   (40)

where z is the eigenvector of B^{-1/2}AB^{-1/2} corresponding to the smallest eigenvalue. Therefore, substituting x̂ into (36), β̂_LIML is obtained and given by

β̂_LIML = −(0, I_l) B^{-1/2} z / (e_1' B^{-1/2} z).   (41)

This value of β̂_LIML solves the following moment condition

[ A − Q_LIML(x̂) B ] (1, −β̂'_LIML)' = 0.   (42)


Another explicit formulation of LIML and its Fuller (1977) modification, given in Bekker and Crudu (2012), is

β̂ = { X'PX − λ_h X'(I_n − P)X }^{-1} { X'Py − λ_h X'(I_n − P)y },
λ_h = λ − α/(n − k),   λ = 1/λ_max[A^{-1}B],   (43)

where LIML is obtained for α = 0 and the Fuller modification for α = 1.

2.2.2 The consistency of the LIML estimator

Since the LIML estimator is the main object of interest, in this section it is checked whether this estimator is consistent under large sample asymptotics and under many-instruments asymptotics, in the cases of both homoskedasticity and heteroskedasticity.

Large sample asymptotics

Firstly, I will consider the consistency of the LIML estimator in the case of large sample asymptotics, i.e. $n \to \infty$ and fixed $k$. In order to show that the LIML estimator is consistent in this case, I will show that in the probability limit the value of $\hat\beta_{LIML}$ given by (33) is equal to $\beta_0$, i.e.
\[
\operatorname*{plim}_{n\to\infty} \hat\beta_{LIML} = \beta_0.
\]

That is to say, asymptotically $Q_{LIML}(x)$ is minimized by $\beta_0$. Therefore, let me consider $A$ and $B$ given in (35), for which the expectations are given by
\[
\mathrm{E}(A) = (\beta_0,I_l)'\Pi'Z'Z\Pi(\beta_0,I_l) + \sum_{i=1}^{n} P_{ii}\Omega_i, \tag{44}
\]
and
\[
\mathrm{E}(B) = \sum_{i=1}^{n} (1-P_{ii})\Omega_i. \tag{45}
\]


Additionally, it is assumed that the scaled sample moments converge to the limits of their expectations, i.e.
\[
\operatorname*{plim}_{n\to\infty} \frac{A}{n} = \lim_{n\to\infty} \mathrm{E}\!\left(\frac{A}{n}\right), \tag{46}
\]
and
\[
\operatorname*{plim}_{n\to\infty} \frac{B}{n} = \lim_{n\to\infty} \mathrm{E}\!\left(\frac{B}{n}\right) = \lim_{n\to\infty} \frac{\sum_{i=1}^{n}(1-P_{ii})\Omega_i}{n}. \tag{47}
\]

Then, by using assumptions (46) and (47), the probability limit of the objective function $Q_{LIML}(x)$ can be described by
\[
\operatorname*{plim}_{n\to\infty} Q_{LIML}(x) = \operatorname*{plim}_{n\to\infty} \frac{(1,-\beta')\,A/n\,(1,-\beta')'}{(1,-\beta')\,B/n\,(1,-\beta')'} = \lim_{n\to\infty} \frac{(1,-\beta')\mathrm{E}(A)(1,-\beta')'}{(1,-\beta')\mathrm{E}(B)(1,-\beta')'},
\]
which, by applying (44) and (45), is equal to
\[
\lim_{n\to\infty} \frac{(1,-\beta')(\beta_0,I_l)'\Pi'Z'Z\Pi(\beta_0,I_l)(1,-\beta')'}{(1,-\beta')\sum_{i=1}^{n}(1-P_{ii})\Omega_i(1,-\beta')'} + \lim_{n\to\infty} \frac{(1,-\beta')\sum_{i=1}^{n}P_{ii}\Omega_i(1,-\beta')'}{(1,-\beta')\sum_{i=1}^{n}(1-P_{ii})\Omega_i(1,-\beta')'}. \tag{48}
\]
Recalling assumption (32), notice that $\Pi'Z'Z\Pi$ is positive semi-definite.

Therefore, it should be clear that the first term in (48) will be minimized for $\beta = \beta_0$, since
\[
(1,-\beta')(\beta_0,I_l)'\Pi'Z'Z\Pi(\beta_0,I_l)(1,-\beta')' = 0 \tag{49}
\]
for $\beta = \beta_0$.

In case of homoskedastic errors, i.e. $\Omega_i = \Omega$, the $\operatorname*{plim}_{n\to\infty} Q_{LIML}(x)$ becomes
\[
\operatorname*{plim}_{n\to\infty} Q_{LIML}(x) = \lim_{n\to\infty} \frac{(1,-\beta')(\beta_0,I_l)'\Pi'Z'Z\Pi(\beta_0,I_l)(1,-\beta')'}{(1,-\beta')\Omega(1,-\beta')'\sum_{i=1}^{n}(1-P_{ii})} + \lim_{n\to\infty} \frac{(1,-\beta')\Omega(1,-\beta')'\sum_{i=1}^{n}P_{ii}}{(1,-\beta')\Omega(1,-\beta')'\sum_{i=1}^{n}(1-P_{ii})} \tag{50}
\]

and it is clearly minimized by $\beta = \beta_0$, since the result given in (49) holds and
\[
\lim_{n\to\infty} \frac{\sum_{i=1}^{n}P_{ii}}{\sum_{i=1}^{n}(1-P_{ii})} = 0. \tag{51}
\]

Therefore, the LIML estimator, which minimizes $Q_{LIML}(x)$, is consistent.

A similar result occurs when the error terms are heteroskedastic, i.e. $\Omega_i$ differs with respect to $i$. Then, $\operatorname*{plim}_{n\to\infty} Q_{LIML}(x)$ is given by (48). Since the heteroskedasticity only affects the noisy part of the expectation of $A$, i.e. $\sum_{i=1}^{n}P_{ii}\Omega_i$, the result

given in (49) still holds. Therefore, if assumptions (32), (46) and (47) hold, then $\operatorname*{plim}_{n\to\infty} Q_{LIML}(x)$ is minimized at $\beta = \beta_0$, since
\[
\lim_{n\to\infty} \frac{(1,-\beta')\sum_{i=1}^{n}P_{ii}\Omega_i(1,-\beta')'}{(1,-\beta')\sum_{i=1}^{n}(1-P_{ii})\Omega_i(1,-\beta')'} = 0.
\]

Thus, the LIML estimator remains consistent under large sample asymptotics with heteroskedasticity.

Many-instruments asymptotics

In case of many-instruments asymptotics it is assumed that not only the sample size grows to infinity, i.e. $n \to \infty$, but also the number of instruments $k \to \infty$. However, it is also assumed that both of these values increase at the same rate, i.e. for $n \to \infty$ and $k \to \infty$ there exists a fixed number $0 \le \alpha < 1$ such that $k/n \to \alpha$.

When $k/n \to \alpha$ and the error terms are assumed to be homoskedastic, the result given in (51) no longer holds and becomes
\[
\lim_{n,k\to\infty} \frac{\sum_{i=1}^{n}P_{ii}}{\sum_{i=1}^{n}(1-P_{ii})} = \lim_{n,k\to\infty} \frac{k}{n-k} = \frac{\alpha}{1-\alpha}. \tag{52}
\]
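The ratio in (51) and (52) only depends on the trace of P, which equals k; the following toy check (with arbitrary simulated instruments) contrasts the fixed-k regime with the many-instruments regime.

```python
import numpy as np

rng = np.random.default_rng(1)
for n, k in [(1000, 5), (1000, 300)]:        # fixed k versus k/n = 0.3
    Z = rng.standard_normal((n, k))
    P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T
    ratio = np.diag(P).sum() / (n - np.diag(P).sum())
    print(n, k, round(ratio, 4), round(k / (n - k), 4))   # both columns equal k/(n-k)
```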

However, this does not change the consistency of the LIML estimator, since $\operatorname*{plim}_{n,k\to\infty} Q_{LIML}(x)$ is still minimized by $\beta = \beta_0$. To see this, notice that
\[
\operatorname*{plim}_{n,k\to\infty} Q_{LIML}(x) = \lim_{n,k\to\infty} \frac{(1,-\beta')(\beta_0,I_l)'\Pi'Z'Z\Pi(\beta_0,I_l)(1,-\beta')'}{(1,-\beta')\Omega(1,-\beta')'\sum_{i=1}^{n}(1-P_{ii})} + \lim_{n,k\to\infty} \frac{(1,-\beta')\Omega(1,-\beta')'\sum_{i=1}^{n}P_{ii}}{(1,-\beta')\Omega(1,-\beta')'\sum_{i=1}^{n}(1-P_{ii})} = \frac{\alpha}{1-\alpha} \tag{53}
\]
for $\beta = \beta_0$. Therefore the LIML estimator remains consistent under many-instruments asymptotics with homoskedasticity.


In order to see that the LIML estimator is inconsistent, consider $\operatorname*{plim}_{n,k\to\infty} Q_{LIML}(x)$ given in (48). Under heteroskedasticity, the second term in (48) may not have its minimum at $\beta_0$, and so the objective function will not be minimized at $\beta_0$. Following Hausman et al. (2011), the derivative of the second term in (48) with respect to $\beta$ is given by
\[
\left.\frac{\partial}{\partial\beta}\, \frac{(1,-\beta')\sum_{i=1}^{n}P_{ii}\Omega_i(1,-\beta')'}{(1,-\beta')\sum_{i=1}^{n}(1-P_{ii})\Omega_i(1,-\beta')'}\right|_{\beta=\beta_0} = -2\,\mathrm{Cov}_{\sigma^2}\!\left(P_{ii}, \frac{\mathrm{E}(X_iu_i)}{\sigma_i}\right), \tag{54}
\]
where $\mathrm{Cov}_{\sigma^2}\!\left(P_{ii}, \mathrm{E}(X_iu_i)/\sigma_i\right)$ is the covariance between $P_{ii}$ and $\mathrm{E}(X_iu_i)/\sigma_i$ for the distribution with probability weight $\sigma_i^2/\sum_{i=1}^{n}\sigma_i^2$ for the $i$-th observation. When

\[
\lim_{n,k\to\infty} \mathrm{Cov}_{\sigma^2}\!\left(P_{ii}, \frac{\mathrm{E}(X_iu_i)}{\sigma_i}\right) \neq 0,
\]
the derivative (54) at $\beta_0$ will asymptotically differ from 0 and therefore $Q_{LIML}(x)$ will not be minimized at $\beta_0$. Therefore, the LIML estimator may be inconsistent under many-instruments asymptotics with heteroskedasticity.
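To get a feel for condition (54), the following sketch computes the $\sigma^2$-weighted covariance between $P_{ii}$ and $\mathrm{E}(X_iu_i)/\sigma_i$ for a hypothetical design in which the error variance varies with $P_{ii}$; all quantities here are population values chosen purely for illustration, not the thesis's simulation design.

```python
import numpy as np

# Group-dummy instruments with unequal group sizes give unequal P_ii; if the
# error variance is related to P_ii, the weighted covariance in (54) is nonzero.
rng = np.random.default_rng(2)
n, k = 400, 80
Z = np.zeros((n, k))
Z[np.arange(n), rng.integers(0, k, n)] = 1.0          # unequal group sizes
P = Z @ np.linalg.pinv(Z.T @ Z) @ Z.T
Pii = np.diag(P)

sigma2 = 0.5 + 2.0 * Pii                              # variances chosen to vary with P_ii
g = 0.8 / np.sqrt(sigma2)                             # E(X_i u_i)/sigma_i with E(X_i u_i) = 0.8
w = sigma2 / sigma2.sum()                             # probability weights sigma_i^2 / sum_j sigma_j^2

cov_w = np.sum(w * (Pii - np.sum(w * Pii)) * (g - np.sum(w * g)))
print(cov_w)                                          # nonzero, so the derivative (54) does not vanish
```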

However, if the value of $P_{ii}$ is the same for every $i$, the covariance given in (54) will be equal to zero. Then, it is easy to notice that, since $P_{ii} = k/n$,
\[
\operatorname*{plim}_{n,k\to\infty} Q_{LIML}(x) = \lim_{n,k\to\infty} \frac{(1,-\beta')(\beta_0,I_l)'\Pi'Z'Z\Pi(\beta_0,I_l)(1,-\beta')'}{(1,-\beta')\sum_{i=1}^{n}(1-P_{ii})\Omega_i(1,-\beta')'} + \lim_{n,k\to\infty} \frac{\sum_{i=1}^{n}(1,-\beta')\Omega_i(1,-\beta')'\,P_{ii}}{\sum_{i=1}^{n}(1,-\beta')\Omega_i(1,-\beta')'\,(1-P_{ii})} = \frac{\alpha}{1-\alpha}
\]
for $\beta = \beta_0$. Therefore, by using a projection matrix $P$ with equal diagonal elements, it seems to be possible to obtain a consistent LIML estimator under many-instruments asymptotics with heteroskedasticity.
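As a small check (illustrative only), for group-dummy instruments with equal group sizes the diagonal of P is automatically balanced at k/n:

```python
import numpy as np

n, k = 120, 6
groups = np.repeat(np.arange(k), n // k)      # k groups of exactly n/k observations
Z = np.eye(k)[groups]                         # group-dummy instrument matrix
P = Z @ np.linalg.inv(Z.T @ Z) @ Z.T

print(np.allclose(np.diag(P), k / n))         # True: balanced diagonal, P_ii = k/n
```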

2.2.3 HLIM, HFUL and SJEF estimators


HLIM

The first estimator considered is introduced by Hausman et al. (2011) and is a jackknife version of LIML, called HLIM. This estimation procedure uses a modified LIML objective function, given by

\[
Q_{HLIM}(\beta) = \frac{(1,-\beta')(y,X)'(P-F)(y,X)(1,-\beta')'}{(1,-\beta')(y,X)'(y,X)(1,-\beta')'}, \tag{55}
\]

where $F = \mathrm{Diag}\bigl(\mathrm{diag}(P)\bigr)$ is the diagonal matrix containing the diagonal elements of $P$. Then, the HLIM estimator is formulated in the following way
\[
\hat\beta_{HLIM} = \{X'(P-F)X - \hat\alpha X'X\}^{-1}\{X'(P-F)y - \hat\alpha X'y\}, \tag{56}
\]
where
\[
\hat\alpha = \lambda_{\min}\bigl[\{(y,X)'(y,X)\}^{-1}(y,X)'(P-F)(y,X)\bigr]
\]

and where $\lambda_{\min}$ indicates the smallest eigenvalue. According to Hausman et al. (2011), this estimator is consistent under many-instruments asymptotics with heteroskedasticity. Therefore, I will use this estimator to compare the results with those of the LIML estimator in the Monte Carlo simulations.
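A minimal sketch of (55)-(56) in the notation above (the function name is mine); the data and the projection matrix P are assumed given:

```python
import numpy as np

def hlim(y, X, P):
    """HLIM of Hausman et al. (2011) via (56): F carries the diagonal of P,
    and alpha-hat is the smallest eigenvalue of {W'W}^{-1} W'(P-F)W."""
    W = np.column_stack([y, X])
    F = np.diag(np.diag(P))
    PF = P - F

    alpha = np.min(np.real(np.linalg.eigvals(np.linalg.solve(W.T @ W, W.T @ PF @ W))))

    lhs = X.T @ PF @ X - alpha * (X.T @ X)
    rhs = X.T @ PF @ y - alpha * (X.T @ y)
    return np.linalg.solve(lhs, rhs)
```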

HFUL

The second estimator, HFUL, is a modification of the Fuller (1977) estimator that is also given by Hausman et al. (2011). It is likewise based on the minimization of the objective function given in (55). The formulation of this estimator is similar to that of HLIM and is given by

\[
\hat\beta_{HFUL} = \{X'(P-F)X - \hat\alpha X'X\}^{-1}\{X'(P-F)y - \hat\alpha X'y\}; \tag{57}
\]
however, the value of $\hat\alpha$ is different and given as follows

\[
\hat\alpha = \frac{(n+1)\tilde\alpha - 1}{n + \tilde\alpha - 1}, \qquad \tilde\alpha = \lambda_{\min}\bigl[\{(y,X)'(y,X)\}^{-1}(y,X)'(P-F)(y,X)\bigr].
\]
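HFUL therefore only replaces the scalar: with $\tilde\alpha$ computed exactly as for HLIM, a sketch of the adjustment in (57) is

```python
def hful_alpha(alpha_tilde, n):
    """Fuller-type adjustment of the HLIM eigenvalue used in (57)."""
    return ((n + 1) * alpha_tilde - 1) / (n + alpha_tilde - 1)

# beta_HFUL then follows from the same formula as beta_HLIM,
# with hful_alpha(alpha_tilde, n) in place of alpha_tilde.
```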
