Faculty of Mathematics and Natural Sciences

## Missing value estimation for DNA microarrays

### Bachelor thesis Mathematics

June 2013

Student: F. Doddema

First supervisor: Prof. dr. H.L. Trentelman

Second supervisor: Prof. dr. J. Top

Abstract

A DNA microarray is a biological tool to measure the activity of genes in different samples. These samples may come, for example, from a tumor. A DNA microarray can simultaneously measure the activity of M genes in N samples, where M ≈ 20,000 and N ≈ 20.

The results can be given as an M ×N matrix, the expression matrix E.

Sometimes elements in this matrix are missing due to various factors, for example dust on the chip or machine error. In this thesis we describe several methods to estimate the missing values. We compare these methods using DNA microarray data that is available online.

### Contents

1 Introduction
  1.1 Genes, DNA and RNA
  1.2 DNA microarray
  1.3 Missing values in DNA microarrays
2 Mathematical background
3 Structure of the subsequent sections
4 Zero replacement method
  4.1 How does it work?
  4.2 Algorithm
  4.3 Numerical results
5 Row and column sum mean method
  5.1 How does it work?
    5.1.1 Row sum mean method
    5.1.2 Column sum mean method
  5.2 Algorithm
    5.2.1 Row sum mean method
    5.2.2 Column sum mean method
  5.3 Numerical results
    5.3.1 Row sum mean method
    5.3.2 Column sum mean method
6 Singular Value Decomposition methods
  6.1 SVD and Expression matrix
  6.2 How does it work?
    6.2.1 Estimation method 1 using SVD
    6.2.2 Estimation method 2 using SVD
  6.3 Algorithm
    6.3.1 Estimation method 1 using SVD
    6.3.2 Estimation method 2 using SVD
  6.4 Numerical results
    6.4.1 Estimation method 1 using SVD
    6.4.2 Estimation method 2 using SVD
7 Least Squares imputation method
  7.1 Selecting t genes
  7.2 Regressing with LSimputation
  7.3 Finding t
  7.4 Algorithm
  7.5 Numerical results
8 Fixed Rank Approximation method
  8.1 How does it work?
  8.2 The algorithm
  8.3 Numerical results
9 Conclusion
10 Discussion
A DNA microarray
B Zero replacement method
C Row and column sum mean method
  C.1 Row sum mean method
  C.2 Column sum mean method
D Singular Value Decomposition methods
  D.1 Estimation method 1 using SVD
  D.2 Estimation method 2 using SVD
E Least Squares imputation method
F Fixed Rank Approximation method

### 1 Introduction

Terms such as genes and DNA are used a lot in daily life. DNA is used to identify suspects in criminal cases, but it also tells us a lot about our body. In DNA one can, for example, find the colour of a person's eyes and hair. Actually, one's whole appearance is described in his DNA.

DNA cannot only tell us what someone looks like, it can also tell us something about the inside of a person's body. In this way researchers try to develop drugs against, for example, cancer. In section 1.1 we will give an introduction to genes, DNA and RNA. In section 1.2 we will explain a method, called DNA microarray, that is used to understand the working of genes in particular diseases. In the last section of this introduction, section 1.3, we will explain which problem we will discuss in this thesis and how we will try to solve it with mathematics.

### 1.1 Genes, DNA and RNA

Looking at a human being we see one body. However, if we looked at a human body through a microscope, we would see that it consists of a great many cells. The average size of a human cell is 15 µm (15 × 10^{−6} m). Each cell contains cytoplasm, a fluid that contains all the organelles of a cell, and the cytoplasm is surrounded by a membrane. Some examples of organelles are mitochondria, ribosomes and the nucleus, see figure 1(a). The nucleus of a cell contains DNA, a molecule made of two complementary strands that form a double helix. Each strand is formed by 4 kinds of bases, called C, G, A and T (cytosine, guanine, adenine and thymine). The two strands are held together by hydrogen bonds, which connect matching pairs of bases: C is connected with G and A with T (see figure 1(b)). The nucleus contains a lot of DNA, which is spread among 46 chromosomes. The DNA consists of multiple segments which are called genes. Every cell of a human being contains approximately 30,000 to 40,000 genes. This means that every cell of a human body contains all genes (a cell in the foot, for example, also contains DNA that codes for the colour of your eyes). The particular sequence of bases in a gene plays a very important role in life, because that sequence corresponds to a particular protein.

An example of a protein is hemoglobin, a protein that is involved in the transport of oxygen and carbon dioxide in blood.

(a) Human cell [10] (b) DNA molecule[1]

We hope that it is clear that proteins are very important molecules in life.

Now we will explain how proteins are produced. As said before, the DNA that codes for proteins is situated in the nucleus of a cell. However, proteins are not produced in the nucleus; they are produced by ribosomes, organelles outside the nucleus in the cytoplasm. For this reason one might think that the segment of DNA should be ripped out of the nucleus and transferred to a ribosome. However, the process is a little more complicated. When a gene receives a command to be coded for a protein, the segment of DNA that is needed opens, such that we have two strands of DNA. Then one strand is copied; the copy is called messenger RNA (mRNA), and this process is called transcription. This piece of mRNA leaves the nucleus for a ribosome, where the particular sequence of bases codes for a protein; this process is called translation. Figure 1(c) describes the mechanism of transcription and translation. The collection of all genes that code for proteins is called the genome. [9]

(c) Transcription and Translation[2]

### 1.2 DNA microarray

To measure the activity of thousands of genes in a cell, a DNA microarray is used. A DNA microarray is a chip with thousands of 'holes' in it. In every 'hole' we can find a sequence of single-stranded DNA that corresponds to a gene. When we have some cells from, for example, a tumor, the mRNA in those cells is extracted and labeled fluorescently. This RNA is spread over the chip. When an RNA molecule from the tumor encounters a fragment of DNA that matches, it will form a double helix with that DNA fragment. Genes that do not correspond to the RNA will stay single stranded. Then the chip is washed off, such that only the molecules that found their perfect match remain stuck to the chip.

Using a laser we can measure the expression level of each gene; the result looks like figure 1(d). This can be done because the RNA from the tumor was marked fluorescently. For a more detailed description of the working of a DNA microarray, we refer to [7]. We have described the basic principles of a DNA microarray for only one sample (the tumor). In practice a DNA microarray is run simultaneously for more than one sample, for example 100 samples. The DNA microarray contains the DNA for approximately 20,000 genes. The result of this DNA microarray can be translated into an M × N matrix (a rectangular array with M rows and N columns), where each row corresponds to a gene and each column to a sample. [9] [7]

(d) DNA microarray [7]

### 1.3 Missing values in DNA microarrays

Sometimes elements in the expression matrix are missing due to various factors, for example dust on the chip or machine error. In this thesis we will discuss several methods to estimate the missing values in the expression matrix E. We will discuss the following methods:

1. Zero replacement method

2. Row and column sum mean method

3. Singular Value Decomposition methods

4. Least Squares imputation method

5. Fixed Rank Approximation method

In section 2 we will give some mathematical background that is needed to understand the contents of this thesis. In section 3 we will discuss the structure of sections 4 to 8. There we will explain how we will discuss every method and how we will test it. In section 9 we will compare all the methods based on their numerical results.

### 2 Mathematical background

In this section we will explain all the mathematical background that is needed to understand the contents of this thesis.

In this thesis we deal with real matrices. A real matrix is a rectangular array of real numbers. The size of a matrix is denoted by M × N, where M is the number of rows and N is the number of columns. For a matrix E, we denote the element in row i and column j by E_{ij}. In this thesis E is the expression matrix, where E_{ij} stands for the expression level of gene i in sample j. When E is of size M × N, this means that we deal with M genes and N samples. Let us denote row i of E by g_{i} and column j by c_{j}. In this way g_{i} gives the expression levels of the N samples for gene i, and c_{j} gives the expression levels of the M genes in sample j. So E can be written as

E = \begin{pmatrix} g_{1} \\ g_{2} \\ \vdots \\ g_{M} \end{pmatrix} = \begin{pmatrix} c_{1} & c_{2} & \cdots & c_{N} \end{pmatrix}.

Definition 1. Let E be an M × N matrix. The row space of E is the subspace of ℝ^{N} spanned by the row vectors of E. Similarly, the column space of E is the subspace of ℝ^{M} spanned by the column vectors of E.

Definition 2. Let L : V → W be a linear transformation and let S be a subspace of V. The image of S, denoted L(S), is defined by

L(S) = {w ∈ W | w = L(v) for some v ∈ S}.

The image of the entire vector space, L(V), is called the range of L.

Using definition 2 [8] we can think of a matrix E as the linear transformation mapping ℝ^{N} to ℝ^{M}. In that way we see that the column space of E is exactly the same as the range of E, R(E). In this thesis we will denote the column space of E by R(E). The row space of E will be denoted by R(E^{T}), because the column space of E^{T} is the same as the row space of E.

Definition 3. Let x_{1}, x_{2}, . . . , x_{n} be nonzero vectors in an inner product space V. If ⟨x_{i}, x_{j}⟩ = 0 whenever i ≠ j, then x_{1}, x_{2}, . . . , x_{n} is said to be an orthogonal set of vectors.

Definition 4. An orthonormal set of vectors is an orthogonal set of unit vectors, where unit vectors are vectors of length 1.

Definition 5. An M × M matrix U is called an orthogonal matrix if the column vectors of U form an orthonormal set in ℝ^{M}.

Definition 6. λ and x ≠ 0 are respectively an eigenvalue and an eigenvector of a matrix E if Ex = λx.

Theorem 1. If A is a real symmetric matrix, then the rank of A is the same as the number of nonzero eigenvalues (taking into account their algebraic multiplicities).[8]

Theorem 2 (The SVD Theorem). If E is an M × N matrix, then E can
be factorized into U ΣV^{T} where U is an M × M orthogonal matrix, V is an
N × N orthogonal matrix and Σ is an M × N matrix whose off-diagonal
entries are all zeros and whose diagonal elements satisfy σ_{1} ≥ σ_{2} ≥ . . . ≥
σ_{r} > σ_{r+1} = . . . = σ_{min(N,M )} = 0, where r denotes the rank of E. This
decomposition is called the singular value decomposition.

In our case E is the expression matrix were M denotes the number of
genes and N the number of samples. In general we can assume that the
number of genes is much greater than the number of samples, so we can
state M ≥ N . In that case we can say that the singular values satisfy
σ_{1} ≥ σ_{2} ≥ . . . ≥ σ_{r} > σ_{r+1} = . . . = σ_{N} = 0. This will be a standing assump-
tion throughout this thesis. Moreover we denote the rank of E by r.

The proof of this theorem falls outside the scope of this thesis. For a proof we refer to [8]. In the proof one constructs the matrices U, Σ and V^{T} to show that such a decomposition exists. Here we will only mention how to construct these matrices.

Let λ_{1}, λ_{2}, . . . , λ_{N} be the eigenvalues of E^{T}E, ordered such that λ_{1} ≥ λ_{2} ≥ . . . ≥ λ_{N} ≥ 0, and let v_{1}, v_{2}, . . . , v_{N} be corresponding orthonormal eigenvectors. Set V = (v_{1}, v_{2}, . . . , v_{N}) and σ_{j} = √λ_{j} for j = 1, . . . , N.

Next let Σ be the M × N matrix defined by

Σ = \begin{pmatrix} Σ_{1} & O \\ O & O \end{pmatrix} with Σ_{1} = \begin{pmatrix} σ_{1} & & & \\ & σ_{2} & & \\ & & \ddots & \\ & & & σ_{r} \end{pmatrix}, (4)

where r denotes the rank of E. Because E is of rank r, E^{T}E is also of
rank r. Moreover, E^{T}E is symmetric, so the number of nonzero eigenvalues
(and consequently its number of nonzero singular values) equals its rank, see
theorem 1. Now define

u_{j} = (1/σ_{j}) E v_{j} for j = 1, . . . , r. (5)

Let u_{r+1}, u_{r+2}, . . . , u_{M} be an orthonormal basis for N(E^{T}) and set

U = (u_{1}, u_{2}, . . . , u_{r}, u_{r+1}, . . . , u_{M}). (6)
So U is an M × M orthogonal matrix.

If we define U, Σ and V in this way, then

E = U ΣV^{T}. (7)
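The construction above can be checked numerically. The following NumPy sketch (our own illustration, with an arbitrary small example matrix) builds U, Σ and V from the eigen-decomposition of E^{T}E and verifies E = U ΣV^{T}:

```python
import numpy as np

# Small example matrix standing in for an expression matrix (M = 4 genes, N = 2 samples).
E = np.array([[2.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])
M, N = E.shape

# Eigen-decomposition of the symmetric matrix E^T E gives lambda_j and v_j.
lam, V = np.linalg.eigh(E.T @ E)            # eigh returns ascending eigenvalues
order = np.argsort(lam)[::-1]               # reorder: lambda_1 >= ... >= lambda_N
lam, V = lam[order], V[:, order]
sigma = np.sqrt(np.clip(lam, 0.0, None))    # singular values sigma_j = sqrt(lambda_j)
r = int(np.sum(sigma > 1e-12))              # rank of E

# u_j = (1/sigma_j) E v_j for j = 1, ..., r.
U = np.zeros((M, M))
for j in range(r):
    U[:, j] = E @ V[:, j] / sigma[j]

# Complete u_{r+1}, ..., u_M to an orthonormal basis of R^M by Gram-Schmidt
# on random vectors (these columns span N(E^T)).
rng = np.random.default_rng(0)
for j in range(r, M):
    u = rng.normal(size=M)
    for k in range(j):
        u -= (U[:, k] @ u) * U[:, k]
    U[:, j] = u / np.linalg.norm(u)

Sigma = np.zeros((M, N))
Sigma[:N, :N] = np.diag(sigma)
```

In practice one would of course call `np.linalg.svd(E)` directly; the sketch only mirrors the construction in the proof.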

We now list some properties of the singular value decomposition without proving them. For a proof we refer to [8].

1. The vj’s are called the right singular vectors of E, and the uj’s are called the left singular vectors of E.

2. If E has rank r, then

(a) v_{1}, . . . , v_{r} form an orthonormal basis for R(E^{T}).

(b) v_{r+1}, . . . , v_{N} form an orthonormal basis for N (E).

(c) u_{1}, . . . , u_{r} form an orthonormal basis for R(E).

(d) u_{r+1}, . . . , u_{M} form an orthonormal basis for N (E^{T}).

Theorem 3. If E is an M × N matrix, then there exists a unique N × M
matrix E^{†} that satisfies the following conditions:

1. EE^{†}E = E

2. E^{†}EE^{†} = E^{†}

3. (EE^{†})^{T} = EE^{†}

4. (E^{†}E)^{T} = E^{†}E

E^{†} is called the Moore-Penrose pseudoinverse of E and is computed as follows:

E^{†} = V \begin{pmatrix} Σ_{1}^{−1} & O \\ O & O \end{pmatrix} U^{T},

where U, V and Σ_{1} are the matrices coming from the SVD of E. [8]
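NumPy's `np.linalg.pinv` computes the Moore-Penrose pseudoinverse via the SVD. As an illustration (the example matrix is ours, not from the thesis), the four Penrose conditions can be verified numerically:

```python
import numpy as np

E = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])          # a 3 x 2 example matrix
E_dag = np.linalg.pinv(E)           # Moore-Penrose pseudoinverse, N x M

cond1 = np.allclose(E @ E_dag @ E, E)             # E E† E = E
cond2 = np.allclose(E_dag @ E @ E_dag, E_dag)     # E† E E† = E†
cond3 = np.allclose((E @ E_dag).T, E @ E_dag)     # (E E†)^T = E E†
cond4 = np.allclose((E_dag @ E).T, E_dag @ E)     # (E† E)^T = E† E
```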

### 3 Structure of the subsequent sections

In the next sections we will discuss the methods that we have mentioned in section 1.3. Before starting to discuss the first method we will explain how we set up each of these sections. In each section we will first discuss how the method works. We will discuss the mathematical principle and give an algorithm for the method.

Next, we will perform the method on a testing expression matrix E from a DNA microarray. This testing matrix is available online [11]. From the testing matrix we only use those genes (rows) and samples (columns) that contain all entries, such that we have no missing values. This matrix is what we call our full testing matrix E_{full}. Because we try to estimate missing values, we remove entries from E_{full} at random; the remaining matrix is denoted by E. In every method we try to estimate the missing values; the matrix we obtain is E_{estimate}. Because we also have E_{full}, we can compute the error between E_{full} and E_{estimate}. This error is the Frobenius norm of E_{full} − E_{estimate} divided by the mean of all elements of E_{full}, µ, and the square root of the number of rows of E, √M. We use this error because it is also used in the literature, see for example [3]. In this way we can easily compare our results to results from the literature. Denoting the error by e, we have

e = (1/(µ√M)) ‖E_{full} − E_{estimate}‖_{F}
  = (1/(µ√M)) √⟨E_{full} − E_{estimate}, E_{full} − E_{estimate}⟩
  = (1/(µ√M)) √(∑_{i=1}^{M} ∑_{j=1}^{N} ((E_{full})_{ij} − (E_{estimate})_{ij})^{2}).

Using this error we can compare the methods and show which method gives the best results, i.e. the smallest error. This will be done in section 9.
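The error e takes only a few lines of NumPy to compute. A minimal sketch (the function name `estimation_error` is ours; both matrices are assumed complete):

```python
import numpy as np

def estimation_error(E_full, E_estimate):
    """e = ||E_full - E_estimate||_F / (mu * sqrt(M)), with mu the mean of E_full."""
    M = E_full.shape[0]
    mu = E_full.mean()                  # mean of all elements of E_full
    return np.linalg.norm(E_full - E_estimate, 'fro') / (mu * np.sqrt(M))
```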

### 4 Zero replacement method

The zero replacement method is the first method we will discuss. The zero replacement method is a very easy method to understand, but does not give good results as we will see later on.

### 4.1 How does it work?

If elements of our expression matrix E are missing, this method tells us to place zeros at the positions of the missing elements. In this way the other elements of the expression matrix are not used, which is why this method does not give very good results.

### 4.2 Algorithm

The algorithm for this method is as follows:

1. Replace all the missing values in E by zeros; this gives E_{estimate}.
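Assuming missing entries are encoded as NaN, a one-step NumPy sketch of this method (the function name is ours) is:

```python
import numpy as np

def zero_replacement(E):
    """Replace every missing entry (encoded as NaN) by zero."""
    E_estimate = E.copy()
    E_estimate[np.isnan(E_estimate)] = 0.0
    return E_estimate
```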

### 4.3 Numerical results

If we run the MATLAB code in appendix B 5 times for 1%, 5%, 10%, 15% and 20% of missing values, the means of the errors e (see section 3) are the following:

| Percentage | e |
|---|---|
| 1% | 0.3920 |
| 5% | 0.8623 |
| 10% | 1.2417 |
| 15% | 1.5316 |
| 20% | 1.7673 |

### 5 Row and column sum mean method

The second and third methods to estimate the missing values are the Row and Column sum mean methods.

### 5.1 How does it work?

5.1.1 Row sum mean method

In this method missing values are estimated by the mean of present elements in the same row of the missing element. In this way an element is estimated by the expression levels of the same gene in other samples.

5.1.2 Column sum mean method

This method works almost the same as the Row sum mean method, how- ever now missing values are estimated by the mean of the elements in the same column. This means that an element is estimated by the mean of the expression levels of other genes in the same sample.

### 5.2 Algorithm

5.2.1 Row sum mean method

1. Replace missing values in E by the mean of the non-missing elements in the same row. So if E_{ij} is missing, replace E_{ij} by the mean of the non-missing elements in g_{i}. Doing this for all the missing elements gives E_{estimate}.

5.2.2 Column sum mean method

1. Replace missing values in E by the mean of the non-missing elements in the same column. So if E_{ij} is missing, replace E_{ij} by the mean of the non-missing elements in c_{j}. Doing this for all the missing elements gives E_{estimate}.
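Both algorithms can be sketched with `np.nanmean` (our encoding: missing entries are NaN; the function names are ours):

```python
import numpy as np

def row_mean_imputation(E):
    """Replace each missing entry (NaN) by the mean of the present entries in its row."""
    E_estimate = E.copy()
    row_means = np.nanmean(E_estimate, axis=1)   # per-row mean, ignoring NaN
    idx = np.where(np.isnan(E_estimate))
    E_estimate[idx] = row_means[idx[0]]
    return E_estimate

def column_mean_imputation(E):
    """Replace each missing entry (NaN) by the mean of the present entries in its column."""
    E_estimate = E.copy()
    col_means = np.nanmean(E_estimate, axis=0)   # per-column mean, ignoring NaN
    idx = np.where(np.isnan(E_estimate))
    E_estimate[idx] = col_means[idx[1]]
    return E_estimate
```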

### 5.3 Numerical results

5.3.1 Row sum mean method

If we run the MATLAB code in appendix C.1 5 times for 1%, 5%, 10%, 15% and 20% of missing values, the means of the errors e (see section 3) are the following:

| Percentage | e |
|---|---|
| 1% | 0.1110 |
| 5% | 0.2636 |
| 10% | 0.4140 |
| 15% | 0.4633 |
| 20% | 0.5016 |

5.3.2 Column sum mean method

If we run the MATLAB code in appendix C.2 5 times for 1%, 5%, 10%, 15% and 20% of missing values, the means of the errors e (see section 3) are the following:

| Percentage | e |
|---|---|
| 1% | 0.1096 |
| 5% | 0.2583 |
| 10% | 0.3590 |
| 15% | 0.4353 |
| 20% | 0.5472 |

### 6 Singular Value Decomposition methods

In section 2 we have discussed the basic principles of the Singular Value Decomposition. In the remaining part of this thesis this decomposition will be abbreviated as the SVD. In this section we will discuss the application of the SVD to the estimation of missing values in an expression matrix E. In the first part of this section we will give some general information about the application of the SVD to estimating missing values of our matrix E. In the second and third parts we will discuss two methods, both using the SVD, for estimating the missing values in the matrix E. The first method is discussed in [3] and the second in [13].

### 6.1 SVD and Expression matrix

In section 2 we have discussed the SVD. We have shown that our M × N
expression matrix E (with M ≥ N ) can be factorized as E = U ΣV^{T} with U
and V orthogonal matrices and Σ a matrix whose off-diagonal entries are all
zero and whose diagonal elements satisfy σ_{1} ≥ σ_{2} ≥ . . . ≥ σ_{r} > σ_{r+1} = . . . =
σ_{N} = 0 (with r the rank of E).

In the expression matrix E we assume that some elements are missing and we want to estimate these elements. However, in order to compute the SVD of the matrix, the matrix should contain all of its elements. That is why we have to assign specific numbers to the missing values before we can compute the SVD. We can choose to replace the missing values with zero (according to the zero replacement method, section 4). We could also choose to fill the missing entries with the mean of the rows or the columns (see section 5). Values computed from another method could also be used. Having done that, we can compute the SVD of E.

If E = U ΣV^{T}, we denote the columns of U as u_{1}, u_{2}, . . . , u_{M} (the left singular
vectors) and the columns of V as v_{1}, v_{2}, . . . , v_{N} (the right singular vectors).

We know that if the matrix E has rank r, then σ_{i} = 0 for i = r + 1, . . . , N .

Using these properties, we can write E = U ΣV^{T} as

E = U ΣV^{T}
  = σ_{1}u_{1}v_{1}^{T} + σ_{2}u_{2}v_{2}^{T} + . . . + σ_{r}u_{r}v_{r}^{T} + σ_{r+1}u_{r+1}v_{r+1}^{T} + . . . + σ_{N}u_{N}v_{N}^{T}
  = σ_{1}u_{1}v_{1}^{T} + σ_{2}u_{2}v_{2}^{T} + . . . + σ_{r}u_{r}v_{r}^{T}
  = ∑_{i=1}^{r} σ_{i}u_{i}v_{i}^{T}.
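This rank-r expansion is easy to confirm numerically. The sketch below (our own generic random example) compares E with the sum of its r rank-one terms:

```python
import numpy as np

rng = np.random.default_rng(1)
E = rng.normal(size=(6, 3)) @ rng.normal(size=(3, 3))   # generic 6 x 3 matrix
U, s, Vt = np.linalg.svd(E)                             # E = U diag(s) V^T
r = int(np.sum(s > 1e-12 * s[0]))                       # numerical rank

# Sum of the first r rank-one terms sigma_i u_i v_i^T.
E_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(r))
```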

In this way we see that E can be written as U_{1}Σ_{1}V_{1}^{T} with

U_{1} = (u_{1}, u_{2}, . . . , u_{r}), Σ_{1} = \begin{pmatrix} σ_{1} & & & \\ & σ_{2} & & \\ & & \ddots & \\ & & & σ_{r} \end{pmatrix} and V_{1} = (v_{1}, v_{2}, . . . , v_{r}).

In section 2 we have mentioned that u_{1}, u_{2}, . . . , u_{r} form an orthonormal basis for R(E), the column space of E. E is our expression matrix, and in this matrix column i stands for the expression levels of all M genes in sample i. From this perspective we can say that u_{1}, u_{2}, . . . , u_{r} form an orthonormal basis for the subspace of ℝ^{M} spanned by the columns of E. Similarly, v_{1}^{T}, v_{2}^{T}, . . . , v_{r}^{T} form an orthonormal basis for the subspace of ℝ^{1×N} spanned by the rows of E. The vectors v_{1}^{T}, v_{2}^{T}, . . . , v_{r}^{T} are called eigengenes and the vectors u_{1}, u_{2}, . . . , u_{r} are called eigensamples. We denote by s the number of significant eigengenes. The corresponding eigensamples and singular values are called respectively significant eigensamples and significant singular values. There are several methods to calculate s. Three commonly used methods are

1. the fraction criterion,
2. scree plots for σ_{q}^{2},
3. the broken stick model.

Here we will only discuss the first method. For the other methods we refer to [5]. The fraction criterion works as follows. For all r eigengenes we compute the fraction p_{q}:

p_{q} = σ_{q}^{2} / ∑_{j=1}^{r} σ_{j}^{2}, where q = 1, . . . , r.

In this way p_{q} is a measure for the fraction of the expression level contributed by the q^{th} eigengene [3]. Once p_{q} is computed for q = 1, 2, . . . , r, we know that

p_{q_1} ≥ p_{q_2} ≥ . . . ≥ p_{q_r}.

The s significant eigengenes are the s eigengenes for which

∑_{k=1}^{s} p_{q_k} ≥ 0.7, and moreover ∑_{k=1}^{s−1} p_{q_k} < 0.7.

These significant eigengenes will be used to estimate the missing values in the two methods that follow in the next part of this section.
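A minimal sketch of the fraction criterion (the function name is ours; `sigma` holds the nonzero singular values in decreasing order, and the 0.7 threshold follows the text):

```python
import numpy as np

def significant_count(sigma, threshold=0.7):
    """Number s of significant eigengenes by the fraction criterion.

    Returns the smallest s with p_{q_1} + ... + p_{q_s} >= threshold,
    where p_q = sigma_q^2 / sum_j sigma_j^2.
    """
    p = np.asarray(sigma, dtype=float) ** 2
    p = p / p.sum()                      # fractions p_q, already decreasing
    cumulative = np.cumsum(p)
    return int(np.searchsorted(cumulative, threshold) + 1)
```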

### 6.2 How does it work?

6.2.1 Estimation method 1 using SVD

This method comes from [3] and in our opinion it is unclear why this method would work. For that reason we will only give the algorithm and will not explain it.

6.2.2 Estimation method 2 using SVD

In this method we use a property of the SVD mentioned in section 2. There it was claimed that if we decompose E into U ΣV^{T}, then v_{1}, v_{2}, . . . , v_{r} form an orthonormal basis for R(E^{T}). In our case row i gives the expression levels of gene i for the N samples. Consequently the (row) vectors v_{1}^{T}, v_{2}^{T}, . . . , v_{r}^{T} form an orthonormal basis for the subspace of ℝ^{N} spanned by the rows of E. For this reason every row g_{k} of E can be written as a linear combination of v_{1}^{T}, v_{2}^{T}, . . . , v_{r}^{T}. This means that we can find α_{1}, α_{2}, . . . , α_{r} such that

g_{k} = α_{1}v_{1}^{T} + α_{2}v_{2}^{T} + . . . + α_{r}v_{r}^{T}.

In this method we will use only those elements of g_{k} that are known, together with the corresponding elements of v_{1}^{T}, v_{2}^{T}, . . . , v_{r}^{T}, to find α_{1}, α_{2}, . . . , α_{r}. These coefficients will then be used to estimate the missing value(s) in g_{k} using the corresponding elements of v_{1}^{T}, v_{2}^{T}, . . . , v_{r}^{T}. This will be done for all the missing elements of E. The whole procedure is repeated until there is almost no change in E_{estimate}.

### 6.3 Algorithm

6.3.1 Estimation method 1 using SVD

1. Replace the missing values with zeros or with values from another method, as discussed in section 6.1. We call this matrix E_{0}.

2. Calculate the matrices U , Σ and V used in the SVD for the matrix E_{t}
(starting with t = 0).

3. Calculate the fraction p_{q} for q = 1, . . . , r and find in this way the s significant singular values.

When the difference in Frobenius norm between two consecutive iterates E_{p} and E_{p+1} is smaller than a certain number, for example 0.01, we stop the procedure.

The difference in Frobenius norm is calculated as follows:

‖E_{p+1} − E_{p}‖_{F} = √⟨E_{p+1} − E_{p}, E_{p+1} − E_{p}⟩ = √(∑_{i=1}^{M} ∑_{j=1}^{N} ((E_{p+1})_{ij} − (E_{p})_{ij})^{2}).

We could also use different stopping criteria, but in this thesis we will only use the Frobenius norm.
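The stopping test itself is a one-liner in NumPy (a sketch; the default threshold 0.01 follows the example above, and the function name is ours):

```python
import numpy as np

def converged(E_next, E_prev, tol=0.01):
    """True when the Frobenius norm of the change falls below tol."""
    return np.linalg.norm(E_next - E_prev, 'fro') < tol
```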

6.3.2 Estimation method 2 using SVD

1. Replace the missing values with zeros or with values from another method, as discussed in section 6.1. We call this matrix E_{0}.

2. Calculate the matrix V as used in the SVD for the matrix E_{t} (starting
with t = 0).

3. Extract the vectors v_{1}, v_{2}, . . . , v_{r} from V where r denotes the rank of
E_{t}.

4. For each missing value, write the row in which the missing value occurs as a linear combination of v_{1}^{T}, v_{2}^{T}, . . . , v_{r}^{T}. So if E_{kq} is missing, let g_{k}' be row g_{k} with the q^{th} element removed, and let v_{1}'^{T}, v_{2}'^{T}, . . . , v_{r}'^{T} be the vectors v_{1}^{T}, v_{2}^{T}, . . . , v_{r}^{T} with the q^{th} element removed from every vector. Then we write row g_{k}' as a linear combination of v_{1}'^{T}, v_{2}'^{T}, . . . , v_{r}'^{T}, i.e.

g_{k}' = α_{1}v_{1}'^{T} + α_{2}v_{2}'^{T} + . . . + α_{r}v_{r}'^{T}.

5. Using the coefficients α_{1}, α_{2}, . . . , α_{r}, estimate the q^{th} entry of g_{k} using the q^{th} entries of v_{1}^{T}, v_{2}^{T}, . . . , v_{r}^{T}.

6. Steps 4 and 5 should be done for all missing entries in E_{t}.

7. Replace the missing values in E_{t} by the estimated values computed in steps 4, 5 and 6; the resulting matrix is E_{t+1}.

8. Repeat this procedure until the matrix does not change anymore. Here we again use the Frobenius norm to compare E_{t+1} with E_{t}.

9. The final matrix E_{t+1} is our E_{estimate}.
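The steps above can be sketched in NumPy as follows. This is our reading of the algorithm, not the thesis code: missing entries are NaN, they are initialised with row means (section 5), the linear combination in step 4 is found as a least squares fit over the observed entries, and the loop stops when the Frobenius change falls below `tol`:

```python
import numpy as np

def svd_impute(E, rank=None, tol=1e-4, max_iter=100):
    """Iteratively regress each incomplete row onto the leading right
    singular vectors, using only its observed entries (NaN = missing)."""
    missing = np.isnan(E)
    Et = E.copy()
    row_means = np.nanmean(E, axis=1)
    Et[missing] = row_means[np.where(missing)[0]]      # initial guess

    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(Et, full_matrices=False)
        r = rank if rank is not None else int(np.sum(s > 1e-12 * s[0]))
        E_next = Et.copy()
        for k in np.unique(np.where(missing)[0]):      # rows with missing entries
            obs = ~missing[k]                          # observed positions in row k
            # Least squares fit of the observed part of g_k onto v_1^T, ..., v_r^T.
            alpha, *_ = np.linalg.lstsq(Vt[:r, obs].T, Et[k, obs], rcond=None)
            # Use the coefficients at the missing positions.
            E_next[k, missing[k]] = alpha @ Vt[:r, missing[k]]
        if np.linalg.norm(E_next - Et, 'fro') < tol:
            Et = E_next
            break
        Et = E_next
    return Et
```

On an exactly rank-one matrix with one hidden entry, this sketch recovers the entry almost exactly when `rank=1`.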

### 6.4 Numerical results

6.4.1 Estimation method 1 using SVD

If we run the MATLAB code in appendix D.1 5 times for 1%, 5%, 10%, 15% and 20% of missing values, the means of the errors e (see section 3) are the following:

| Percentage | e |
|---|---|
| 1% | 0.1243 |
| 5% | 0.2478 |
| 10% | 0.3713 |
| 15% | 0.4567 |
| 20% | 0.6825 |

6.4.2 Estimation method 2 using SVD

If we run the MATLAB code in appendix D.2 5 times for 1%, 5%, 10%, 15% and 20% of missing values, the means of the errors e (see section 3) are the following:

| Percentage | e |
|---|---|
| 1% | 0.6975 |
| 5% | 1.5759 |
| 10% | 1.9243 |
| 15% | 2.2972 |
| 20% | 2.4120 |

### 7 Least Squares imputation method

In this section we describe the method of least squares imputation, abbreviated as LSimputation. We follow the approach in [6]. In this method we can distinguish two steps. In the first step we select t rows (genes) that are very similar to the row in which a value (or values) is (or are) missing, say row g_{k}. In the second step we regress g_{k} against these t rows and use the coefficients of regression to estimate the missing value(s). To determine these coefficients of regression, we use a method based on least squares. Before performing step one, we have to find the number t. How t is found is described in section 7.3.

### 7.1 Selecting t genes

Here we will describe how we select the t genes mentioned above (assuming that t is known). In the literature different methods are described to find these t genes. Here we describe two: the first uses the so-called Pearson correlation coefficient (PCC) and the second uses Euclidean distance. First, we describe the method using the PCC, as proposed in [6].

Suppose g_{k} has a missing value in its q^{th} entry. Then set

g_{k}' = (g_{k1}, g_{k2}, . . . , g_{k(q−1)}, g_{k(q+1)}, . . . , g_{kN}).

Thus g_{k}' is the row vector g_{k} with the missing value g_{kq} deleted. If more values in this gene are missing, we also delete these values. In the following we assume that only one value is missing, say g_{kq}. If we want to compute the PCC between g_{k} and g_{s}, we have to set up the vector g_{s}' by deleting its q^{th} entry. We assume that g_{s} does not have a missing value in its q^{th} entry.

The PCC r_{ks} between g_{k} and g_{s} is defined as

r_{ks} = (1/(N−1)) ∑_{w=1}^{N−1} ((g_{kw}' − ḡ_{k}')/σ_{k}) ((g_{sw}' − ḡ_{s}')/σ_{s}),

where ḡ_{k}' is the average of the values in g_{k}', ḡ_{s}' the average of the values in g_{s}', and σ_{k} and σ_{s} respectively the standard deviations of the elements of g_{k}' and g_{s}'.
The t genes with the largest absolute PCC with respect to g_{k} are the genes that are most similar to g_{k}. Thus if we denote these t genes by g_{s_i}, where i = 1, . . . , t, then

|r_{ks_1}| ≥ |r_{ks_2}| ≥ . . . ≥ |r_{ks_t}| ≥ |r_{ks_{t+1}}| ≥ . . . ≥ |r_{ks_{M−1}}|.

The second method to find the t genes uses the Euclidean distance; this method is proposed in [9]. In this method the distance between two genes is defined as

d_{E}(g_{i}, g_{j}) = √(∑_{k=1}^{N} (g_{ik}' − g_{jk}')^{2}).

The t genes for which the Euclidean distance to g_{k} is smallest are selected as the t most similar genes, i.e.

d_{E}(g_{k}, g_{s_1}) ≤ d_{E}(g_{k}, g_{s_2}) ≤ . . . ≤ d_{E}(g_{k}, g_{s_t}) ≤ d_{E}(g_{k}, g_{s_{t+1}}) ≤ . . . ≤ d_{E}(g_{k}, g_{s_M}).

It is known [9] that the Euclidean distance is very sensitive to outliers. This problem does not occur with the PCC; therefore we will use the PCC to select the t genes.
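The PCC-based selection can be sketched as follows (the function name is ours; rows of `E` are genes, `q` is the column with the missing value in row `k`, and `np.corrcoef` computes the Pearson correlation):

```python
import numpy as np

def select_similar_genes(E, k, q, t):
    """Indices of the t genes with the largest |PCC| relative to row k,
    comparing only the entries outside the missing column q."""
    mask = np.ones(E.shape[1], dtype=bool)
    mask[q] = False                        # drop the missing column everywhere
    gk = E[k, mask]
    scores = []
    for s in range(E.shape[0]):
        if s == k:
            continue
        r_ks = np.corrcoef(gk, E[s, mask])[0, 1]   # Pearson correlation coefficient
        scores.append((abs(r_ks), s))
    scores.sort(reverse=True)              # largest |r_ks| first
    return [s for _, s in scores[:t]]
```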

### 7.2 Regressing with LSimputation

After we have obtained g_{s_1}, g_{s_2}, . . . , g_{s_t}, we want to estimate the missing value g_{kq}. We will estimate g_{kq} using linear regression of g_{k} against g_{s_1}, g_{s_2}, . . . , g_{s_t}. Having done this, we will use the coefficients of regression to estimate g_{kq} using the q^{th} elements of these t genes. Now set

b = ((g_{s_1})_{q}, (g_{s_2})_{q}, . . . , (g_{s_t})_{q}),

then we want to find α_{1}, α_{2}, . . . , α_{t} to estimate g_{kq} in the following way:

g_{kq} = ∑_{p=1}^{t} α_{p}(g_{s_p})_{q} = ba, where a = (α_{1}, α_{2}, . . . , α_{t})^{T}.

Now we set A to be the t × (N − 1) matrix whose i^{th} row is g_{s_i}', and w = (g_{k}')^{T}; we determine a as the least squares solution of A^{T}a = w. As explained before, we will then estimate the missing value as

g_{kq} = ba.

The solution to this problem is

a = (A^{T})^{†}w,

where (A^{T})^{†} is the Moore-Penrose pseudoinverse of A^{T} (see section 2). In
[8] it is proven that this is indeed the solution to this least squares problem.

Thus we estimate the missing value as

g_{kq} = ba = b(A^{T})^{†}w.

If g_{k} has more than one missing value, then we have to adapt the algorithm. In that case we do not create a vector b, but a matrix B ∈ ℝ^{t×z} (assuming z values are missing), where the t row vectors of B are formed by the g_{s_i} for i = 1, . . . , t, using only the elements corresponding to the elements in g_{k} that are missing. If, for example, the third, fifth and tenth elements are missing, then we only use the third, fifth and tenth elements of the g_{s_i} to create the matrix B. The matrix A ∈ ℝ^{t×(N−z)} and the vector w ∈ ℝ^{(N−z)×1} are formed in almost the same way as before, except that we take into account that more values are missing. Then the vector u of the z missing values is estimated as follows:

u = B^{T}a = B^{T}(A^{T})^{†}w.

This procedure has to be performed for all genes containing missing values.

In this way we get E_{estimate}.
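The regression step can be sketched as follows (the function name is ours; `similar_rows` stacks g_{s_1}, . . . , g_{s_t} as rows, and the missing entries of `gk` are NaN at `missing_cols`):

```python
import numpy as np

def ls_impute_row(gk, similar_rows, missing_cols):
    """Estimate the missing entries of row gk by regressing its observed part
    onto the t selected rows: u = B^T (A^T)^dagger w."""
    N = gk.shape[0]
    obs = np.ones(N, dtype=bool)
    obs[missing_cols] = False
    A = similar_rows[:, obs]            # t x (N - z): observed columns
    w = gk[obs]                         # observed part of g_k
    B = similar_rows[:, ~obs]           # t x z: entries at the missing columns
    a = np.linalg.pinv(A.T) @ w         # a = (A^T)^dagger w
    return B.T @ a                      # estimates for the z missing entries
```

When the row really is a linear combination of the selected genes, the estimate is exact, which is the rationale of the method.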

### 7.3 Finding t

Before we can perform the algorithm described in section 7.1, we have to
find the number t. This number is not the same for every gene, because it
depends on the biological function of this gene. Moreover, the samples we
use in the DNA microarray also influence t. Therefore it is very difficult
to find an algorithm that finds the best value for t. In [6] an algorithm for
finding t is proposed. This algorithm finds t using the elements in g_{k} that
are known. The algorithm works as follows. In section 7.2 we described how
we set up the matrices A and B and the vector w to compute the vector a
with the regression coefficients. To find t, we also use these matrices and
vectors. The vector w contains the elements of g_{k} that are known. Suppose
we pretend that w_{1} is also missing, so that we have z + 1 missing values,
and we want to estimate w_{1} using the algorithm described in section 7.2. Now we
can estimate w_{1} using different values of t. In this way we can compare the
estimated value with the true value of w_{1}. Then we choose the value of t that
gives the best approximation for w_{1}, i.e. the value for which
|(w_{1})_{true} − (w_{1})_{approximation}| is as small as possible. This procedure could be repeated for several known values,
and finally we can choose the value of t that gives the best approximation.

### 7.4 Algorithm

1. Find the rows g_{k} in which values are missing.

2. Find t using the method proposed in section 7.3.

3. For each row g_{k} in which values are missing, find the t most similar
rows using PCC, as described in section 7.1.

4. Create the matrices A and B (or vector b if only one value is missing) as described in section 7.2. Also create the vector w.

5. The vector u with the z missing values is calculated as u = B^{T}a =
B^{T}(A^{T})^{†}w.

6. Replace the missing values in E by the values u_{1}, u_{2}, . . . , u_{z} to get
E_{estimate}.

### 7.5 Numerical results

If we run the matlab code in appendix E 5 times for 1%, 5%, 10%, 15%
and 20% of missing values, the means of the errors e (see section 3) are the following

Percentage e
1% 0.1461
5% 0.4163
10% 0.6585
15% 0.8785
20% 1.0643

### 8 Fixed Rank Approximation method

In this section we will explain a method called the Fixed Rank Approximation method, which we will abbreviate as FRAM. FRAM makes use of the SVD, as discussed in section 2. In this section we will first explain the idea behind FRAM, and then we will give an algorithm to perform this method. This method comes from [3] and [12].

### 8.1 How does it work?

In section 2 we have discussed the SVD. We have mentioned that every matrix
E can be decomposed as E = UΣV^{T}, with U and V orthogonal matrices
and Σ a matrix whose off-diagonal entries are all zeros and whose diagonal
entries satisfy σ_{1} ≥ σ_{2} ≥ . . . ≥ σ_{r} > σ_{r+1} = . . . = σ_{n} = 0; the σ_{i}'s are called
the singular values of E.

We have also explained that the number of nonzero singular values is equal
to the rank of the matrix E. Suppose that our expression matrix E has
missing values. Then we can remove all rows of E containing missing values;
we call the remaining matrix E′. Let l′ denote the number of significant singular
values of E′ (see section 6.1 for an explanation of significant singular
values). The idea behind FRAM is to find a matrix E_{estimate} of E such that
the number of significant singular values of E_{estimate} is still l′.

We will explain why this assumption leads to reasonable estimations of the
missing values. The number of nonzero singular values of E′ is equal to the
rank of E′. However, the singular values also tell us how ’close’ E′ is to a
matrix of lower rank. If the number of significant singular values is l′, then
we can say that E′ is close to a matrix of rank l′, i.e. a matrix with l′ independent
rows. We explained that FRAM is a method that tries to find
a matrix E_{estimate} of E with still l′ significant singular values. In this way E_{estimate} is
also close to a matrix with rank l′ and consequently l′ independent rows. We
could also say that the dimension of the row space of E_{estimate} is almost equal
to l′. In that way the completed rows of E_{estimate}, the rows that contained
missing values, are near to being dependent on the full rows, the rows of E′.
If we want the number of significant singular values to be equal to l′, then we
have to find E_{estimate} such that σ_{l′+1}, σ_{l′+2}, . . . , σ_{r} (singular values of E_{estimate})
are minimal, where r is the rank of E_{estimate}.
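The claim that small trailing singular values mean the matrix is close to one of lower rank can be made precise by the Eckart-Young theorem: the Frobenius distance from E to its best rank-l approximation equals the square root of the sum of the discarded σ_{k}^{2}. A small numpy check (our own illustration, not from the thesis):

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.standard_normal((8, 5))
U, s, Vt = np.linalg.svd(E, full_matrices=False)
l = 2
# best rank-l approximation of E, built from the l largest singular values
El = (U[:, :l] * s[:l]) @ Vt[:l, :]
# its Frobenius distance to E equals sqrt(sum of the remaining sigma_k^2)
assert np.isclose(np.linalg.norm(E - El, 'fro'), np.sqrt(np.sum(s[l:]**2)))
```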

So we want to find a matrix E_{estimate} such that σ_{l′+1}, σ_{l′+2}, . . . , σ_{r} are as
small as possible. We can also say that ∑_{k=l′+1}^{N} σ_{k}^{2} should be minimized.
From [3] we have an algorithm to find E_{estimate} where ∑_{k=l′+1}^{N} σ_{k}^{2} is minimized.
The algorithm works iteratively. In every step ∑_{k=l′+1}^{N} σ_{k}^{2} becomes
smaller, so we need a stopping criterion to stop the algorithm. Here
we again use the Frobenius norm as stopping criterion, see section 6.3.1. Now
we will explain the algorithm and show that

∑_{k=l′+1}^{N} σ_{k,E_{p}}^{2} ≥ ∑_{k=l′+1}^{N} σ_{k,E_{p+1}}^{2}.

We will follow the approach of [3] and [12]. Because we need to compute the
SVD of E, we have to assign specific numbers to the empty entries. These
numbers could be zeros (according to section 4) or numbers coming from
another method; the resulting matrix is called E_{0}.

Let E_{p+1} = E_{p} + X_{p+1}, with X_{p+1} = (x_{ij})^{M,N}_{i,j=1}, where x_{ij} = 0 if element
(i, j) is known and x_{ij} is a free variable if (i, j) is missing in E.

Let x = (x_{i_1 j_1}, x_{i_2 j_2}, . . . , x_{i_z j_z})^{T} denote the z × 1 vector whose entries are
indexed by the coordinates of the missing values in E. In this way there exists
a z × z real valued symmetric nonnegative matrix B_{p} such that

x^{T} B_{p} x = ∑_{k=l′+1}^{N} v_{p,k}^{T} X_{p+1}^{T} X_{p+1} v_{p,k},

where v_{p,k} is the eigenvector of A_{p} = E_{p}^{T} E_{p} corresponding to λ_{k,A_{p}}, and
F(i, j) denotes the M × N matrix with a 1 in entry (i, j) and zeros elsewhere.
The (s, t) entry of B_{p} is given by

(B_{p})_{s,t} = (1/2) ∑_{k=l′+1}^{N} v_{p,k}^{T} (F(i_{s}, j_{s})^{T} F(i_{t}, j_{t}) + F(i_{t}, j_{t})^{T} F(i_{s}, j_{s})) v_{p,k},  where s, t = 1, . . . , z.

Now let A_{p} = E_{p}^{T} E_{p}. Then

∑_{k=l′+1}^{N} v_{p,k}^{T} E_{p+1}^{T} E_{p+1} v_{p,k} = ∑_{k=l′+1}^{N} v_{p,k}^{T} (E_{p} + X_{p+1})^{T} (E_{p} + X_{p+1}) v_{p,k}

= ∑_{k=l′+1}^{N} v_{p,k}^{T} (E_{p}^{T} E_{p} + E_{p}^{T} X_{p+1} + X_{p+1}^{T} E_{p} + X_{p+1}^{T} X_{p+1}) v_{p,k}

= ∑_{k=l′+1}^{N} v_{p,k}^{T} (A_{p} + E_{p}^{T} X_{p+1} + X_{p+1}^{T} E_{p} + X_{p+1}^{T} X_{p+1}) v_{p,k}

= ∑_{k=l′+1}^{N} λ_{k,A_{p}} + x^{T} B_{p} x + 2 w_{p}^{T} x,
where

w_{p} = (w_{p,1}, w_{p,2}, . . . , w_{p,z})^{T}

and

w_{p,t} = ∑_{k=l′+1}^{N} v_{p,k}^{T} E_{p}^{T} F(i_{t}, j_{t}) v_{p,k},  t = 1, . . . , z.

In the above, λ_{k,A_{p}} is the k-th eigenvalue of A_{p}, where λ_{1} ≥ λ_{2} ≥ . . . ≥ λ_{r}. We know
that ∑_{k=l′+1}^{N} v_{p,k}^{T} E_{p+1}^{T} E_{p+1} v_{p,k} is always nonnegative. If we want to minimize
this expression, we have to minimize x^{T} B_{p} x + 2 w_{p}^{T} x. Differentiating this
expression gives 2 x^{T} B_{p} + 2 w_{p}^{T}. Setting this equal to zero gives that the
minimum is achieved when

B_{p} x_{p+1} = −w_{p}.

In [3] it is proved that B_{p} is always nonsingular; consequently this equation
can be solved easily. The vector x_{p+1} gives the entries for X_{p+1}. Because λ_{k,A_{p}}
is the k-th eigenvalue of A_{p}, we have

σ_{k,E_{p}}^{2} = λ_{k,A_{p}}.

From this we see that

A_{p} v_{p,k} = λ_{k,A_{p}} v_{p,k}

and consequently

v_{p,k}^{T} A_{p} v_{p,k} = λ_{k,A_{p}} = σ_{k,E_{p}}^{2}.
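The identity σ_{k,E_{p}}^{2} = λ_{k,A_{p}} is easy to verify numerically (a small numpy check of our own, not part of the thesis code):

```python
import numpy as np

rng = np.random.default_rng(0)
Ep = rng.standard_normal((6, 4))
sigma = np.linalg.svd(Ep, compute_uv=False)         # singular values, descending
lam = np.sort(np.linalg.eigvalsh(Ep.T @ Ep))[::-1]  # eigenvalues of A_p = Ep^T Ep
assert np.allclose(sigma**2, lam)                   # sigma_k^2 = lambda_k
```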

Using this and the fact that E_{p+1} = E_{p} + X_{p+1}, we have

∑_{k=l′+1}^{N} σ_{k,E_{p}}^{2} = ∑_{k=l′+1}^{N} v_{p,k}^{T} A_{p} v_{p,k}

= ∑_{k=l′+1}^{N} v_{p,k}^{T} (E_{p} + 0)^{T} (E_{p} + 0) v_{p,k}

≥ ∑_{k=l′+1}^{N} v_{p,k}^{T} (E_{p} + X_{p+1})^{T} (E_{p} + X_{p+1}) v_{p,k}

= ∑_{k=l′+1}^{N} v_{p,k}^{T} A_{p+1} v_{p,k}

≥ ∑_{k=l′+1}^{N} λ_{k,A_{p+1}}

= ∑_{k=l′+1}^{N} σ_{k,E_{p+1}}^{2}.

The first ≥ sign comes from the fact that x_{p+1} minimizes x^{T} B_{p} x + 2 w_{p}^{T} x: the
choice X_{p+1} = 0 corresponds to x = 0, whose value cannot be smaller than the
minimum. The second ≥ sign comes from the Ky Fan characterization [4]: the sum
∑_{k=l′+1}^{N} v^{T} A_{p+1} v over any orthonormal set of N − l′ vectors v is bounded
below by the sum of the N − l′ smallest eigenvalues of A_{p+1}. This characterization
will not be explained in this thesis; therefore we refer to [4]. So we
see that indeed in every iteration ∑_{k=l′+1}^{N} σ_{k,E_{p}}^{2} becomes smaller.

In the next section we will summarize this algorithm step by step.
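One iteration of the update can be sketched in numpy. This is our own illustration of the formulas above, not the thesis code; F(i, j) is the elementary matrix with a single 1, and we rely on [3]'s result that B_{p} is nonsingular:

```python
import numpy as np

def fram_step(Ep, missing, l_sig):
    """One FRAM iteration: build B_p and w_p, solve B_p x = -w_p,
    and return E_{p+1} = E_p + X_{p+1}."""
    M, N = Ep.shape
    Ap = Ep.T @ Ep
    lam, V = np.linalg.eigh(Ap)              # eigenvalues in ascending order
    V = V[:, np.argsort(lam)[::-1]]          # reorder eigenvectors: descending
    vks = V[:, l_sig:]                       # v_{p,k} for k = l' + 1, ..., N
    z = len(missing)
    def F(i, j):                             # elementary matrix, single 1 at (i, j)
        X = np.zeros((M, N)); X[i, j] = 1.0; return X
    Fs = [F(i, j) for (i, j) in missing]
    B = np.zeros((z, z)); w = np.zeros(z)
    for s in range(z):
        w[s] = sum(v @ Ep.T @ Fs[s] @ v for v in vks.T)
        for t in range(z):
            B[s, t] = 0.5 * sum(v @ (Fs[s].T @ Fs[t] + Fs[t].T @ Fs[s]) @ v
                                for v in vks.T)
    x = np.linalg.solve(B, -w)               # B_p is nonsingular by [3]
    X = np.zeros((M, N))
    for (i, j), xi in zip(missing, x):
        X[i, j] = xi
    return Ep + X

def tail(E, l_sig):                          # sum of sigma_k^2 for k > l'
    s = np.linalg.svd(E, compute_uv=False)
    return float(np.sum(s[l_sig:]**2))

rng = np.random.default_rng(0)
E0 = rng.standard_normal((6, 4))
missing = [(0, 1), (2, 3)]
for (i, j) in missing:
    E0[i, j] = 0.0                           # initial guess for the missing entries
E1 = fram_step(E0, missing, l_sig=2)
assert tail(E1, 2) <= tail(E0, 2) + 1e-9     # the tail sum never increases
```

Iterating `fram_step` until the Frobenius norm of the change is small gives the stopping rule described above.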

### 8.2 The algorithm

1. Compute l′, the number of significant singular values of E′.

2. Find the missing values of E and let x = (x_{i_1 j_1}, x_{i_2 j_2}, . . . , x_{i_z j_z})^{T} denote
the vector of these z missing entries.

3. Assign initial values to the missing entries (for example zeros, or estimates
from another method); this gives E_{0}. Set p = 0.

4. Compute A_{p} = E_{p}^{T} E_{p} and its eigenvectors v_{p,k} for k = l′ + 1, . . . , N.

5. Compute the matrix B_{p} and the vector w_{p} as described in section 8.1.

6. Solve B_{p} x_{p+1} = −w_{p}.

7. Let E_{p+1} = E_{p} + X_{p+1}, where x_{p+1} contains the entries for X_{p+1}, as
explained in section 8.1.

8. Repeat this procedure (from step 4) until E_{p+1} does not change anymore
(according to the Frobenius norm). The final matrix E_{p+1} is
E_{estimate}.

### 8.3 Numerical results

If we run the matlab code in appendix F 5 times for 1%, 5%, 10%, 15%
and 20% of missing values, the means of the errors e (see section 3) are the following

Percentage e
1% 0.1540
5% 0.3241
10% 0.4875
15% 0.5184
20% 0.7013

### 9 Conclusion

If we summarize all the numerical results, we have the following table.

Percentage  Zero repl  Row sum mean  Col sum mean  SVD1    SVD2    LS      FRAM
1%          0.3920     0.1110        0.1096        0.1243  0.6975  0.1461  0.1540
5%          0.8623     0.2636        0.2583        0.2478  1.5759  0.4163  0.3241
10%         1.2417     0.4140        0.3590        0.3713  1.9243  0.6585  0.4875
15%         1.5316     0.4633        0.4353        0.4567  2.2972  0.8785  0.5184
20%         1.7673     0.5016        0.5472        0.6825  2.4120  1.0643  0.7013

Error e for all methods for different percentages of missing values

### 10 Discussion

In section 9 we saw that for all percentages the row and column sum mean methods work very well. For that reason we could say that it is not necessary to use such ’difficult’ methods as proposed in this thesis to estimate missing values in DNA microarrays. However, this conclusion cannot be stated this simply. In our testing matrix, we used only 1000 of the in total 5986 rows.

If we would use the full matrix, it is reasonable that our methods would give
different results. For example, if we look at the LS method, we see that this
method gives very bad results. From section 7 we know that this method
estimates missing values making use of rows that are very similar to the row
in which the value is missing. If we use only 1/6 of the total matrix, it is
reasonable that the other part contains rows that are more similar to the
’target’ row than the rows we use. So we can expect that the error for the LS method would be smaller if we would use the full matrix. Unfortunately we cannot test this, because matlab does not work with such a large matrix.

Looking at the two methods based on the SVD, we see that the first method works significantly better than the second method. This is very peculiar, because in section 6 we mentioned that in our opinion there is no reason that method 1 would work. Moreover, we have explained why method 2 works, but apparently this method does not give good results. Mathematically we still cannot explain why method 1 works this well. Probably we need more biological understanding of DNA microarrays to be able to explain this. In [3] it is mentioned that only a few eigengenes are needed to reveal the whole matrix E. This has been found by researchers who have worked with DNA microarrays. However, this fact has not been explained yet. Probably this can be done by looking at the biological functions of eigengenes, but this falls outside the scope of this thesis.

Because we used the same error measure as in the literature, we can also compare our results to the results in the literature. In [3] the FRAM method has been tested against the row sum mean method. We see that their results for the row sum mean method are the same as our results. The results for FRAM differ: for small percentages of missing values (1%-10%) the results are the same, but for larger percentages FRAM in [3] gives smaller errors than we have. The reason for this is probably that they have used the full matrix. So we see that FRAM would give better results when the full matrix is used. Unfortunately we have not been able to find literature in which the other methods have been tested.

### A DNA microarray

Before performing the methods, we have to load the data. The data is available online [11]. We saved the data as an xlsx-file. The matrix is a 5986 × 17 matrix. This matrix is too big to work with in matlab. Therefore we have to remove rows to obtain a matrix that is not too big for matlab.

We removed the last 4986 rows; the remaining matrix is of size 1000 × 17, i.e. a matrix with 1000 rows (genes) and 17 columns (samples). We save this as dnamicroarraytest.xlsx. We load the data in matlab as follows

% reading the data
datadna = xlsread('dnamicroarraytest.xlsx');
% remove rows with any NaN (any empty entry is filled with NaN)
Efull = datadna(~any(isnan(datadna),2),:);
E = Efull;
% create matrix with NaN's (randomly)
NUM = numel(E);
% percen = percentage of values that is missing
% a number should be assigned to percen
E(randperm(NUM, floor((NUM/100)*percen))) = NaN;

### B Zero replacement method

The matlab code for the zero replacement method is the following

function error = zerorepl(E, Efull)
[M,N] = size(E);
% replace missing values by zeros
E(isnan(E)) = 0;
% find difference estimation and full matrix
difference = Efull - E;
% compute error
error = (norm(difference,'fro'))/sqrt(M)/mean2(Efull);

### C Row and column sum mean method

### C.1 Row sum mean method

The matlab code for the row sum mean method is the following

function error = rowsummeanmethod(E, Efull)
% get size of E
[M,N] = size(E);
% find NaN entries
E_nan = isnan(E);
for i = 1:M
    % replace NaN by mean of that row
    E(i, E_nan(i,:)) = nanmean(E(i,:));
end
% find difference estimation and full matrix
difference = E - Efull;
% compute error
error = (norm(difference,'fro'))/sqrt(M)/mean2(Efull);

### C.2 Column sum mean method

The matlab code for the column sum mean method is the following

function error = columnsummeanmethod(E, Efull)
% get size of E
[M,N] = size(E);
% find NaN entries
E_nan = isnan(E);
for i = 1:N
    % replace NaN by mean of that column
    E(E_nan(:,i), i) = nanmean(E(:,i));
end
% find difference estimation and full matrix
difference = E - Efull;
% compute error
error = (norm(difference,'fro'))/sqrt(M)/mean2(Efull);

### D Singular Value Decomposition methods

### D.1 Estimation method 1 using SVD

The matlab code for the first singular value decomposition method is the following

function error = svd1(E, Efull, iter)
[M,N] = size(E);
% replace NaN by zero
Ep = E;
Ep(isnan(E)) = 0;
err = 1;
index = find(isnan(E));
while (err > 0.01 && iter > 0)
    % compute svd of Ep
    [U,S,V] = svd(Ep);
    r = rank(Ep);
    sigma = S(S ~= 0);
    sigmasqrt = sqrt(sigma);
    totalsingular = sum(sigmasqrt);
    % compute fraction
    fraction = sigmasqrt/totalsingular;
    % number of significant values
    sumfrac = cumsum(fraction);
    greater = find(sumfrac >= 0.7);
    s = greater(1);
    [U1,S1,V1] = svds(Ep, s);
    % compute approximation
    Elow = U1*S1*transpose(V1);
    for i = 1:length(index)
        Ep(index(i)) = Elow(index(i));
    end
    err = norm(Ep - Elow, 'fro');
    iter = iter - 1;
end
difference = Efull - Ep;
% compute error
error = (norm(difference,'fro'))/sqrt(M)/mean2(Efull);

### D.2 Estimation method 2 using SVD

The matlab code for the second singular value decomposition method is the following

function error = svd2(E, Efull, iter)
[M,N] = size(E);
% replace NaN by zero
Emiss = E;
Ep = E;
Ep(isnan(Emiss)) = 0;
err = 1;
while (err > 0.01 && iter > 0)
    % compute rank and SVD of Ep
    r = rank(Ep);
    [U,S,V] = svds(Ep, r);
    T = abs(V);
    for i = 1:M
        % for each row find missing values
        indexrow = find(isnan(Emiss(i,:)));
        x = transpose(E(i,:));
        % compute v1,..,vr, deleting the values corresponding to missing values
        for j = 1:length(indexrow)
            T((indexrow(j)-(j-1)),:) = [];
            x((indexrow(j)-(j-1))) = [];
        end
        % compute coefficients
        a = T\x;
        y = E(i,:);
        x = transpose(y);
        [U,S,V] = svds(Ep, r);
        T = abs(V);
        % compute estimations for missing values
        for j = 1:length(indexrow)
            q = T(indexrow(j),:);
            x(indexrow(j)) = transpose(a)*transpose(q);
        end
        E(i,:) = transpose(x);
    end
    err = norm(Ep - E, 'fro');
    Ep = E;
    iter = iter - 1;
end
difference = Efull - Ep;
% compute error
error = (norm(difference,'fro'))/sqrt(M)/mean2(Efull);

### E Least Squares imputation method

For the LS method we have described an algorithm to find t (see section 7.3).

We will not implement this algorithm in matlab; the best value for t will be found by looking at the results we get from the code. The matlab code for the least squares imputation method is the following

function error = LS(E, Efull, t)
% selecting t genes
[M,N] = size(E);
% replace NaN by zero
Emiss = E;
err = 1;
Ep = E;
Ep(isnan(Emiss)) = 0;
Elat = Ep;
numremoved = 0;
for i = 1:M
    % for each row find missing values
    Ep = Elat;
    indexrow = find(isnan(Emiss(i,:)));
    numremoved = 0;
    numforremoved = 0;
    if isempty(indexrow) == 0