Faculty of Mathematics and Natural Sciences
Missing value estimation for DNA microarrays
Bachelor thesis Mathematics
June 2013
Student: F. Doddema
First supervisor: Prof. dr. H.L. Trentelman Second supervisor: Prof. dr. J. Top
Abstract
A DNA microarray is a biological tool to measure the activity of genes in different samples. These samples come, for example, from a tumor. A DNA microarray can simultaneously measure the activity of M genes in N samples, where M ≈ 20,000 and N ≈ 20.
The results can be given as an M ×N matrix, the expression matrix E.
Sometimes elements in this matrix are missing due to various factors, for example dust on the chip or machine error. In this thesis we describe several methods to estimate the missing values. We compare these methods using DNA microarray data that is available online.
Contents
1 Introduction
  1.1 Genes, DNA and RNA
  1.2 DNA microarray
  1.3 Missing values in DNA microarrays
2 Mathematical background
3 Structure of the subsequent sections
4 Zero replacement method
  4.1 How does it work?
  4.2 Algorithm
  4.3 Numerical results
5 Row and column sum mean method
  5.1 How does it work?
    5.1.1 Row sum mean method
    5.1.2 Column sum mean method
  5.2 Algorithm
    5.2.1 Row sum mean method
    5.2.2 Column sum mean method
  5.3 Numerical results
    5.3.1 Row sum mean method
    5.3.2 Column sum mean method
6 Singular Value Decomposition methods
  6.1 SVD and Expression matrix
  6.2 How does it work?
    6.2.1 Estimation method 1 using SVD
    6.2.2 Estimation method 2 using SVD
  6.3 Algorithm
    6.3.1 Estimation method 1 using SVD
    6.3.2 Estimation method 2 using SVD
  6.4 Numerical results
    6.4.1 Estimation method 1 using SVD
    6.4.2 Estimation method 2 using SVD
7 Least Squares imputation method
  7.1 Selecting t genes
  7.2 Regressing with LSimputation
  7.3 Finding t
  7.4 Algorithm
  7.5 Numerical results
8 Fixed Rank Approximation method
  8.1 How does it work?
  8.2 The algorithm
  8.3 Numerical results
9 Conclusion
10 Discussion
A DNA microarray
B Zero replacement method
C Row and column sum mean method
  C.1 Row sum mean method
  C.2 Column sum mean method
D Singular Value Decomposition methods
  D.1 Estimation method 1 using SVD
  D.2 Estimation method 2 using SVD
E Least Squares imputation method
F Fixed Rank Approximation method
1 Introduction
Terms such as genes and DNA are used a lot in daily life. DNA is used to identify suspects in criminal cases, but it also tells us a lot about our body. In DNA one can, for example, find the colour of a person's eyes and hair. Actually, one's whole appearance is described in his DNA.
DNA can not only tell us something about how someone looks, it can also tell us something about the inside of a person's body. In this way researchers try to develop drugs against, for example, cancer. In section 1.1 we will give an introduction to genes, DNA and RNA. In section 1.2 we will explain a method, called DNA microarray, that is used to understand the working of genes in particular diseases. In the last section of this introduction, section 1.3, we will explain which problem we will discuss in this thesis and how we will try to solve it with mathematics.
1.1 Genes, DNA and RNA
Looking at a human being we see one body. However, if we would look at a human body through a microscope, we would see that it consists of a lot of cells. The average size of a human cell is 15 µm (15 × 10^−6 m). Each cell contains cytoplasm, a fluid that contains all the organelles of a cell; the cytoplasm is surrounded by a membrane. Some examples of organelles are mitochondria, ribosomes and the nucleus, see figure 1(a). The nucleus of a cell contains DNA, a molecule that is made of two complementary strands, which form a double helix. Each strand is formed by 4 kinds of bases, called C, G, A and T (cytosine, guanine, adenine and thymine). The two strands are held together by hydrogen bonds, which connect matching pairs of bases: C is connected with G and A with T (see figure 1(b)). The nucleus contains a lot of DNA, which is spread among 46 chromosomes. The DNA consists of multiple segments, which are called genes. Every cell of a human being contains approximately 30,000 to 40,000 genes. This means that every cell of a human body contains all genes (a cell in the foot, for example, also contains the DNA that codes for the colour of your eyes). The particular sequence of bases in a gene plays a very important role in life. The reason for this is that the particular sequence corresponds to a particular protein.
An example of a protein is hemoglobin, a protein that is involved in the transport of oxygen and carbon dioxide in blood.
Figure 1(a): Human cell [10]. Figure 1(b): DNA molecule [1].
We hope that it is clear that proteins are very important molecules in life.
Now we will explain how proteins are produced. As said before, the DNA that codes for proteins is situated in the nucleus of a cell. However, proteins are not produced in the nucleus; they are produced by ribosomes, organelles outside the nucleus in the cytoplasm. For this reason one would say that the segment of DNA should be ripped out of the nucleus and transferred to a ribosome. However, this process is a bit more complicated. When a gene receives a command to code for a protein, the segment of DNA that is needed opens, such that we have two strands of DNA. Then one strand is copied; the copy is called messenger RNA (mRNA), and this process is called transcription. This piece of mRNA goes out of the nucleus to a ribosome, where the particular sequence of bases codes for a protein; this process is called translation. Figure 1(c) describes the mechanism of transcription and translation. The collection of all genes that code for proteins is called the genome. [9]
Figure 1(c): Transcription and translation [2].
1.2 DNA microarray
To measure the activity of thousands of genes in a cell, a DNA microarray is used. A DNA microarray is a chip with thousands of 'holes' in it. In every 'hole' we can find a sequence of single-stranded DNA that corresponds to a gene. When we have some cells of, for example, a tumor, the mRNA in those cells is extracted and labeled fluorescently. This RNA is spread over the chip. When an RNA molecule from the tumor encounters a fragment of DNA that matches, it will form a double helix with that DNA fragment. Genes that do not correspond to the RNA will stay single-stranded. Then the chip is washed off, such that only the molecules that found their perfect match remain stuck to the chip.
Using a laser we can measure the expression level of each gene; the result looks like figure 1(d). This can be done because the RNA from the tumor was marked fluorescently. For a more detailed description of the working of a DNA microarray, we refer to [7]. We described the basic principles of a DNA microarray for only one sample (the tumor). In practice a DNA microarray is run simultaneously for more than one sample, for example 100 samples. The DNA microarray contains the DNA for approximately 20,000 genes. The result of this DNA microarray can be translated to an M × N matrix (a rectangular array with M rows and N columns), where each row corresponds to a gene and each column to a sample. [9] [7]
Figure 1(d): DNA microarray [7].
1.3 Missing values in DNA microarrays
Sometimes elements in the expression matrix are missing due to various fac- tors, for example dust on the chip or machine error. In this thesis we will discuss several methods to estimate the missing values in the expression ma- trix E. We will discuss the following methods:
1. Zero replacement method
2. Row and column sum mean method
3. Singular Value Decomposition methods
4. Least Squares imputation method
5. Fixed Rank Approximation method
In section 2 we will give some mathematical background that is needed to understand the contents of this thesis. In section 3 we will discuss the structure of sections 4 to 8. There we will explain how we will discuss every method and how we will test it. In section 9 we will compare all the methods based on the errors of their estimates, and in section 10 we will discuss the results.
2 Mathematical background
In this section we will explain all the mathematical background that is needed to understand the contents of this thesis.
In this thesis we deal with real matrices. A real matrix is a rectangular array of real numbers. The size of a matrix is denoted by M × N, where M is the number of rows and N is the number of columns. For a matrix E, we denote the element in row i and column j by Eij. In this thesis E is the expression matrix, where Eij stands for the expression level of gene i in sample j. When E is of size M × N, this means that we deal with M genes and N samples.
Let us denote row i of E by gi and column j by cj. In this way gi gives the expression levels of the N samples for gene i and cj gives the expression levels of the M genes in sample j. So E can be written as

\[
E = \begin{pmatrix} g_1 \\ g_2 \\ \vdots \\ g_M \end{pmatrix}
  = \begin{pmatrix} c_1 & c_2 & \cdots & c_N \end{pmatrix}.
\]
Definition 1. Let E be an M × N matrix. The row space of E is the subspace of ℝ^N spanned by the row vectors of E. Similarly, the column space of E is the subspace of ℝ^M spanned by the column vectors of E.
Definition 2. Let L : V → W be a linear transformation and let S be a subspace of V. The image of S, denoted L(S), is defined by
L(S) = {w ∈ W | w = L(v) for some v ∈ S}.
The image of the entire vector space, L(V ), is called the range of L.
Using definition 2 [8] we can think of a matrix E as the linear transformation mapping ℝ^N to ℝ^M. In that way we see that the column space of E is exactly the same as the range of E, R(E). In this thesis we will denote the column space of E by R(E). The row space of E will be denoted by R(E^T), because the column space of E^T is the same as the row space of E.
Definition 3. Let x1, x2, . . . , xn be nonzero vectors in an inner product space V. If ⟨xi, xj⟩ = 0 whenever i ≠ j, then x1, x2, . . . , xn is said to be an orthogonal set of vectors.
Definition 4. An orthonormal set of vectors is an orthogonal set of unit vectors, where unit vectors are vectors of length 1.
Definition 5. An M × M matrix U is called an orthogonal matrix if the column vectors of U form an orthonormal set in ℝ^M.
Definition 6. λ and x ≠ 0 are respectively an eigenvalue and an eigenvector of a matrix E if Ex = λx.
Theorem 1. If A is a real symmetric matrix, then the rank of A is the same as the number of nonzero eigenvalues (taking into account their algebraic multiplicities).[8]
Theorem 2 (The SVD Theorem). If E is an M × N matrix, then E can be factorized as E = UΣV^T, where U is an M × M orthogonal matrix, V is an N × N orthogonal matrix and Σ is an M × N matrix whose off-diagonal entries are all zeros and whose diagonal elements satisfy σ1 ≥ σ2 ≥ . . . ≥ σr > σ_{r+1} = . . . = σ_{min(M,N)} = 0, where r denotes the rank of E. This decomposition is called the singular value decomposition.
In our case E is the expression matrix, where M denotes the number of genes and N the number of samples. In general we can assume that the number of genes is much greater than the number of samples, so we can state M ≥ N. In that case the singular values satisfy σ1 ≥ σ2 ≥ . . . ≥ σr > σ_{r+1} = . . . = σN = 0. This will be a standing assumption throughout this thesis. Moreover, we denote the rank of E by r.
The proof of this theorem falls outside the scope of this thesis. For a proof we refer to [8]. In the proof one constructs the matrices U, Σ and V^T to show that such a decomposition exists. Here we will only mention how to construct these matrices.
Let λ1, λ2, . . . , λN be the eigenvalues of E^T E, ordered such that λ1 ≥ λ2 ≥ . . . ≥ λN ≥ 0, let v1, v2, . . . , vN be corresponding orthonormal eigenvectors, and set V = (v1, v2, . . . , vN) and σj = √λj for j = 1, . . . , N.
Next let Σ be the M × N matrix defined by

\[
\Sigma = \begin{pmatrix} \Sigma_1 & O \\ O & O \end{pmatrix},
\qquad \text{with } \Sigma_1 = \begin{pmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{pmatrix}, \tag{4}
\]
where r denotes the rank of E. Because E is of rank r, E^T E is also of rank r. Moreover, E^T E is symmetric, so its number of nonzero eigenvalues (and consequently the number of nonzero singular values of E) equals its rank, see theorem 1. Now define
\[
u_j = \frac{1}{\sigma_j} E v_j \quad \text{for } j = 1, \ldots, r. \tag{5}
\]

Let u_{r+1}, u_{r+2}, . . . , u_M be an orthonormal basis for N(E^T) and set

\[
U = (u_1, u_2, \ldots, u_r, u_{r+1}, \ldots, u_M). \tag{6}
\]

Then U is an M × M orthogonal matrix.
If we define U, Σ and V in this way, then

E = UΣV^T.    (7)
We now list some properties of the singular value decomposition without proving them. For a proof we refer to [8].
1. The vj’s are called the right singular vectors of E, and the uj’s are called the left singular vectors of E.
2. If E has rank r, then
(a) v1, . . . , vr form an orthonormal basis for R(ET).
(b) vr+1, . . . , vN form an orthonormal basis for N (E).
(c) u1, . . . , ur form an orthonormal basis for R(E).
(d) ur+1, . . . , uM form an orthonormal basis for N (ET).
Theorem 3. If E is an M × N matrix, then there exists a unique N × M matrix E† that satisfies the following conditions:
1. EE†E = E
2. E†EE† = E†
3. (EE†)^T = EE†
4. (E†E)^T = E†E

E† is called the Moore-Penrose pseudoinverse of E, and it is computed as

\[
E^\dagger = V \begin{pmatrix} \Sigma_1^{-1} & O \\ O & O \end{pmatrix} U^T,
\]

where U, V and Σ1 are the matrices coming from the SVD of E. [8]
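As an illustration, the following matlab fragment (a minimal sketch; the 3 × 2 matrix E is an arbitrary example of our own choosing) verifies the factorization of theorem 2 and the pseudoinverse formula of theorem 3 against matlab's built-in functions svd and pinv:

% minimal sketch: verify the SVD factorization and the pseudoinverse formula
E = [3 1; 2 2; 1 3];                 % arbitrary example with M = 3, N = 2
[U, S, V] = svd(E);                  % U is 3x3, S is 3x2, V is 2x2
norm(E - U*S*V', 'fro')              % ~0, so E = U*Sigma*V^T
r = rank(E);
Sinv = zeros(size(E'));              % the N x M matrix [Sigma1^{-1} O; O O]
Sinv(1:r, 1:r) = diag(1 ./ diag(S(1:r, 1:r)));
Edagger = V * Sinv * U';             % Moore-Penrose pseudoinverse of E
norm(Edagger - pinv(E), 'fro')       % ~0, agrees with matlab's pinv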
3 Structure of the subsequent sections
In the next sections we will discuss the methods that we have mentioned in section 1.3. Before starting to discuss the first method we will explain how we set up each of these sections. In each section we will first discuss how the method works. We will discuss the mathematical principle and give an algorithm for the method.
Next, we will perform the method on a testing expression matrix E from a DNA microarray. This testing matrix is available online [11]. From the testing matrix we only use those genes (rows) and samples (columns) that contain all entries, such that we have no missing values. This matrix is what we call our full testing matrix Efull. Because we try to estimate missing values, we randomly remove entries from Efull; the remaining matrix is denoted by E. In every method we try to estimate the missing values; the matrix we obtain is Eestimate. Because we also have Efull, we can compute the error between Efull and Eestimate. This error is the Frobenius norm of Efull − Eestimate, divided by µ, the mean of all elements of Efull, and by √M, the square root of the number of rows of E. We use this error because it is also used in the literature, see for example [3]. In this way we can easily compare our results to results from the literature. Denoting the error by e, we have

\[
e = \frac{1}{\mu\sqrt{M}}\,\lVert E_{full} - E_{estimate} \rVert_F
  = \frac{1}{\mu\sqrt{M}}\,\sqrt{\langle E_{full} - E_{estimate},\ E_{full} - E_{estimate} \rangle}
  = \frac{1}{\mu\sqrt{M}}\,\sqrt{\sum_{i=1}^{M}\sum_{j=1}^{N}\big((E_{full})_{ij} - (E_{estimate})_{ij}\big)^2}.
\]
Using this error we can compare the methods and show which method gives the best results, i.e. the smallest error. This will be done in section 9.
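As a small sketch (assuming Efull and Eestimate are available as matrices of equal size), the error e can be computed in matlab in the same way as in the code of the appendices:

% sketch: the relative estimation error e of section 3
M  = size(Efull, 1);
mu = mean2(Efull);                             % mean of all elements of Efull
e  = norm(Efull - Eestimate, 'fro') / (mu * sqrt(M));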
4 Zero replacement method
The zero replacement method is the first method we will discuss. The zero replacement method is a very easy method to understand, but does not give good results as we will see later on.
4.1 How does it work?
If elements of our expression matrix E are missing, this method tells us to place zeros in the positions of the missing elements. In this way the other elements of the expression matrix are not used, which is why this method does not give very good results.
4.2 Algorithm
The algorithm for this method is as follows
1. Replace all the missing values in E by zeros, this gives Eestimate.
4.3 Numerical results
If we run the matlab code in appendix B 5 times for 1%, 5%, 10%, 15% and 20% of missing values, the means of the errors e (see section 3) are the following:
Percentage e
1% 0.3920
5% 0.8623
10% 1.2417
15% 1.5316
20% 1.7673
5 Row and column sum mean method
The second and third methods to estimate the missing values are the row sum mean and the column sum mean method.
5.1 How does it work?
5.1.1 Row sum mean method
In this method missing values are estimated by the mean of present elements in the same row of the missing element. In this way an element is estimated by the expression levels of the same gene in other samples.
5.1.2 Column sum mean method
This method works almost the same as the Row sum mean method, how- ever now missing values are estimated by the mean of the elements in the same column. This means that an element is estimated by the mean of the expression levels of other genes in the same sample.
5.2 Algorithm
5.2.1 Row sum mean method
1. Replace missing values in E by the mean of the non-missing elements in the same row. So if Eij is missing, replace Eij by the mean of the non-missing elements in gi. Doing this for all the missing elements gives Eestimate.
5.2.2 Column sum mean method
1. Replace missing values in E by the mean of the non-missing elements in the same column. So if Eij is missing, replace Eij by the mean of the non-missing elements in cj. Doing this for all the missing elements gives Eestimate.
5.3 Numerical results
5.3.1 Row sum mean method
If we run the matlab code in appendix C.1 5 times for 1%, 5%, 10%, 15% and 20% of missing values, the means of the errors e (see section 3) are the following:
Percentage e
1% 0.1110
5% 0.2636
10% 0.4140
15% 0.4633
20% 0.5016
5.3.2 Column sum mean method
If we run the matlab code in appendix C.2 5 times for 1%, 5%, 10%, 15% and 20% of missing values, the means of the errors e (see section 3) are the following:
Percentage e
1% 0.1096
5% 0.2583
10% 0.3590
15% 0.4353
20% 0.5472
6 Singular Value Decomposition methods
In section 2 we have discussed the basic principles of the Singular Value Decomposition. In the remaining part of this thesis this decomposition will be abbreviated as the SVD. In this section we will discuss the application of the SVD to the estimation of missing values in an expression matrix E. In the first part of this section we will give some general information about the application of the SVD in estimating missing values of our matrix E. In the second and third part we will discuss two methods, both using the SVD, for estimating the missing values in the matrix E. The first method is discussed in [3] and the second in [13].
6.1 SVD and Expression matrix
In section 2 we have discussed the SVD. We have shown that our M × N expression matrix E (with M ≥ N ) can be factorized as E = U ΣVT with U and V orthogonal matrices and Σ a matrix whose off-diagonal entries are all zero and whose diagonal elements satisfy σ1 ≥ σ2 ≥ . . . ≥ σr > σr+1 = . . . = σN = 0 (with r the rank of E).
In the expression matrix E we assume that some elements are missing and we want to estimate these elements. However, in order to compute the SVD of the matrix, the matrix should contain all of its elements. That is why we have to assign specific numbers to the missing values before we can compute the SVD. We can choose to replace the missing values with zero (according to the zero replacement method, section 4). We could also choose to fill the missing entries with the mean of the rows or the columns (see section 5). Values computed from another method could also be used. Having done that, we can compute the SVD of E.
If E = U ΣVT, we denote the columns of U as u1, u2, . . . , uM (the left singular vectors) and the columns of V as v1, v2, . . . , vN (the right singular vectors).
We know that if the matrix E has rank r, then σi = 0 for i = r + 1, . . . , N .
Using these properties, we can write E = UΣV^T as

\[
E = U \Sigma V^T
  = (u_1, u_2, \ldots, u_M)
    \begin{pmatrix}
      \sigma_1 & & & & \\
      & \ddots & & & \\
      & & \sigma_r & & \\
      & & & \ddots & \\
      & & & & \sigma_N
    \end{pmatrix}
    \begin{pmatrix} v_1^T \\ v_2^T \\ \vdots \\ v_N^T \end{pmatrix}
  = \sigma_1 u_1 v_1^T + \ldots + \sigma_r u_r v_r^T + \sigma_{r+1} u_{r+1} v_{r+1}^T + \ldots + \sigma_N u_N v_N^T
  = \sigma_1 u_1 v_1^T + \ldots + \sigma_r u_r v_r^T
  = \sum_{i=1}^{r} \sigma_i u_i v_i^T.
\]
In this way we see that E can be written as U_1 Σ_1 V_1^T with

\[
U_1 = (u_1, u_2, \ldots, u_r), \qquad
\Sigma_1 = \begin{pmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_r \end{pmatrix}
\qquad \text{and} \qquad
V_1 = (v_1, v_2, \ldots, v_r).
\]
In section 2 we have mentioned that u1, u2, . . . , ur form an orthonormal basis for R(E), the column space of E. E is our expression matrix, and in this matrix column i stands for the expression levels of all M genes in sample i. From this perspective we can say that u1, u2, . . . , ur form an orthonormal basis for the subspace of ℝ^M spanned by the columns of E. Similarly we can say that v1^T, v2^T, . . . , vr^T form an orthonormal basis for the subspace of ℝ^{1×N} spanned by the rows of E. Following [3], the right singular vectors vj^T are therefore called eigengenes and the left singular vectors uj eigensamples. Not every eigengene contributes equally to the expression matrix; we denote by s the number of significant eigengenes. The corresponding eigensamples and singular values are called respectively significant eigensamples and significant singular values. There are several methods to calculate s. Three commonly used methods are
1. fraction criterion
2. scree plots for σ_q²
3. broken stick model
Here we will only discuss the first method. For the other methods we refer to [5]. The fraction criterion works as follows. For all r eigengenes we compute the fraction

\[
p_q = \frac{\sigma_q^2}{\sum_{j=1}^{r} \sigma_j^2}, \qquad q = 1, \ldots, r.
\]
In this way pq is a measure for the fraction of the expression level contributed by the qth eigengene [3]. Once pq is computed for q = 1, 2, . . . , r, we order the fractions decreasingly, say

p_{q_1} ≥ p_{q_2} ≥ . . . ≥ p_{q_r}.
The s significant eigengenes are the s eigengenes for which

\[
\sum_{k=1}^{s} p_{q_k} \geq 0.7, \qquad \text{while} \qquad \sum_{k=1}^{s-1} p_{q_k} < 0.7.
\]
These significant eigengenes will be used to estimate the missing values in the two methods that will follow in next part of this section.
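A minimal matlab sketch of the fraction criterion (assuming the matrix E has been completed, so that its SVD can be computed):

% sketch: number s of significant eigengenes by the fraction criterion
sigma = svd(E);                   % singular values, sorted decreasingly
p = sigma.^2 / sum(sigma.^2);     % fractions p_q
s = find(cumsum(p) >= 0.7, 1);    % smallest s with p_1 + ... + p_s >= 0.7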
6.2 How does it work?
6.2.1 Estimation method 1 using SVD
This method comes from [3] and in our opinion it is unclear why this method would work. For that reason we will only give the algorithm and will not explain it.
6.2.2 Estimation method 2 using SVD
In this method we use a property of the SVD as mentioned in section 2. There it was claimed that if we decompose E into UΣV^T, then v1, v2, . . . , vr form an orthonormal basis for R(E^T). In our case row i gives the expression levels of gene i for the N samples. Consequently the (row) vectors v1^T, v2^T, . . . , vr^T form an orthonormal basis for the subspace of ℝ^{1×N} spanned by the rows of E. For this reason every row gk of E can be written as a linear combination of v1^T, v2^T, . . . , vr^T. This means that we can find α1, α2, . . . , αr such that

gk = α1 v1^T + α2 v2^T + . . . + αr vr^T.

In this method we use only those elements of gk that are known, together with the corresponding elements of v1^T, v2^T, . . . , vr^T, to find α1, α2, . . . , αr. These coefficients are then used to estimate the missing value(s) in gk from the corresponding elements in v1^T, v2^T, . . . , vr^T. This is done for all the missing elements of E. This whole procedure is repeated until there is almost no change in Eestimate.
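The following matlab fragment sketches this estimation step for a single row k with one missing entry q; the variable names are ours, and Et denotes the current completed matrix:

% sketch: estimate the missing entry Et(k,q) from the right singular vectors
r = rank(Et);
[~, ~, V] = svds(Et, r);            % columns v_1, ..., v_r
g = Et(k, :).';                     % row k as a column vector
known = true(size(g));
known(q) = false;                   % leave the missing entry out
alpha = V(known, :) \ g(known);     % least squares fit on the known entries
Et(k, q) = V(q, :) * alpha;         % estimate the missing entry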
6.3 Algorithm
6.3.1 Estimation method 1 using SVD
1. Replace the missing values with zeros or with values from another method, as discussed in section 6.1. We call this matrix E0.

2. Calculate the matrices U, Σ and V of the SVD of the matrix Et (starting with t = 0).

3. Calculate the fraction pq for q = 1, . . . , r and find in this way the s significant singular values.

4. Compute the rank-s approximation Elow = σ1 u1 v1^T + . . . + σs us vs^T of Et.

5. Replace the missing entries of Et by the corresponding entries of Elow; the resulting matrix is Et+1.

6. Repeat steps 2 to 5 until the difference in Frobenius norm between Et+1 and Et is smaller than a certain number, for example 0.01; then we stop the procedure and the final matrix is Eestimate.
The difference in Frobenius norm is calculated as follows:

\[
\lVert E_{p+1} - E_p \rVert_F
= \sqrt{\langle E_{p+1} - E_p,\ E_{p+1} - E_p \rangle}
= \sqrt{\sum_{i=1}^{M}\sum_{j=1}^{N}\big((E_{p+1})_{ij} - (E_p)_{ij}\big)^2}.
\]
We could also use different stopping criteria, but in this thesis we will only use the Frobenius norm.
6.3.2 Estimation method 2 using SVD
1. Replace the missing values with zeros or with values from another method, as discussed in section 6.1. We call this matrix E0.
2. Calculate the matrix V as used in the SVD for the matrix Et (starting with t = 0).
3. Extract the vectors v1, v2, . . . , vr from V where r denotes the rank of Et.
4. For each missing value, write the row in which the missing value occurs as a linear combination of v1^T, v2^T, . . . , vr^T. So if Ekq is missing, let g'k be row gk with the qth element removed, and let v1^{T'}, v2^{T'}, . . . , vr^{T'} be the vectors v1^T, v2^T, . . . , vr^T with the qth element removed from every vector. Then we write row g'k as a linear combination of v1^{T'}, v2^{T'}, . . . , vr^{T'}, i.e.

g'k = α1 v1^{T'} + α2 v2^{T'} + . . . + αr vr^{T'}.
5. Using the coefficients α1, α2, . . . , αr, estimate the qth entry of gk using the qth entries of v1^T, v2^T, . . . , vr^T.
6. Steps 4 and 5 should be done for all missing entries in Et.
7. Replace the missing values in Et by the estimated values computed in steps 4, 5 and 6; the resulting matrix is Et+1.
8. Repeat this procedure until the matrix does not change. Here we use again the Frobenius norm to compare Et+1 with Et.
9. The final matrix Et+1 is our Eestimate.
6.4 Numerical results
6.4.1 Estimation method 1 using SVD
If we run the matlab code in appendix D.1 5 times for 1%, 5%, 10%, 15% and 20% of missing values, the means of the errors e (see section 3) are the following:
Percentage e
1% 0.1243
5% 0.2478
10% 0.3713
15% 0.4567
20% 0.6825
6.4.2 Estimation method 2 using SVD
If we run the matlab code in appendix D.2 5 times for 1%, 5%, 10%, 15% and 20% of missing values, the means of the errors e (see section 3) are the following:
Percentage e
1% 0.6975
5% 1.5759
10% 1.9243
15% 2.2972
20% 2.4120
7 Least Squares imputation method
In this section we describe the method of least squares imputation, abbreviated as LSimputation. We follow the approach in [6]. In this method we can distinguish two steps. In the first step we select t rows (genes) that are very similar to the row in which a value (or values) is (or are) missing, say row gk. In the second step we regress gk against these t rows and use the coefficients of regression to estimate the missing value(s). To determine these coefficients of regression, we use a method based on least squares. Before performing step one, we have to find the number t. How t is found is described in section 7.3.
7.1 Selecting t genes
Here we will describe how we select the t genes mentioned above (assuming that t is known). In the literature different methods are described to find these t genes. Here we describe two methods: the first uses the so-called Pearson correlation coefficient (PCC) and the second the Euclidean distance. First, we will describe the method using the PCC, as proposed in [6].
Suppose gk has a missing value in its qth entry. Then set

g'k = (g_{k1}, g_{k2}, . . . , g_{k(q−1)}, g_{k(q+1)}, . . . , g_{kN}).

Thus g'k is the row vector gk with the missing value g_{kq} deleted. If more values in this gene are missing, we also delete these values. In the following we assume that only one value is missing, say g_{kq}. If we want to compute the PCC between gk and gs, we have to set up the vector g's by deleting the qth entry. We assume that gs does not have a missing value in its qth entry.
The PCC r_{ks} between gk and gs is defined as

\[
r_{ks} = \frac{1}{N-1} \sum_{w=1}^{N-1} \left( \frac{g'_{kw} - \bar{g}'_k}{\sigma_k} \right) \left( \frac{g'_{sw} - \bar{g}'_s}{\sigma_s} \right),
\]
where ḡ'k is the average of the values in g'k, ḡ's the average of the values in g's, and σk and σs are respectively the standard deviations of the elements of g'k and g's. The t genes with the largest absolute PCC with respect to gk are the genes that are most similar to gk. Thus if we denote these t genes by g_{s_1}, . . . , g_{s_t}, then

|r_{ks_1}| ≥ |r_{ks_2}| ≥ . . . ≥ |r_{ks_t}| ≥ |r_{ks_{t+1}}| ≥ . . . ≥ |r_{ks_M}|.
The second method to find the t genes uses the Euclidean distance; this method is proposed in [9]. In this method the distance between two genes is defined as

\[
d_E(g_i, g_j) = \sqrt{\sum_{k=1}^{N} (g'_{ik} - g'_{jk})^2}.
\]
The t genes for which the Euclidean distance to gk is smallest are selected as the t most similar genes, i.e.

d_E(gk, g_{s_1}) ≤ d_E(gk, g_{s_2}) ≤ . . . ≤ d_E(gk, g_{s_t}) ≤ d_E(gk, g_{s_{t+1}}) ≤ . . . ≤ d_E(gk, g_{s_M}).
It is known [9] that the Euclidean distance is very sensitive to outliers. This problem does not occur with the PCC, therefore we will use the method using PCC to select the t genes.
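A matlab sketch of this selection step (variable names are ours; t is assumed to be known, see section 7.3, and we assume the candidate rows have no missing values in the columns where row k is known, otherwise corrcoef would return NaN):

% sketch: select the t rows most similar to row k according to the PCC
known = ~isnan(E(k, :));                 % entries of g_k that are present
gk = E(k, known);
pcc = zeros(M, 1);
for s = 1:M
    if s ~= k
        c = corrcoef(gk, E(s, known));   % Pearson correlation coefficient
        pcc(s) = c(1, 2);
    end
end
[~, order] = sort(abs(pcc), 'descend');  % largest |r_ks| first
similar = order(1:t);                    % indices s_1, ..., s_t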
7.2 Regressing with LSimputation
After we have obtained g_{s_1}, g_{s_2}, . . . , g_{s_t}, we want to estimate the missing value g_{kq}. We will estimate g_{kq} using linear regression of gk against g_{s_1}, g_{s_2}, . . . , g_{s_t}. Having done this, we will use the coefficients of regression to estimate g_{kq} using the qth elements of these t genes. Now set

b = ((g_{s_1})_q, (g_{s_2})_q, . . . , (g_{s_t})_q);

then we want to find α1, α2, . . . , αt to estimate g_{kq} in the following way:

\[
g_{kq} = \sum_{p=1}^{t} \alpha_p (g_{s_p})_q = b\,a, \qquad \text{where } a = (\alpha_1, \alpha_2, \ldots, \alpha_t)^T.
\]
Now we set A to be the t × (N−1) matrix whose rows are g'_{s_1}, g'_{s_2}, . . . , g'_{s_t}, i.e. the t similar genes restricted to the entries where gk is known, and we set w = (g'k)^T, the column vector of the known entries of gk. The coefficients a are determined from the least squares problem of fitting A^T a to w. As explained before, we will then estimate the missing value as g_{kq} = ba.
The solution to this least squares problem is

a = (A^T)^† w,

where (A^T)^† is the Moore-Penrose pseudoinverse of A^T (see section 2). In [8] it is proven that this is indeed the solution to this least squares problem. Thus we estimate the missing value as

g_{kq} = ba = b (A^T)^† w.
If gk has more than one missing value, then we have to adapt the algorithm. In that case we should not create a vector b, but a matrix B ∈ ℝ^{t×z} (assuming z values are missing), whose t row vectors are formed by the g_{s_i} for i = 1, . . . , t, where we only use the elements corresponding to elements in gk that are missing. If, for example, the third, fifth and tenth element are missing, then we only use the third, fifth and tenth elements of the g_{s_i} to create the matrix B. The matrix A ∈ ℝ^{t×(N−z)} and the vector w ∈ ℝ^{(N−z)×1} are formed in almost the same way as before, except that we take into account that more values are missing. Then the vector u of the z missing values is estimated as follows:

u = B^T a = B^T (A^T)^† w.
This procedure has to be performed for all genes containing missing values.
In this way we get Eestimate.
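Under the notation of this section, a matlab sketch of the regression step could look as follows (similar, known and miss are index sets we assume to be given, holding the t most similar rows and the known and missing columns of g_k):

% sketch: estimate the z missing entries of row k by least squares regression
A = E(similar, known);    % t x (N-z): known columns of the t similar rows
B = E(similar, miss);     % t x z: columns where g_k is missing
w = E(k, known).';        % (N-z) x 1: known entries of g_k
a = pinv(A.') * w;        % regression coefficients, a = (A^T)^dagger * w
u = B.' * a;              % estimates for the z missing entries
E(k, miss) = u.';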
7.3 Finding t
Before we can perform the algorithm described in section 7.1, we have to find the number t. This number is not the same for every gene, because it depends on the biological function of this gene. Moreover, the samples we use in the DNA microarray also influence t. Therefore it is very difficult to find an algorithm that finds the best value for t. In [6] an algorithm for finding t is proposed. This algorithm finds t using the elements in gk that are known. The algorithm works as follows. In section 7.2 we described how we set up the matrices A and B and the vector w to compute the vector a with the coefficients of regression. To find t, we also use these matrices and vectors. The vector w contains the elements of gk that are known. Suppose we pretend that w1 is also missing, so that we have z + 1 missing values, and we want to estimate w1 using the algorithm described in section 7.2. Now we can estimate w1 using different values of t. In this way we can compare the estimated value with the true value of w1. Then we choose the value of t that gives the best approximation for w1, i.e. for which |(w1)_true − (w1)_approximation| is as small as possible. This procedure could be repeated for several known values, and finally we can choose the value of t that gives the best approximation, as sketched below.
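A matlab sketch of this search, reusing the quantities of the previous sketches (the candidate values for t are our own choice):

% sketch: choose t by pretending the known entry w(1) is missing as well
tcand = 5:5:50;                              % candidate values of t
besterr = inf;
for tc = tcand
    A1 = E(order(1:tc), known);              % the tc most similar rows
    w1 = E(k, known).';
    a1 = pinv(A1(:, 2:end).') * w1(2:end);   % regress without the held-out entry
    west = A1(:, 1).' * a1;                  % predict the held-out entry
    if abs(w1(1) - west) < besterr
        besterr = abs(w1(1) - west);
        t = tc;
    end
end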
7.4 Algorithm
1. Find the rows gk in which values are missing.
2. Find t using the method proposed in section 7.3.
3. For each row gk in which values are missing, find the t most similar rows using the PCC, as described in section 7.1.
4. Create the matrices A and B (or vector b if only one value is missing) as described in section 7.2. Also create the vector w.
5. The vector u with the z missing values is calculated as u = B^T a = B^T (A^T)^† w.
6. Replace the missing values in E by the values u1, u2, . . . , uz to get Eestimate.
7.5 Numerical results
If we run the matlab code in appendix E 5 times for 1%, 5%, 10%, 15% and 20% of missing values, the means of the errors e (see section 3) are the following:
Percentage e
1% 0.1461
5% 0.4163
10% 0.6585
15% 0.8785
20% 1.0643
8 Fixed Rank Approximation method
In this section we will explain a method called the Fixed Rank Approximation method, which we will abbreviate as FRAM. FRAM makes use of the SVD, as discussed in section 2. In this section we will first explain the idea behind FRAM; then we will give an algorithm to perform this method. This method comes from [3] and [12].
8.1 How does it work?
In section 2 we have discussed the SVD. We have mentioned that every matrix E can be decomposed as E = UΣV^T, with U and V orthogonal matrices and Σ a matrix whose off-diagonal entries are all zeros and whose diagonal entries satisfy σ1 ≥ σ2 ≥ . . . ≥ σr > σ_{r+1} = . . . = σN = 0; the σi's are called the singular values of E.
We have also explained that the number of nonzero singular values is equal to the rank of the matrix E. Suppose that our expression matrix E has missing values. Then we can remove all rows of E containing missing values; we call the remaining matrix E′. Let l0 denote the number of significant singular values of E′ (see section 6.1 for an explanation of significant singular values). The idea behind FRAM is to find a completion Eestimate of E such that the number of significant singular values of Eestimate is still l0.
We will explain why this assumption leads to reasonable estimations of the missing values. The number of nonzero singular values of E′ is equal to the rank of E′. However, the singular values also tell us how 'close' E′ is to a matrix of lower rank. If the number of significant singular values is l0, then we can say that E′ is close to a matrix of rank l0, i.e. a matrix with l0 independent rows. We explained that FRAM is a method that tries to find a completion Eestimate of E with still l0 significant singular values. In this way Eestimate is also close to a matrix of rank l0, and consequently to a matrix with l0 independent rows. We could also say that the dimension of the row space of Eestimate is almost equal to l0. In that way the completed rows of Eestimate, the rows that contained missing values, are near to being dependent on the full rows, the rows of E′. If we want the number of significant singular values to be equal to l0, then we have to find Eestimate such that σ_{l0+1}, σ_{l0+2}, . . . , σr (the singular values of Eestimate) are minimal, where r is the rank of Eestimate.
So we want to find a matrix Eestimate such that σ_{l0+1}, σ_{l0+2}, . . . , σr are as small as possible. Equivalently, the sum \(\sum_{k=l_0+1}^{N} \sigma_k^2\) should be minimized. From [3] we have an algorithm to find Eestimate for which \(\sum_{k=l_0+1}^{N} \sigma_k^2\) is minimized. The algorithm works iteratively; in every step \(\sum_{k=l_0+1}^{N} \sigma_k^2\) becomes smaller. Hence we need a stopping criterion for the algorithm; here we again use the Frobenius norm, see section 6.3.1. Now we will explain the algorithm and show that

\[
\sum_{k=l_0+1}^{N} \sigma_{k,E_p}^2 \;\geq\; \sum_{k=l_0+1}^{N} \sigma_{k,E_{p+1}}^2.
\]
We will follow the approach of [3] and [12]. Because we need to compute the SVD of E, we have to assign specific numbers to the empty entries. These numbers could be zeros (according to section 4) or numbers coming from another method; the resulting matrix is called E0.
Let E_{p+1} = E_p + X_{p+1}, with X_{p+1} = (x_{ij})_{i,j=1}^{M,N}, where x_{ij} = 0 if element (i, j) is known and x_{ij} is a free variable if element (i, j) is missing in E. Let x = (x_{i_1 j_1}, x_{i_2 j_2}, . . . , x_{i_z j_z})^T denote the z × 1 vector whose entries are indexed by the coordinates of the missing values in E. In this way there exists a z × z real valued symmetric nonnegative matrix B_p such that

\[
x^T B_p x = \sum_{k=l_0+1}^{N} v_{p,k}^T X_{p+1}^T X_{p+1} v_{p,k},
\]

where v_{p,k} is the eigenvector of A_p = E_p^T E_p corresponding to λ_{k,A_p}. Writing F(i, j) for the M × N matrix whose (i, j) entry is 1 and whose other entries are 0, the (s, t) entry of B_p is given by

\[
(B_p)_{st} = \frac{1}{2} \sum_{k=l_0+1}^{N} v_{p,k}^T \big( F(i_s, j_s)^T F(i_t, j_t) + F(i_t, j_t)^T F(i_s, j_s) \big) v_{p,k}, \qquad s, t = 1, \ldots, z.
\]
With A_p = E_p^T E_p we then compute

\[
\sum_{k=l_0+1}^{N} v_{p,k}^T E_{p+1}^T E_{p+1} v_{p,k}
= \sum_{k=l_0+1}^{N} v_{p,k}^T (E_p + X_{p+1})^T (E_p + X_{p+1}) v_{p,k}
= \sum_{k=l_0+1}^{N} v_{p,k}^T (A_p + E_p^T X_{p+1} + X_{p+1}^T E_p + X_{p+1}^T X_{p+1}) v_{p,k}
= \sum_{k=l_0+1}^{N} \lambda_{k,A_p} + x^T B_p x + 2 w_p^T x,
\]

where w_p = (w_{p,1}, w_{p,2}, \ldots, w_{p,z})^T and

\[
w_{p,t} = \sum_{k=l_0+1}^{N} v_{p,k}^T E_p^T F(i_t, j_t) v_{p,k}, \qquad t = 1, \ldots, z.
\]
In the above, λ_{k,A_p} is the kth eigenvalue of A_p under the ordering λ1 ≥ λ2 ≥ . . . ≥ λr. We know that \(\sum_{k=l_0+1}^{N} v_{p,k}^T E_{p+1}^T E_{p+1} v_{p,k}\) is always nonnegative. If we want to minimize this expression, we have to minimize x^T B_p x + 2 w_p^T x. Differentiating this expression gives 2 x^T B_p + 2 w_p^T; setting this equal to zero shows that the minimum is achieved when

B_p x_{p+1} = −w_p.
In [3] it is proved that B_p is always nonsingular; consequently this equation can be solved easily. The vector x_{p+1} gives the entries for X_{p+1}. Because A_p = E_p^T E_p, the kth eigenvalue of A_p is the square of the kth singular value of E_p, i.e.

σ²_{k,E_p} = λ_{k,A_p}.

From A_p v_{p,k} = λ_{k,A_p} v_{p,k} it follows that

v_{p,k}^T A_p v_{p,k} = λ_{k,A_p} = σ²_{k,E_p}.
Using this and the fact that E_{p+1} = E_p + X_{p+1}, we have

\[
\sum_{k=l_0+1}^{N} \sigma_{k,E_p}^2
= \sum_{k=l_0+1}^{N} v_{p,k}^T A_p v_{p,k}
= \sum_{k=l_0+1}^{N} v_{p,k}^T (E_p + 0)^T (E_p + 0) v_{p,k}
\geq \sum_{k=l_0+1}^{N} v_{p,k}^T (E_p + X_{p+1})^T (E_p + X_{p+1}) v_{p,k}
= \sum_{k=l_0+1}^{N} v_{p,k}^T A_{p+1} v_{p,k}
\geq \sum_{k=l_0+1}^{N} \lambda_{k,A_{p+1}}
= \sum_{k=l_0+1}^{N} \sigma_{k,E_{p+1}}^2.
\]
The first ≥ sign comes from the fact that x_{p+1} minimizes the expression; the second ≥ sign comes from the Ky Fan characterization [4]. This characterization will not be explained in this thesis; we refer to [4]. So we see that indeed in every iteration \(\sum_{k=l_0+1}^{N} \sigma_{k,E_p}^2\) becomes smaller.
In the next section we will summarize this algorithm step by step.
8.2 The algorithm
1. Compute l0, the number of significant singular values of E′, the matrix consisting of the complete rows of E.

2. Find the missing values of E and let x = (x_{i_1 j_1}, x_{i_2 j_2}, . . . , x_{i_z j_z})^T denote the vector indexed by their coordinates.

3. Replace the missing values with zeros or with values from another method; this gives the starting matrix E0.

4. Compute the SVD of Ep (starting with p = 0) and extract the right singular vectors v_{p,k} for k = l0 + 1, . . . , N.

5. Compute the matrix Bp and the vector wp as described in section 8.1.

6. Solve Bp x_{p+1} = −wp.

7. Let E_{p+1} = E_p + X_{p+1}, where x_{p+1} contains the entries for X_{p+1}, as explained in section 8.1.

8. Repeat this procedure (from step 4) until E_{p+1} does not change anymore (according to the Frobenius norm). The final matrix E_{p+1} is Eestimate.
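A matlab sketch of one iteration (steps 4 to 7). It uses that F(i_s, j_s)^T F(i_t, j_t) equals e_{j_s} e_{j_t}^T when i_s = i_t and is zero otherwise, so that B_p and w_p can be assembled directly from the right singular vectors; the vectors rows and cols are assumed to hold the coordinates of the z missing values:

% sketch: one FRAM iteration, solving Bp * x = -wp for the update x
[~, ~, V] = svd(Ep);              % right singular vectors of Ep
Vk = V(:, l0+1:N);                % the eigenvectors v_{p,k}, k = l0+1, ..., N
EV = Ep * Vk;                     % the products Ep*v_{p,k} as columns
z = numel(rows);
B = zeros(z, z);
w = zeros(z, 1);
for s = 1:z
    for t = 1:z
        if rows(s) == rows(t)     % (Bp)_st vanishes unless the entries share a row
            B(s, t) = Vk(cols(s), :) * Vk(cols(t), :).';
        end
    end
    w(s) = EV(rows(s), :) * Vk(cols(s), :).';
end
x = -B \ w;                       % minimizer of x'*B*x + 2*w'*x
for s = 1:z
    Ep(rows(s), cols(s)) = Ep(rows(s), cols(s)) + x(s);   % Ep+1 = Ep + Xp+1
end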
8.3 Numerical results
If we run the matlab code in appendix F 5 times for 1%, 5%, 10%, 15% and 20% of missing values, the means of the errors e (see section 3) are the following:
Percentage e
1% 0.1540
5% 0.3241
10% 0.4875
15% 0.5184
20% 0.7013
9 Conclusion
If we summarize all the numerical results, we have the following
Percentage  Zero repl  Row s mean  Col s mean  SVD 1   SVD 2   LS      FRAM
1%          0.3920     0.1110      0.1096      0.1243  0.6975  0.1461  0.1540
5%          0.8623     0.2636      0.2583      0.2478  1.5759  0.4163  0.3241
10%         1.2417     0.4140      0.3590      0.3713  1.9243  0.6585  0.4875
15%         1.5316     0.4633      0.4353      0.4567  2.2972  0.8785  0.5184
20%         1.7673     0.5016      0.5472      0.6825  2.4120  1.0643  0.7013

(e) Error e for all methods for different percentages of missing values
10 Discussion
In section 9 we saw that for all percentages the row and column sum mean methods work very well. For that reason one could say that it is not necessary to use such 'difficult' methods as proposed in this thesis to estimate missing values in DNA microarrays. However, this conclusion cannot be stated this simply. In our testing matrix we use only 1000 of the in total 5986 rows.
If we would use the full matrix, it is reasonable to expect that our methods would give different results. For example, if we look at the LS method, we see that this method gives very bad results. From section 7 we know that this method estimates missing values making use of rows that are very similar to the row in which the value is missing. If we use only one sixth of the total matrix, it is reasonable that in the removed part there are rows that are more similar to the 'target' row than the rows we use. So we can expect that the error for the LS method would be smaller if we would use the full matrix. Unfortunately we cannot test this, because matlab cannot handle such a large matrix (see appendix A).
Looking at the two methods based on the SVD, we see that the first method works significantly better than the second method. This is very peculiar, because in section 6 we have mentioned that in our opinion there is no clear reason why method 1 would work. Moreover, we have explained how method 2 works, but apparently this method does not give good results. Mathematically we still cannot explain why method 1 works this well. Probably we need more biological understanding of DNA microarrays to be able to explain this. In [3] it is mentioned that only a few eigengenes are needed to reconstruct the whole matrix E. This has been found by researchers who have worked with DNA microarrays; however, this fact has not been explained yet. Probably this can be done by looking at the biological functions of the eigengenes, but this falls outside the scope of this thesis.
Because we used the same error as in the literature, we can also compare our results to the results in the literature. In [3] the FRAM method has been tested against the row sum mean method. We see that their results for the row sum mean method are the same as our results. The results for FRAM differ: for small percentages of missing values (1%-10%) the results are the same, but for larger percentages FRAM gives smaller errors in [3] than we obtain here. The reason for this is probably that they have used the full matrix. So we see that FRAM would give better results when the full matrix is used. Unfortunately we have not been able to find literature in which the other methods have been tested.
A DNA microarray
Before performing the methods, we have to load the data. The data is available online [11]. We saved the data as an xlsx-file. The matrix is a 5986 × 17 matrix. This matrix is too big to work with in matlab, so we have to remove rows to obtain a matrix that matlab can handle.
We removed the last 4986 rows; the remaining matrix is of size 1000 × 17, i.e. a matrix with 1000 rows (genes) and 17 columns (samples). We save this as dnamicroarraytest.xlsx. We load the data in matlab as follows:
% reading the data
datadna = xlsread('dnamicroarraytest.xlsx');
% remove rows with any NaN (any empty entry is filled with NaN)
Efull = datadna(~any(isnan(datadna), 2), :);
E = Efull;
% create matrix with NaN's (randomly)
NUM = numel(E);
% percen = percentage of values that is missing;
% a number should be assigned to percen before running this
E(randperm(NUM, floor((NUM/100) * percen))) = NaN;
B Zero replacement method
The matlab code for the zero replacement method is the following

function error = zerorepl(E, Efull)
[M, N] = size(E);
% replace missing values by zeros
E(isnan(E)) = 0;
% find difference between estimation and full matrix
difference = Efull - E;
% compute error
error = (norm(difference, 'fro')) / sqrt(M) / mean2(Efull);

C Row and column sum mean method

C.1 Row sum mean method

The matlab code for the row sum mean method is the following

function error = rowsummeanmethod(E, Efull)
% get size of E
[M, N] = size(E);
% find NaN entries
E_nan = isnan(E);
for i = 1:M
    % replace NaN by mean of that row
    E(i, E_nan(i, :)) = nanmean(E(i, :));
end
% find difference between estimation and full matrix
difference = E - Efull;
% compute error
error = (norm(difference, 'fro')) / sqrt(M) / mean2(Efull);
C.2 Column sum mean method
The matlab code for the column sum mean method is the following

function error = columnsummeanmethod(E, Efull)
% get size of E
[M, N] = size(E);
% find NaN entries
E_nan = isnan(E);
for i = 1:N
    % replace NaN by mean of that column
    E(E_nan(:, i), i) = nanmean(E(:, i));
end
% find difference between estimation and full matrix
difference = E - Efull;
% compute error
error = (norm(difference, 'fro')) / sqrt(M) / mean2(Efull);
D Singular Value Decomposition methods
D.1 Estimation method 1 using SVD

The matlab code for the first singular value decomposition method is the following

function error = svd1(E, Efull, iter)
[M, N] = size(E);
% replace NaN by zero
Ep = E;
Ep(isnan(E)) = 0;
err = 1;
index = find(isnan(E));
while (err > 0.01 && iter > 0)
    % compute svd of Ep
    [U, S, V] = svd(Ep);
    r = rank(Ep);
    % fractions p_q of the squared singular values
    sigma = diag(S);
    sigmasq = sigma.^2;
    totalsingular = sum(sigmasq);
    fraction = sigmasq / totalsingular;
    % number of significant singular values
    sumfrac = cumsum(fraction);
    greater = find(sumfrac >= 0.7);
    s = greater(1);
    % compute the rank-s approximation
    [U1, S1, V1] = svds(Ep, s);
    Elow = U1 * S1 * transpose(V1);
    % replace the missing entries by the entries of the approximation
    for i = 1:length(index)
        Ep(index(i)) = Elow(index(i));
    end
    err = norm(Ep - Elow, 'fro');
    iter = iter - 1;
end
difference = Efull - Ep;
% compute error
error = (norm(difference, 'fro')) / sqrt(M) / mean2(Efull);
D.2 Estimation method 2 using SVD

The matlab code for the second singular value decomposition method is the following

function error = svd2(E, Efull, iter)
[M, N] = size(E);
% replace NaN by zero
Emiss = E;
Ep = E;
Ep(isnan(Emiss)) = 0;
err = 1;
while (err > 0.01 && iter > 0)
    % compute rank and SVD of Ep
    r = rank(Ep);
    [U, S, V] = svds(Ep, r);
    T = abs(V);
    for i = 1:M
        % for each row find the missing values
        indexrow = find(isnan(Emiss(i, :)));
        x = transpose(E(i, :));
        % delete from v_1, ..., v_r the entries corresponding to the
        % missing values
        for j = 1:length(indexrow)
            T((indexrow(j) - (j - 1)), :) = [];
            x((indexrow(j) - (j - 1))) = [];
        end
        % compute coefficients
        a = T \ x;
        y = E(i, :);
        x = transpose(y);
        [U, S, V] = svds(Ep, r);
        T = abs(V);
        % compute estimations for the missing values
        for j = 1:length(indexrow)
            q = T(indexrow(j), :);
            x(indexrow(j)) = transpose(a) * transpose(q);
        end
        E(i, :) = transpose(x);
    end
    err = norm(Ep - E, 'fro');
    Ep = E;
    iter = iter - 1;
end
difference = Efull - Ep;
% compute error
error = (norm(difference, 'fro')) / sqrt(M) / mean2(Efull);
E Least Squares imputation method
For the LS method we have described an algorithm to find t (see section 7.3). We will not implement this algorithm in matlab; the best value for t will be found by looking at the results we get from the code. The matlab code for the least squares imputation method is the following
function error = LS(E, Efull, t)
% selecting t genes
[M, N] = size(E);
% replace NaN by zero
Emiss = E;
err = 1;
Ep = E;
Ep(isnan(Emiss)) = 0;
Elat = Ep;
numremoved = 0;
for i = 1:M
    % for each row find the missing values
    Ep = Elat;
    indexrow = find(isnan(Emiss(i, :)));
    numremoved = 0;
    numforremoved = 0;
    if isempty(indexrow) == 0