Applications of Correspondence Analysis
in Microarray Data Analysis
by Ruixia Mu
B.Eng., Tianjin University, China, 1983
A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
MASTER OF SCIENCE
in the Department of Mathematics and Statistics
© Ruixia Mu, 2008, University of Victoria
All rights reserved. This thesis may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author.
Supervisory Committee

Dr. Mary Lesperance, Supervisor
(Department of Mathematics and Statistics)
Dr. Bill Reed, Departmental Member
(Department of Mathematics and Statistics)
Dr. Julie Zhou, Departmental Member (Department of Mathematics and Statistics)
Dr. Caren C. Helbing, External Examiner (Department of Biochemistry and Microbiology)
ABSTRACT
Correspondence analysis is a descriptive and exploratory technique for the study of associations between variables. It is a visualization method for analyzing high-dimensional data via projections onto a low-dimensional subspace. In this thesis, we demonstrate the applicability of correspondence analysis to microarray data. We show that it can be used to identify important genes and treatment patterns by coordinating and projecting the genes and the experimental conditions. In addition, we estimate missing values in the gene expressions using the Expectation-Maximization (EM) algorithm and identify genes with large between-condition variability using the projections of the genes and the conditions. To demonstrate its application, correspondence analysis is applied to various simulated data and to microarray data from the EPA (Environmental Protection Agency) studies. We conclude that correspondence analysis is a useful tool for analyzing the associations between genes and experimental conditions, for identifying important genes, and for estimating missing values.
Table of Contents

Supervisory Committee
Abstract
Table of Contents
Acknowledgments
1 Introduction
1.1 Background on DNA Arrays
1.2 Data Sources
1.3 Summary of Thesis
2 Literature Review
2.1 Quality Control for DNA Microarrays with Replicates
2.2 Estimation of Missing Values
2.2.1 Notation
2.2.2 SVDImpute
2.2.3 KNNImpute
2.2.4 LSImpute
2.2.5 OLSImpute
2.2.6 PLSImpute
2.2.7 LLSImpute
2.3 Identification of Differentially Expressed Genes
2.4 Application of Correspondence Analysis to Microarray Data
3 Methods
3.1 Correspondence Analysis
3.1.1 Introduction
3.1.2 Basic Concepts and Definitions
3.1.3 Algebra of Correspondence Analysis
3.1.4 Projection and Two-Dimensional Maps
3.2 Algorithm for Identifying DE Genes and Outliers
3.3 EM Algorithm for Missing Value Estimation
4 Simulation Study
4.1 Determination of Simulation Models
4.2 Simulation Experiments
4.2.1 Experiment #1
4.2.2 Experiment #2
4.2.3 Experiment #3
4.2.4 Experiment #4
5 Application to Real Microarray Data
5.1 Datasets
5.2 Missing Value Estimation
5.3 Identification of Differentially Expressed Genes
6 Discussion and Conclusion
6.1 Application to Elucidating the Association between Genes and Treatments
6.2 Application to Imputing Missing Values
6.3 Application to Identifying DE Genes or Outliers
Bibliography
Appendix
A Results of Simulations
A.1 Graphs of Correspondence Analysis for Experiment #1
A.2 Results of Simulation Studies
B Script Files
B.1 Computational Program of Correspondence Analysis
B.1.1 Original CA computation
B.1.2 Modified CA computation
B.1.3 Computational program of DE genes identification
B.2 Programs of Missing Value Imputation
B.2.1 CAEMImpute
B.2.2 KNNImpute
B.2.3 OLSImpute
B.2.4 PLSImpute
B.3 Programs of Data Simulation
B.3.1 Function for simulating ratio data
B.3.2 Function for simulating intensity data
B.3.3 Function for simulating DE data
B.4 Script Files of Simulation Experiments
B.4.1 Experiment 1
B.4.2 Experiment 2
B.4.3 Experiments 3 and 4
List of Tables

3.1 Algebra of correspondence analysis.
3.2 Matrix of artificial data.
3.3 Relative frequencies of smoking categories within staff groups.
4.1 Numbers of distinct genes in each of the data sets to be examined.
4.2 Cut points and percentages of genes having high intensities in each of the treatments in the EPAA study.
4.3 Estimates of parameters for log-normal and uniform distributions for each data set, and averages over the data sets.
4.4 Numbers of genes at each of the time points.
4.5 Estimates of parameters for normal and exponential distributions for each data set, and averages over the data sets.
4.6 Factors and levels for simulation Experiment #1.
4.7 Proportions of inertia displayed in the first two dimensions of CA.
4.8 Factors and levels for simulation Experiment #2.
4.9 Ratios of the ExtraGenes.
5.1 Summary of some characteristics of the EPAA and EPAB intensity data from brain tissues at 24 hours.
5.2 RMSEs from CAEMImpute and KNNImpute for the intensity data with different percentages of missing values.
A.1 Effect of the sample size on the TPR and FPR of the DE genes identified.
A.2 Effect of the proportion of DE genes on the TPR and FPR of the DE genes identified.
A.3 Effect of the number of treatments in which a DE gene is multiply differentially expressed on the TPR and FPR of the DE genes identified.
A.4 Effect of the proportion of multiply expressed genes on the TPR and FPR of the DE genes identified.
A.5 Effect of the proportion of down-regulated genes on the TPR and FPR of the DE genes identified.
A.6 TPR and FPR for Experiment #3.
A.7 TPR and FPR for Experiment #4-1.
A.8 RMSE for Experiment #4-2.
C.1 Missing value imputations for the EPAA study.
C.2 Missing value imputations for the EPAB T3 study.
C.3 Missing value imputations for the EPAB T4 study.
C.4 Names and ratios of DE genes identified from the EPAA data.
C.5 Names and ratios of DE genes identified from the EPAB T3 data.
List of Figures

3.1 Two-dimensional displays obtained by correspondence analysis of the smoking habit data. A: the asymmetric map; B: the symmetric map.
3.2 Two-dimensional displays with confidence ellipses for simulated microarray experimental data.
4.1 Histograms for the EPAA study data at t = 24 hours.
4.2 Histograms for the EPAA study data at t = 48 hours.
4.3 Histograms for the EPAA study data at t = 96 hours.
4.4 Plots for the EPAA study, the control at t = 24 hours.
4.5 Plots for the EPAA study, Methimazole at t = 24 hours.
4.6 Plots for the EPAA study, Propylthiouracil at t = 24 hours.
4.7 Plots for the EPAA study, Perchlorate at t = 24 hours.
4.8 Plots for the EPAA study, the control at t = 48 hours.
4.9 Plots for the EPAA study, Methimazole at t = 48 hours.
4.10 Plots for the EPAA study, Propylthiouracil at t = 48 hours.
4.11 Plots for the EPAA study, Perchlorate at t = 48 hours.
4.12 Plots for the EPAA study, the control at t = 96 hours.
4.13 Plots for the EPAA study, Methimazole at t = 96 hours.
4.14 Plots for the EPAA study, Propylthiouracil at t = 96 hours.
4.15 Plots for the EPAA study, Perchlorate at t = 96 hours.
4.16 Histograms and normal Q-Q plots for the LPS log2(ratio) data.
4.17 Plots for non-regulated genes of the LPS log2(ratio) data.
4.18 Plots for down-regulated genes of the LPS log2(ratio) data.
4.19 Plots for up-regulated genes of the LPS time-course log2(ratio) data.
4.20 Plots of ATPR against the proportions of DE genes.
4.21 Plots of AFPR against the proportions of DE genes.
4.22 Plots of ATPR against the number of treatments.
4.23 Plots of AFPR against the number of treatments.
4.24 Plots of ATPR against the proportions of MDE genes.
4.25 Plots of AFPR against the proportions of MDE genes.
4.26 Plots of ATPR against the proportions of down-regulated genes.
4.27 Plots of AFPR against the proportions of down-regulated genes.
4.28 Plots of (a) ATPR and (b) AFPR in Experiment #3 against the number of replicates.
4.29 Plots of (a) ATPR and (b) AFPR in Experiment #4 against the number of replicates.
4.30 Plots of RMSE against the number of treatments: (a) the KNNImpute method; (b) the OLSImpute method.
4.31 Plots of RMSE against the number of treatments for four imputation methods.
5.1 CA plots for the EPAA study in the first two dimensions.
5.2 CA plots for the EPAB T3 study in the first two dimensions.
5.3 CA plots for the EPAB T4 study in the first two dimensions.
A.1 CA plots for N = 100, M = 3, and PDE = 0.05.
A.2 CA plots for N = 100, M = 3, and PDE = 0.10.
A.4 CA plots for N = 100, M = 4, and PDE = 0.05.
A.5 CA plots for N = 100, M = 4, and PDE = 0.10.
A.6 CA plots for N = 100, M = 4, and PDE = 0.15.
A.7 CA plots for N = 100, M = 5, and PDE = 0.05.
A.8 CA plots for N = 100, M = 5, and PDE = 0.10.
A.9 CA plots for N = 100, M = 5, and PDE = 0.15.
A.10 CA plots for N = 500, M = 3, and PDE = 0.05.
A.11 CA plots for N = 500, M = 3, and PDE = 0.10.
A.12 CA plots for N = 500, M = 3, and PDE = 0.15.
A.13 CA plots for N = 500, M = 4, and PDE = 0.05.
A.14 CA plots for N = 500, M = 4, and PDE = 0.10.
A.15 CA plots for N = 500, M = 4, and PDE = 0.15.
A.16 CA plots for N = 500, M = 5, and PDE = 0.05.
A.17 CA plots for N = 500, M = 5, and PDE = 0.10.
A.18 CA plots for N = 500, M = 5, and PDE = 0.15.
A.19 CA plots for N = 1000, M = 3, and PDE = 0.05.
A.20 CA plots for N = 1000, M = 3, and PDE = 0.10.
A.21 CA plots for N = 1000, M = 3, and PDE = 0.15.
A.22 CA plots for N = 1000, M = 4, and PDE = 0.05.
A.23 CA plots for N = 1000, M = 4, and PDE = 0.10.
A.24 CA plots for N = 1000, M = 4, and PDE = 0.15.
A.25 CA plots for N = 1000, M = 5, and PDE = 0.05.
A.26 CA plots for N = 1000, M = 5, and PDE = 0.10.
Acknowledgments
I would like to begin by thanking my supervisor Dr. Mary Lesperance for her patience, support, and guidance in countless ways. I feel very fortunate to have had Dr. Lesperance as my supervisor. I also would like to thank Dr. Caren C. Helbing and all members of our research team for their support and friendship.
Many thanks go to the faculty and staff of the Department of Mathematics and Statistics of the University of Victoria for providing financial support, a stimulating work environment and inspiration.
Finally and most importantly, I would like to thank my husband and daughter for their love, help, and support.
Chapter 1

Introduction
DNA array technology was developed to study the expression levels of thousands of genes in a biological sample simultaneously. It has been applied in numerous studies over a broad range, from the study of gene expression in yeast under different environmental conditions to the comparison of gene expression profiles for tumors from cancer patients (Dudoit et al., 2002). In addition to the potential to help advance biological knowledge at a genomic scale, microarrays have very important applications in pharmaceutical and clinical research (Draghici, 2003). However, one remaining challenge is how to analyze and interpret the large amounts of data generated by microarray experiments. A common task in analyzing microarray data is to identify genes with altered expression in samples drawn from two different tissues, time points, or conditions. As biological systems are usually dynamic and interactive, it is better to use a distribution-free method to capture and present the diverse expression patterns.
Correspondence analysis (CA) is a descriptive/exploratory technique designed for investigating the association between row and column variables by graphically displaying the patterns and structures existing in the data. It has been widely applied to categorical data in the social sciences, environmental science, and marketing research (Blasius et al., 1998). The method has recently been introduced to the analysis of microarray data (Kishino and Waddell, 2000; Fellenberg et al., 2001; Tan et al., 2004; Yano et al., 2006) to uncover the associations between genes and tissues/treatments based on gene expressions. One major requirement of CA is that the data matrix have no missing values. In this study, we explore the feasibility of applying the CA technique to the identification of differentially expressed (DE) genes, the estimation of missing values, and the examination of the quality of replicate microarrays.
1.1 Background on DNA Arrays
A new area of genome research called functional genomics has developed as researchers continually increase the rate at which genomes are sequenced. Functional genomics aims to discover the biological function of DNA sequences. As the complete DNA sequences of many genomes are already known, an important and difficult task is to define the role of each gene and understand how the genome functions as a whole. DNA array technology was developed to exploit DNA sequence data and yield information about gene expression levels for entire genomes (Draghici, 2003).
DNA consists of two strands of linked nucleotides, each carrying one of four bases: adenine (A), thymine (T), guanine (G), and cytosine (C). These four bases form the genetic alphabet. The genetic information is encoded in strings of variable length formed with letters from this alphabet and is stored in various very long strings of DNA. The two strands of the DNA molecule are held together by hydrogen bonds according to the base-pairing rules: G pairs with C and A pairs with T. For example, if one strand has the sequence A-G-C-C-G-T-A-T-C, the other strand will have the complementary sequence T-C-G-G-C-A-T-A-G. Various substrings of DNA constitute functional units called genes, which contain the information necessary to construct proteins, the ultimate expression of the genetic information. The expression of the genetic information stored in the DNA molecule occurs in two stages: transcription and translation. During transcription, the genetic information contained in a gene from the DNA is converted into messenger ribonucleic acid (mRNA). mRNA is a single-stranded complementary copy of the base sequence in the DNA molecule, with the base uracil (U) replacing thymine (T) (e.g. U-C-G-G-C-A-U-A-G). Translation is the process of converting the message from the 4-letter RNA alphabet to the 20-letter alphabet of amino acids used to build proteins (Berrar et al., 2003), according to a code in which each amino acid is represented by three bases in mRNA. The types and amounts of protein generated determine the function of the cell, and changes in gene expression levels can give rise to major changes at the organismic level and induce illnesses such as cancer. Therefore, comparing the expression levels of various genes between different conditions is of extreme interest to life scientists.
Microarray technology has been developed to investigate the properties of gene expression at the transcription or translation levels, as well as the subcellular localization of gene products. Commonly, attention has focused primarily on expression at the transcription stage, i.e. on mRNA levels, because mRNA levels sensitively reflect the type and state of the cell. Microarrays make use of a key property of the DNA duplex, the sequence complementarity of the two strands, which makes hybridization, a chemical reaction, possible. During hybridization, two complementary nucleic acid strands, for example A-G-C-C-G-T-A-T-C and T-C-G-G-C-A-T-A-G, combine to form a double-stranded molecule according to the base-pairing rules. To utilize the hybridization property of DNA, complementary DNA (cDNA) (e.g. A-G-C-C-G-T-A-T-C) is obtained from mRNA (e.g. A-G-C-C-G-U-A-U-C) by reverse transcription.
There are two main types of cDNA microarrays: nylon membrane-based filter arrays and chemically coated glass-based arrays (Berrar, 2003). The nylon membrane arrays are hybridized with radioactively labeled cDNA targets, and the glass arrays are hybridized with fluorescent dye-labeled targets. In a microarray experiment, samples of DNA clones with known sequence content are spotted and immobilized onto a glass slide or nylon filter. The mRNA extracted from the tissue cell under study is purified, reverse transcribed into cDNA, and labeled, for example with fluorescent dyes (e.g. the red-fluorescent dye Cy5 and the green-fluorescent dye Cy3). Labeled cDNA hybridizes to the spots containing complementary sequences on the array. After hybridization, the radioactive or fluorescent signal intensities are measured using a phosphorimager or laser scanner, respectively. One intensity is measured on each spot for the radiation-labeled array (one-channel array), while two intensities are measured on each spot for the fluorescence dye-labeled array (two-channel array). In both cases, the intensities are used as the expression levels of genes in the sample under study (Draghici, 2003). In this study, we primarily use data from one-channel arrays, but the results are applicable to two-channel arrays.
1.2 Data Sources
In this thesis, two kinds of data, real and simulated microarray experimental data, are used to demonstrate the applications of CA. The real microarray data are three of a series of data sets collected for the US Environmental Protection Agency (EPA) by Dr. Caren Helbing's lab. The EPA series consists of two sets of experiments, EPAA and EPAB, using Xenopus laevis frog tadpole brain, leg, and tail tissues. In the EPAA studies, these tissues were exposed to three different thyroid hormone (TH) inhibitors, methimazole (Meth), propylthiouracil (PTU), and perchlorate (Per), along with a control condition (Con), for 24, 48, and 96 hours. The EPAB data are obtained from the same tissues exposed to the thyroid hormones thyroxine (T4) and 3,5,3′-triiodothyronine (T3), along with their corresponding control conditions, for 24, 48, and 96 hours (Helbing et al., 2007). The three data sets used in our study are the brain tissue data of the EPAA, EPAB (T3), and EPAB (T4) studies at 24 hours. More details about the experimental procedures and data processing are given in Chapter 5.

The simulated data are generated according to the distributions and the factors encountered in real biological microarray experiments. The main factors are the numbers of genes and treatments, as well as the proportion of DE genes. More details of the simulation study are given in Chapter 4.
1.3 Summary of Thesis
This thesis is organized into six main chapters. Chapter 1 provides a brief scientific background on microarray technology and a summary of the data sets used in this study.

Chapter 2 provides a literature review of different statistical methods used to identify differentially expressed genes, estimate missing values, and investigate the quality of replicated microarray data, as well as of applications of correspondence analysis to microarray data.

Theoretical details of correspondence analysis are described in Chapter 3. The algorithm used for missing value estimation, the computational methods proposed for identifying DE genes, and the quality assessment of replicated arrays are also covered in this chapter.

Chapters 4 and 5 give details about the implementation of each method listed in Chapter 3. The simulation studies together with their analysis are presented in Chapter 4, and the application to real microarray data is provided in Chapter 5. Chapter 6 concludes with a discussion.
Chapter 2
Literature Review
2.1 Quality Control for DNA Microarrays with Replicates
The most important task before analyzing microarray experimental data is to review the quality of the data obtained from each microarray experiment, on a global array basis and also on an individual spot basis. Since a microarray experiment involves a large number of steps, from experimental setup, including array preparation and RNA sample preparation, through hybridization, to image analysis, cDNA microarray data often contain unusual observations. Quality-assured data are required for any reliable high-level analysis. In addition, quality control allows the experimenter to monitor and improve the laboratory procedures.
Dumur et al. (2004) studied quality control criteria used during sample preparation,
e.g. RNA sample isolation and cDNA synthesis products. Many methods, such as
image quantification, data filtering, transformation, and normalization of arrays, have
been proposed for improving data quality on the global array basis (Smyth et al.,
2002; Berrar et al., 2003; Draghici, 2003). Data filtering methods assign pass/fail
criteria to the quality control measures computed, which are used to filter out unusual
spots, such as overlapping spots, contamination, saturated measurements, and to detect
poor quality spots. Some simple filtering methods are applied for quality control on
replicated expression levels of a particular gene. For example, some researchers set a
cut-off value for the standard deviation for each gene between replicate arrays to filter
out unreliable genes (Helbing et al., 2007; Mookherjee et al., 2006).
Gottardo et al. (2006) proposed a way to filter and flag gene and array outliers based on the parameter estimates from a Bayesian hierarchical model. The Bayesian hierarchical model is developed to estimate the expression levels of genes in a robust way, in which outliers are modeled using a t-distribution; the non-constant variances and design effects are used to measure the quality of replicates. They show that this model works well with four replicates of microarrays, but it requires considerable computing time since the estimation of parameters in the model is performed using Markov chain Monte Carlo methods.
2.2 Estimation of Missing Values
Gene expression data from microarray experiments usually have missing values for various reasons, e.g. image corruption, insufficient resolution, or simple dust or scratches on the slide (Troyanskaya et al., 2001). However, many multivariate statistical analysis methods, such as correspondence analysis (CA), principal components analysis (PCA), and singular value decomposition (SVD) (Alter et al., 2002), cannot be applied to data with missing values. Missing value estimation is therefore an important preprocessing step (Berrar et al., 2003) in microarray data analysis. Because of the cost and time required to repeat a microarray experiment, several simple methods exist for dealing with missing values, such as ignoring the entries containing missing values, replacing missing values by zero, or imputing missing values using expression averages or medians across all arrays (Berrar, 2003). Current research shows that missing value estimation can be significantly improved by considering the correlation between data. Several methods, including the K-nearest neighbors method (KNNImpute), the SVD method (SVDImpute) (Troyanskaya et al., 2001), least squares imputation (LSImpute) (Bo et al., 2004), ordinary least squares (OLSImpute), partial least squares (PLSImpute) (Nguyen et al., 2004), and local least squares imputation (LLSImpute) (Kim et al., 2005), have been proposed.
2.2.1 Notation
Here, we introduce some notation commonly used with imputation algorithms.

Let $A$ be an $n \times m$ matrix representing microarray data with $n$ genes and $m$ arrays, and let $a_{kj}$ be the expression of gene $k$ under array $j$. Denote the expression matrix without gene $i$ as $A_{-i}$ and the expression matrix without array $j$ as $A_{(-j)}$.

A particular gene, say gene $i$, with missing values to be imputed is called the target gene, and its expression is denoted by a vector $a_i$. Let a length-$l$ vector $v$ denote the set of arrays in which the target gene has missing values; thus, the available expression values and the missing values of the target gene $i$ are expressed as $a_{i(-v)}$ and $a_{i(v)}$, respectively (if gene $i$ has no missing values, $a_{i(v)}$ is an empty vector).

All genes with available values in an array of $v$ corresponding to a missing value $a_{iv}$ make up the set of candidate genes for estimating the missing value $a_{iv}$, denoted by $X$. Let $G$ represent the matrix of the original expressions of $K$ significant candidate genes for a particular target gene $i$, let $G_{(-v)}$ denote the matrix $G$ without the arrays $v$, and let $g_{(v)}$ denote the candidate genes' expressions under array $v$, given by (2.1), (2.2), and (2.3), respectively:

$$ G = \begin{pmatrix} g_{1,1} & \cdots & g_{1,m} \\ \vdots & \ddots & \vdots \\ g_{K,1} & \cdots & g_{K,m} \end{pmatrix}_{K \times m}, \qquad (2.1) $$

$$ G_{(-v)} = \begin{pmatrix} g_{1,1} & \cdots & g_{1,v_j-1} & g_{1,v_j+1} & \cdots & g_{1,m} \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ g_{K,1} & \cdots & g_{K,v_j-1} & g_{K,v_j+1} & \cdots & g_{K,m} \end{pmatrix}_{K \times (m-l)}, \qquad (2.2) $$

$$ g_{(v)} = (g_{1,v} \; g_{2,v} \; \cdots \; g_{K,v})^T, \qquad (2.3) $$

for $1 \le j \le l$.
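As a concrete illustration of this notation, the sketch below builds $A$, $v$, $a_{i(-v)}$, $G$, $G_{(-v)}$, and $g_{(v)}$ for a small hypothetical expression matrix in numpy. This is not the thesis's own appendix code; the data, variable names, and the choice to take all fully observed genes as candidates are our assumptions.

```python
import numpy as np

# Hypothetical 5-gene x 4-array expression matrix; np.nan marks missing entries.
A = np.array([
    [1.0, 2.0, np.nan, 4.0],   # target gene i = 0, missing in array v = [2]
    [1.1, 2.1, 3.1, 4.1],
    [0.9, 1.9, 2.9, 3.9],
    [2.0, 1.0, 0.5, 0.2],
    [1.2, 2.2, 3.2, 4.2],
])

i = 0                                     # index of the target gene
v = np.where(np.isnan(A[i]))[0]           # arrays where gene i is missing
obs = np.setdiff1d(np.arange(A.shape[1]), v)

a_i_obs = A[i, obs]                       # a_{i(-v)}: available target values
G = np.delete(A, i, axis=0)               # drop the target gene itself
G = G[~np.isnan(G).any(axis=1)]           # candidate genes must be fully observed
G_minus_v = G[:, obs]                     # G_{(-v)}: candidates without arrays v
g_v = G[:, v]                             # g_{(v)}: candidate values in arrays v
```

With this toy matrix, $G$ is $4 \times 4$, $G_{(-v)}$ is $4 \times 3$, and $g_{(v)}$ holds the candidates' values in the single missing array.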
2.2.2 SVDImpute
SVDImpute (Troyanskaya et al., 2001) applies singular value decomposition to obtain a set of $K$ significant genes, called eigengenes. The expressions of these eigengenes can be linearly combined to approximate the expression of all genes in the data set. To perform SVD on the incomplete matrix, replace all missing values in the original matrix $A$ with the corresponding row averages, denote the result as $A'$, and then perform SVD on $A'$ to obtain

$$ A'_{n \times m} = U_{n \times n} \Lambda_{n \times m} V^T_{m \times m}, \qquad (2.4) $$

where $U^T U = I_{n \times n}$ and $V^T V = I_{m \times m}$, i.e. $U$ and $V$ are orthogonal. The columns of $U$ are the left singular vectors (gene coefficient vectors); $\Lambda$ is a diagonal matrix containing the singular values; and the rows of $V^T$ are the right singular vectors (expression level vectors). By sorting the eigengenes based on their corresponding singular values in descending order, the first $K$ significant genes are selected, with $K < m - l$, to form the matrix $G$.

A linear regression model with the target gene as the response variable and the expressions of $G_{(-v)}$ as predictors is fitted, as in Equation (2.5). The missing value $a_{iv}$ is then predicted using the fitted regression model, as in Equation (2.6):

$$ a^T_{i(-v)} = \left( \mathbf{1} \;\; G^T_{(-v)} \right) \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_K \end{pmatrix}, \qquad (2.5) $$

$$ \hat{a}_{iv} = \left( 1 \;\; g^T_{(v)} \right) \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_K \end{pmatrix}. \qquad (2.6) $$

This procedure is repeated on the new matrix $A'$, in which all imputed missing values are replaced by the new estimates, until the total change of the matrix is not greater than an empirical threshold of 0.01, i.e.

$$ \sum_{i=1}^{n} \sum_{j=1}^{m} \left| a'^{(n_{it}+1)}_{i,j} - a'^{(n_{it})}_{i,j} \right| \le 0.01, $$

where $n_{it}$ is the number of iterations.
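The iteration described above can be sketched in numpy as follows. This is an illustrative simplification under our own naming, not the thesis's appendix code: it fills missing cells from a rank-$K$ truncated SVD at each step, which plays the role of the eigengene regression of Equations (2.5)-(2.6), and stops on the total-change criterion.

```python
import numpy as np

def svd_impute(A, K=2, tol=0.01, max_iter=100):
    """Iterative SVD-based imputation sketch: refill missing entries from a
    rank-K SVD reconstruction until the total change is at most `tol`."""
    A = np.asarray(A, dtype=float)
    miss = np.isnan(A)
    # Initialize missing values with the corresponding row averages.
    filled = np.where(miss, np.nanmean(A, axis=1, keepdims=True), A)
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :K] * s[:K]) @ Vt[:K, :]    # rank-K reconstruction
        new = np.where(miss, approx, filled)       # update missing cells only
        if np.abs(new - filled).sum() <= tol:      # total-change stopping rule
            return new
        filled = new
    return filled

# Toy usage: a rank-1 matrix with one entry knocked out.
u, w = np.array([1., 2., 3., 4.]), np.array([1., 2., 4.])
M = np.outer(u, w)
M[0, 1] = np.nan          # true value is 2.0
B = svd_impute(M, K=1)
```

On noiseless low-rank data like this, the iteration recovers the deleted entry closely while leaving observed entries untouched.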
2.2.3 KNNImpute
KNNImpute (Troyanskaya et al., 2001) is proposed as a robust and sensitive method for missing value estimation. This method aims to find the $K$ genes from the candidate gene set $X_{(-v)}$ whose expression levels are most similar to $a_{i(-v)}$. The similarity is measured by the Euclidean distance defined as

$$ d^2_{iz} = \sum_{j=1}^{m-l} (a_{ij} - x_{zj})^2, \qquad (2.7) $$

where $m - l$ is the number of available expression levels of the target gene $i$, and $a_{ij}$ and $x_{zj}$ are the expression levels of gene $i$ and gene $z$ in array $j$, for $z = 1, \cdots, Z$ and $z \ne i$, with $Z$ denoting the number of candidate genes. The estimate for a missing value $a_{iv}$ is the weighted average of the expression values of these $K$ closest genes in array $v$,

$$ \hat{a}_{iv} = \sum_{k=1}^{K} w_k g_{kv}, \qquad (2.8) $$

where $w_k = 1/(d_{ik} C)$ are the weights and $C = \sum_k d_{ik}^{-1}$ is the normalizing constant, for $k = 1, \ldots, K$. The weights are proportional to the inverse of the distances, thus giving higher weights to the expression values of candidate genes more similar to the target gene $i$.

Troyanskaya et al. (2001) found that in most cases, KNNImpute, with $K$ in the range from 10 to 20, performs better and is more robust on non-time series data or on time-series data with no obvious global correlation structures. On the other hand, SVDImpute works well on time series data with low noise levels and with strong global correlation structures.
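Equations (2.7) and (2.8) can be sketched for a single missing entry as below; this is an illustrative numpy implementation under our own naming, not the thesis's appendix KNNImpute script, and the small guard against zero distances is our addition.

```python
import numpy as np

def knn_impute_one(a_i, X, v, K=2):
    """KNNImpute sketch for one missing entry a_{iv}: average the K candidate
    genes closest to the target (Euclidean distance over observed arrays),
    weighting each by the inverse of its distance, as in Eqs. (2.7)-(2.8)."""
    obs = np.setdiff1d(np.arange(len(a_i)), [v])
    d = np.sqrt(((X[:, obs] - a_i[obs]) ** 2).sum(axis=1))   # Eq. (2.7)
    d = np.maximum(d, 1e-12)          # guard: avoid division by a zero distance
    nearest = np.argsort(d)[:K]       # indices of the K closest candidates
    w = 1.0 / d[nearest]
    w = w / w.sum()                   # w_k = 1/(d_ik C) with C = sum(1/d_ik)
    return float(w @ X[nearest, v])   # Eq. (2.8)

# Toy usage: two near neighbors and one distant gene.
a_i = np.array([1.0, 2.0, np.nan, 4.0])
X = np.array([[1.1, 2.1, 3.0, 4.1],
              [0.9, 1.9, 3.2, 3.9],
              [10., 10., 10., 10.]])
est = knn_impute_one(a_i, X, v=2, K=2)
```

Here the two neighbors sit at (essentially) equal distances from the target, so the estimate is close to the plain average of their values in the missing array, 3.1.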
2.2.4 LSImpute
LSImpute (Bo et al., 2004) tries to adaptively exploit the correlations between genes and between arrays to estimate the missing values based on least squares regression models. The algorithm first uses two basic methods, LSImpute_gene and LSImpute_array, for imputation of missing values. It then produces weighted averages of the estimates from the LSImpute_gene and LSImpute_array methods.

LSImpute_gene considers the correlations between the candidate genes and the target gene $i$, in which the arrays are the observations. The first step in the LSImpute_gene method is to select, from the set of candidate genes, the $K$ genes most correlated with the target gene $i$ based on the absolute correlation values, to form the expression matrix $G$. The next step is to fit $K$ simple linear regression models, one for each of the $K$ selected genes against the available expression levels of the target gene $i$, as given in Equation (2.9):

$$ a^T_{i(-v)} = \left( \mathbf{1} \;\; g^T_{k(-v)} \right) \begin{pmatrix} \alpha_k \\ \beta_k \end{pmatrix}, \qquad (2.9) $$

where $g_{k(-v)}$ is the $k$th most correlated gene excluding the arrays $v$ in which the target gene $i$ has missing values, for $k = 1, \ldots, K$. From the $K$ models, $K$ estimates of the missing value $a_{iv}$ are obtained using Equation (2.10), and the final estimate is computed as a weighted average of the estimates using Equation (2.11):

$$ \hat{a}_{ivk} = \hat{\alpha}_k + \hat{\beta}_k g_{kv}, \qquad (2.10) $$

$$ \hat{a}_{iv} = \sum_{k=1}^{K} w_k \hat{a}_{ivk}, \qquad (2.11) $$

with

$$ w_k = \left( \frac{r^2_{a_i g_k}}{1 - r^2_{a_i g_k} + \epsilon} \right)^2, \qquad (2.12) $$

where $w_k$ is the weight of the gene $g_k$, $r_{a_i g_k}$ is the estimated correlation between the target gene $i$ and the $k$th gene in the set of $K$ selected genes, and $\epsilon = 10^{-6}$ guards against the case $r^2_{a_i g_k} = 1$.
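Equations (2.9)-(2.12) can be sketched as follows. This is an illustrative numpy version under our own naming, not Bo et al.'s implementation or the thesis's appendix code; normalizing the weights of Eq. (2.12) to sum to one is our choice.

```python
import numpy as np

def ls_gene_impute(a_i, X, v, K=3, eps=1e-6):
    """LSImpute_gene sketch: regress the target on each of the K most
    correlated candidate genes (Eq. 2.9), predict a_{iv} from each fit
    (Eq. 2.10), and combine with the correlation-based weights of Eq. (2.12)."""
    obs = np.setdiff1d(np.arange(len(a_i)), [v])
    y = a_i[obs]
    r = np.array([np.corrcoef(x[obs], y)[0, 1] for x in X])
    top = np.argsort(-np.abs(r))[:K]          # K most correlated candidates
    preds, wts = [], []
    for k in top:
        x = X[k, obs]
        beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # simple OLS slope
        alpha = y.mean() - beta * x.mean()
        preds.append(alpha + beta * X[k, v])            # Eq. (2.10)
        r2 = r[k] ** 2
        wts.append((r2 / (1.0 - r2 + eps)) ** 2)        # Eq. (2.12)
    wts = np.array(wts) / np.sum(wts)                   # normalized weights
    return float(np.dot(wts, preds))                    # Eq. (2.11)

# Toy usage: two perfectly correlated candidates plus one noise gene.
a_i = np.array([1., 2., np.nan, 4.])
X = np.array([[2., 4., 6., 8.],        # exactly 2 * target
              [1.5, 2.5, 3.5, 4.5],    # target + 0.5
              [5., 1., 7., 2.]])       # weakly related noise
est = ls_gene_impute(a_i, X, v=2, K=3)
```

The two perfectly correlated genes receive enormous weights under Eq. (2.12), so the noise gene's prediction is effectively ignored and the estimate lands at the true value 3.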
In the LSImpute_array method, multiple regression is applied based on the correlation between arrays, and now the genes are the observations. Assume that the target gene $i$ has $l$ missing values in the first $l$ arrays, $a_{i(v)} = [a_{i,1}, \ldots, a_{i,l}]$, and $m - l$ non-missing values, $a_{i(-v)} = [a_{i,l+1}, \ldots, a_{i,m}]$, where $m$ is the total number of arrays. To implement the LSImpute_array method, the missing values in the data set $A$ are substituted by the estimates from the LSImpute_gene method; the resulting matrix is denoted as $A'$, and the imputed expression vector $a_{i(v)}$ is denoted as $a'_{i(v)}$. Let $\bar{a}'_{(v)} = [\bar{a}_1, \cdots, \bar{a}_l]$ and $\bar{a}'_{(-v)} = [\bar{a}_{l+1}, \cdots, \bar{a}_m]$ be the vectors of column means of $A'$, where $\bar{a}_j$ is the mean expression of array $j$. Then the least squares estimate of $a_{i(v)}$ given $a_{i(-v)}$ is

$$ \hat{a}^T_{i(v)} = \bar{a}'^T_{(v)} + S_{a'_{i(v)} a_{i(-v)}} S^{-1}_{a_{i(-v)} a_{i(-v)}} \left( a^T_{i(-v)} - \bar{a}'^T_{(-v)} \right), \qquad (2.13) $$

with

$$ S_{a'_{i(v)} a_{i(-v)}} = \begin{pmatrix} s_{a'_{i,1} a_{i,l+1}} & \cdots & s_{a'_{i,1} a_{i,m}} \\ \vdots & \ddots & \vdots \\ s_{a'_{i,l} a_{i,l+1}} & \cdots & s_{a'_{i,l} a_{i,m}} \end{pmatrix}, \qquad (2.14) $$

$$ S_{a_{i(-v)} a_{i(-v)}} = \begin{pmatrix} s_{a_{i,l+1} a_{i,l+1}} & \cdots & s_{a_{i,l+1} a_{i,m}} \\ \vdots & \ddots & \vdots \\ s_{a_{i,m} a_{i,l+1}} & \cdots & s_{a_{i,m} a_{i,m}} \end{pmatrix}, \qquad (2.15) $$

and

$$ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}). \qquad (2.16) $$

In the above equations, the elements $s_{xy}$ of $S_{a'_{i(v)} a_{i(-v)}}$ and $S_{a_{i(-v)} a_{i(-v)}}$ are the empirical covariances between array $x$ and array $y$. Finally, a missing value $a_{iv}$ is estimated using the LSImpute_gene estimate $\hat{a}_{iv,\mathrm{gene}}$ and the LSImpute_array estimate $\hat{a}_{iv,\mathrm{array}}$ by

$$ \hat{a}_{iv} = p\, \hat{a}_{iv,\mathrm{gene}} + (1 - p)\, \hat{a}_{iv,\mathrm{array}}, \qquad (2.17) $$

where $p \in [0, 1]$ is a mixing coefficient. The value of $p$ is determined by minimizing the sum of squared errors obtained by re-estimating 5% of the known values while treating them as missing:

$$ \sum e_c^2 = \sum \left[ p\, e_{\mathrm{gene}} + (1 - p)\, e_{\mathrm{array}} \right]^2, \qquad (2.18) $$

where $e_{\mathrm{gene}} = \hat{a}_{iv,\mathrm{gene}} - a_{iv}$ and $e_{\mathrm{array}} = \hat{a}_{iv,\mathrm{array}} - a_{iv}$.
Here $a_{iv}$ is the observed value. This algorithm compares favorably with the KNNImpute method with respect to running time as the number of missing values increases (Bo et al., 2004).
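The mixing coefficient $p$ of Equation (2.17) admits a closed-form minimizer of Equation (2.18): setting the derivative in $p$ to zero gives $p = \sum e_{\mathrm{array}}(e_{\mathrm{array}} - e_{\mathrm{gene}}) / \sum (e_{\mathrm{array}} - e_{\mathrm{gene}})^2$, clipped to $[0, 1]$. The closed form is our derivation, not spelled out in the text; a sketch:

```python
import numpy as np

def mixing_coefficient(e_gene, e_array):
    """Minimize sum [p*e_gene + (1-p)*e_array]^2 over p in [0, 1] (Eq. 2.18),
    given the errors of both methods on known values treated as missing."""
    e_gene, e_array = np.asarray(e_gene, float), np.asarray(e_array, float)
    diff = e_array - e_gene
    denom = np.sum(diff ** 2)
    if denom == 0:                 # both methods err identically: any p works
        return 0.5
    p = np.sum(e_array * diff) / denom
    return float(np.clip(p, 0.0, 1.0))
```

As a sanity check, when the gene-based estimates are error-free the optimum puts all weight on them ($p = 1$), and symmetrically $p = 0$ when the array-based estimates are error-free.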
2.2.5 OLSImpute
OLSImpute (Nguyen et al., 2004) is similar to the regression method LSImpute_gene of LSImpute (Bo et al., 2004). Rather than using the absolute correlation as the criterion to select K significant candidate genes, OLSImpute selects the K genes based on Euclidean distance. An ordinary least squares regression is fitted for each of the K selected genes, in the same way as Equation (2.9) in the LSImpute_gene method. Each of the K estimates of a missing value a_{iv} is predicted using Equation (2.10). The final estimate of a_{iv} is the weighted average of the K separate estimates,

\hat{a}_{iv} = \sum_{k=1}^{K} w_k \hat{a}_{iv,k}.   (2.19)

The weights can be based on the distances used to select the K candidate genes, w_k = (1/d_k) \big/ \sum_{k=1}^{K} (1/d_k). If equal weight is desired for each of the K separate estimates, then w_k = 1/K.
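The two weighting schemes above can be sketched as follows. This is a minimal NumPy illustration with invented function names, not Nguyen et al.'s implementation.

```python
import numpy as np

def distance_weights(d):
    """Distance-based weights w_k = (1/d_k) / sum_k(1/d_k): closer
    candidate genes get larger weights, and the weights sum to one."""
    inv = 1.0 / np.asarray(d, dtype=float)
    return inv / inv.sum()

def ols_impute_combine(estimates, d=None):
    """Weighted average of the K separate OLS estimates, Equation (2.19).
    With d=None, every estimate receives equal weight 1/K."""
    estimates = np.asarray(estimates, dtype=float)
    if d is None:
        w = np.full(len(estimates), 1.0 / len(estimates))
    else:
        w = distance_weights(d)
    return float(w @ estimates)
```

For example, with estimates 2 and 4 from genes at distances 1 and 3, the closer gene receives weight 3/4 and the combined estimate is 2.5 rather than the unweighted mean 3.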
2.2.6 PLSImpute
Partial least squares (PLS) imputation (PLSImpute) (Nguyen et al., 2004) is a regression method with a sequence of gene components as predictors and the target gene as the response variable. The gene components are constructed using the expression levels of the candidate genes, and the missing values are predicted using the fitted regression model. As defined before, a_{i(−v)} is the vector of available values of target gene i, and a_{i(v)} is the vector of missing entries to be imputed. The expression matrix of the candidate genes is denoted X, which can be partitioned according to the available and missing values of gene i as follows,

a_i = \begin{bmatrix} a_{i(-v)} \\ a_{i(v)} \end{bmatrix}, \qquad X = \begin{bmatrix} X_{(-v)} & X_{(v)} \end{bmatrix},

where X_{(−v)} is the matrix of available values corresponding to a_{i(−v)} and X_{(v)} consists of the available values corresponding to the missing values a_{i(v)} of target gene i. The pair (X_{(−v)}^T, a_{i(−v)}^T) is the training data and X_{(v)} is the test data used to predict the missing values a_{i(v)}.
Since the number of samples (arrays) is much smaller than the number of available genes in the training data, PLS reduces dimensions by extracting the gene components sequentially. The gene components are linear combinations of the set of candidate genes with maximum covariance with the target gene i. The kth gene component for the target gene i is given by t_{ik(-v)} = X_{(-v)}^T w_{ik}, where w_{ik} is a weight vector satisfying the following criterion,

w_{ik} = \arg\max_{w} \operatorname{cov}^2\!\left( X_{(-v)}^T w,\; a_{i(-v)}^T \right),   (2.20)

where w_{ik}^T w_{ik} = 1. In addition, w_{ik} is subject to the orthogonality constraints

w_{ik}^T S w_{id} = 0, \quad \text{for all } 1 \le d < k.   (2.21)
A linear regression model is fitted with the available expression levels of target gene i as the response and the K PLS gene components as predictors,

a_{i(-v)}^T = \begin{bmatrix} \mathbf{1} & t_{i1(-v)} & \cdots & t_{iK(-v)} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_K \end{bmatrix},   (2.22)

where t_{ik(-v)} is the kth gene component and the \beta_k's are the least squares regression coefficients for target gene i. The weights and the test data X_{(v)} are used to construct the test PLS components that impute the missing values a_{i(v)}. The kth test PLS component is given by t_{ik(v)} = X_{(v)}^T w_{ik}. The missing values are predicted using the fitted model,

\hat{a}_{i(v)}^T = \begin{bmatrix} \mathbf{1} & t_{i1(v)} & \cdots & t_{iK(v)} \end{bmatrix} \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_K \end{bmatrix}.   (2.23)

In this method, missing values in the matrix X_{(−v)} are allowed; they are replaced by the estimates from KNNImpute before the PLS gene components are constructed.
Nguyen et al. (2004) provided the algorithms of KNNImpute, OLSImpute, and PLSImpute and evaluated their performances as measured by the normalized relative estimation error (RAE), \sum_i (|y - \hat{y}|)/y, where y is the vector of true expression values. They found that the KNNImpute method performs well on average for the microarray data sets they used, and that good accuracy in imputing missing values can be achieved using 3 or 4 PLS components in the PLSImpute method.
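The PLS regression step of Equations (2.20)–(2.23) can be sketched with a minimal univariate PLS (PLS1) implementation via the NIPALS algorithm. This is an illustrative NumPy sketch under the usual PLS1 conventions, not Nguyen et al.'s code; the function names are invented for the example.

```python
import numpy as np

def pls1_fit(X, y, K):
    """Univariate PLS (PLS1) via the NIPALS algorithm.
    X: (arrays, genes) training predictors X_(-v)^T; y: available values
    a_i(-v) of the target gene.  Extracts K gene components t = X w with
    unit-norm weight vectors w, deflating X and y after each component."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(K):
        w = Xk.T @ yk                   # direction of maximal covariance
        w = w / np.linalg.norm(w)       # w^T w = 1, cf. Equation (2.20)
        t = Xk @ w                      # the gene component (score)
        tt = t @ t
        p = Xk.T @ t / tt               # predictor loadings
        qk = (yk @ t) / tt              # regression of y on the component
        Xk = Xk - np.outer(t, p)        # deflation enforces orthogonality
        yk = yk - qk * t
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    B = W @ np.linalg.solve(P.T @ W, q)  # coefficients in terms of X
    return x_mean, y_mean, B

def pls1_predict(model, X_new):
    """Predict missing values from the test data X_(v)^T, cf. (2.23)."""
    x_mean, y_mean, B = model
    return (X_new - x_mean) @ B + y_mean
```

With as many components as predictors, PLS1 reduces to ordinary least squares, so a target gene that is exactly a linear combination of the candidate genes is recovered exactly on the training arrays.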
2.2.7 LLSImpute
Kim et al. (2005) developed a local least squares (LLSImpute) method based on the KNNImpute algorithm and the LSImpute method. In this method, the K genes most correlated with a target gene i are selected from the set of candidate genes based on Euclidean distance or the Pearson correlation coefficient. Once the K genes are identified, the known expression values a_{i(−v)} of the target gene i are represented as a linear combination of the K genes,

a_{i(-v)}^T = G_{(-v)}^T b,   (2.24)

where G_{(−v)}^T is the expression matrix of the K genes excluding the arrays v in which the target gene i has missing values, and b is the vector of coefficients of the linear combination
obtained by least squares. The missing values a_{i(v)} can be estimated by

\hat{a}_{i(v)}^T = G_{(v)}^T \hat{b}.   (2.25)

They demonstrated that LLSImpute showed the best performance when strong local correlation structure is present in the data.
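Equations (2.24)–(2.25) amount to one least squares solve followed by a matrix-vector product. A minimal NumPy sketch (the function name and array layout are assumptions for the example, not Kim et al.'s code):

```python
import numpy as np

def lls_impute(target_avail, G_avail, G_miss):
    """Local least squares imputation, Equations (2.24)-(2.25).
    target_avail: known values a_i(-v) of the target gene,
    G_avail:      (K, m-l) expressions of the K similar genes on the
                  arrays where the target gene is observed,
    G_miss:       (K, l) expressions of the same genes on the arrays
                  where the target gene is missing.
    Solves a_i(-v)^T ~ G_(-v)^T b by least squares, then returns
    G_(v)^T b_hat as the estimate of the missing values."""
    b_hat, *_ = np.linalg.lstsq(G_avail.T, target_avail, rcond=None)
    return G_miss.T @ b_hat
```

If the target gene is an exact linear combination of the K selected genes, the missing entries are recovered exactly; otherwise the least squares fit gives the closest local linear reconstruction.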
2.3 Identification of Differentially Expressed Genes
One of the main goals of microarray data analysis is to compare the expression levels of genes in two different samples, where one sample is considered the reference or control and the other is considered the treatment. Obvious examples include comparing healthy versus diseased tissues or treated versus untreated tissues. With one-color systems, the comparisons occur between gene transcript levels from two or more sets of arrays. With two-color cDNA arrays, the ratio of the two fluorescence intensities (Red, Green) for each spot is used to make comparisons (Speed, 2003).
A number of methods have been proposed for detecting DE genes in microarray experiments since the microarray technique was developed. A straightforward method is to use the traditional F/t-statistics (Dudoit et al., 2002; Draghici, 2003) to identify genes with large ratios of between-group mean differences to variances in transcript levels. However, in a typical microarray data set, the number of genes is large (> 1000) whereas the number of replicate arrays is small. The performance of the F/t-statistics depends on the sample size and on whether transcript levels are normally distributed. As the number of replicates within treatments is often small for microarray experiments and the data may not be normally distributed, the performance of the F/t-statistics is usually poor. Several modeling approaches have been suggested to detect significantly expressed genes based on various distributional assumptions, such as a likelihood ratio test approach (Wang and Ethier, 2004) and a mixture model approach (Pan et al., 2001). In addition, some researchers have proposed nonparametric methods like the Mann-Whitney U-test and the Wilcoxon rank test, but the power of these tests is lower than that of the F/t-statistics because of the small sample sizes in microarray studies (Wang and Ethier, 2004).
Recently, Yano et al. (2006) proposed a gene discovery method based on correspondence analysis with a new index for expression ratios and three artificial marker genes. This method uses the scores (coordinates) of the first two or three principal components to identify genes related to phenotypes or treatments of interest in a 2-dimensional or 3-dimensional subspace. The new index for gene expression ratios is the arctangent of the reciprocal intensity ratio, arctan(1/ratio), and ranges from 0 to 90. When a ratio is equal to one, the index is 45. As the gene transcript is increased or decreased, this index increases or decreases from 45, respectively. To identify differentially expressed gene transcripts, three artificial genes, ExtraGene1, ExtraGene2, and ExtraGene3, are added to the dataset from a quantitative trait with two phenotypes (A and B). The expression ratio of ExtraGene1 is zero in the samples of phenotype A, and the maximum expression ratio, for example 100, is given to the B samples. ExtraGene2 has the inverse gene expression pattern to ExtraGene1. ExtraGene3 shows the same ratio (1.0) in all samples. ExtraGene1 and ExtraGene2 assist in the discovery of genes differentially expressed in the two phenotypes, and ExtraGene3 aids in the identification of housekeeping genes.
In a CA biplot obtained using the new index and the three ExtraGenes, up- and down-regulated genes have negative and positive scores on Factor1, respectively, and lie on a line segment between ExtraGene1 and ExtraGene2. This line segment is called the UDL (up/down line). All housekeeping genes and ExtraGene3 lie at the center of the UDL, the origin of the subspace. A distance from an ExtraGene, called the significant distance, is estimated and used to identify up-/down-regulated and housekeeping genes; it is calculated by the following formula:
d = \sqrt{ \chi^2_{\nu,\alpha} \Big/ \sum_{i=1}^{n} f_i },   (2.26)

where \nu is the degrees of freedom or the subspace dimension, \alpha is a predefined significance level, n is the number of samples, and f_i is the new index, arctan(1/ratio), for the ExtraGene in the ith sample. For example, Yano et al. applied the confidence areas to a space of seven dimensions for the breast cancer data, which had 62 samples in each of the two phenotypes. So the significant distance at the 5% significance level from ExtraGene3 becomes

d = \sqrt{14.0671/(45 \times 124)} = 0.0502.

The genes located outside the confidence area are identified as differentially expressed genes.
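The worked example above can be checked directly. The sketch below uses the chi-square critical value 14.0671 quoted in the text for \nu = 7 and \alpha = 0.05 (it could equally be computed, e.g. with scipy.stats.chi2.ppf(0.95, 7)); the function name is invented for the illustration.

```python
import math

def significant_distance(chi2_crit, f):
    """Significant distance d = sqrt(chi2_{nu,alpha} / sum_i f_i),
    Equation (2.26); f holds the index values of an ExtraGene across
    the samples."""
    return math.sqrt(chi2_crit / sum(f))

# Breast-cancer example: 124 samples, and ExtraGene3 (a housekeeping
# marker) has index arctan(1/1) = 45 in every sample.
d = significant_distance(14.0671, [45.0] * 124)
print(round(d, 4))  # 0.0502
```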
2.4 Application of Correspondence Analysis to Microarray Data
Correspondence analysis is an algebraic technique which allows a simultaneous representation of two data sets that comprise the rows and columns of a data matrix with non-negative entries (Greenacre, 1984). It treats the rows and columns of the data matrix in a symmetrical fashion and creates a low-dimensional projection from an originally high-dimensional data set, revealing interdependencies between the two sets of variables. The biplot is commonly used to display the relationships both between row variables and between column variables by plotting their projections on two independent dimensions.
CA has been applied over the years in psychometric and economic studies. However, its application to microarrays is relatively new. Kishino and Waddell (2000) applied CA to data consisting of 6500 genes expressed in 40 tumor and 22 normal colon tissue samples by Alon et al. (1999) to visualize the relationship between genes and tissues. Correspondence analysis gives most weight to over-expressed genes, and these are best seen associating with the tissues that over-express them. The genes that are expressed at similar levels in most tissues are located around the center of the CA biplots, whereas genes that have a high expression level in one tissue (e.g. tissue A) and a low average expression level in the other tissues are drawn to that tissue (e.g. tissue A). Kishino and Waddell also observed that some tissues from different tissue types are located close together, whereas some tissues from the same tissue types are located far from each other. Exploring the association among tissues based on gene expression with correspondence analysis, they indicated that the distances between two tissues in a CA biplot are determined by the correlation between the two tissues rather than by their types.
Fellenberg et al. (2001) demonstrated the applicability of correspondence analysis to both time series and non-time-series data sets. The time series data was the well known Saccharomyces cerevisiae cell-cycle synchronization data by Spellman et al. (1998), which was composed of 800 genes and 73 hybridizations (four synchronization methods at various time points). The data set was translated to the logarithmic ratios of the intensities of the two-channel fluorescence signals. To implement CA, Fellenberg et al. first found the minimum logarithmic ratio in the data, then shifted the whole data set to the positive range by adding |minimum| + 1. The missing values were treated as unchanged, i.e. ratio = 1. The non-time-series data originated from the authors' lab and investigated the transition of CDC14 from mitosis to G1, for two of four cell cycle stages. The data consisted of 1400 genes and two paired experimental conditions. Fellenberg et al. showed that CA biplots "provide an informative and concise means of visualizing [microarray data and] are able to discover relationships both among either genes or hybridizations and between genes and hybridizations" (Fellenberg et al., 2001). For example, in the biplot obtained by applying CA to the cell-cycle data, the hybridizations were clearly separated, genes were located near the hybridizations in which they had strong expression, and genes with similar expression patterns were clustered together. In addition, CA can be applied generally to microarray data, either as intensities or as log ratios.
Tan et al. (2004) extended and applied CA to high dimensional microarray time-course data in case-control studies. The features of CA were used to explore the various time-course gene expression profiles existing in the data, e.g. to identify the differential time-course expression profiles conditional on the disease status and/or to find shared time-course gene expression patterns in both the patients and the normal controls. By coordinating and examining the biplot produced by applying CA to diabetes data consisting of two patients and two normal controls, they identified differential and similar time-dependent gene expression profiles in response to insulin stimulation in a given disease status and found that the second axis characterized the time-course patterns regardless of the disease status. With the sample replicates, the bootstrap technique was applied to CA to assess the significance of the contribution by each gene and by each time-course experiment. The contribution is the proportion of the inertia of the lth leading axis explained by the ith gene, defined as

c_{il} = r_{i\cdot}\, g_{il}^2 / \lambda_l^2,   (2.27)

where r_{i\cdot} is the sum of row i of the matrix A divided by the total of A, g_{il} is the coordinate of gene i on the lth axis (see the formula for G in Chapter 3), and \lambda_l^2 is the eigenvalue of the lth dimension. The expected contribution for each gene is 1/n for
all the dimensions, where n is the number of genes. They generated 30,000 bootstrap samples and evaluated the significance of the contribution on the leading axes, c_{i\cdot} = \sum_l c_{il}, for each gene using the bootstrap p-value p_i \equiv \sum_{j=1}^{30{,}000} I(c_{i\cdot j} \le c_e)/30{,}000, where I(\cdot) is the indicator function, c_{i\cdot j} is the contribution estimated for gene i in bootstrap sample j, and c_e = 1/n is the expected contribution. A bootstrap sample matrix was obtained by taking the average gene expression of two bootstrap matrices, which were each formed by randomly picking rows with replacement from two control replicates and two diabetic replicates. Genes were ranked by the significance levels of their contributions on each of the first three axes and the top genes were selected.
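The contribution measure of Equation (2.27) and the bootstrap p-value can be sketched as follows. This is an illustrative NumPy sketch with invented function names, not Tan et al.'s code; it assumes the row masses, principal coordinates, and principal inertias have already been obtained from a CA fit.

```python
import numpy as np

def contributions(r, G, lam2):
    """Contribution of each gene to each leading axis, Equation (2.27):
    c_il = r_i * g_il^2 / lambda_l^2, with row masses r, principal
    coordinates G (n genes x L axes), and principal inertias lam2.
    For a consistent CA solution the contributions on each axis sum to 1."""
    r, G, lam2 = np.asarray(r), np.asarray(G), np.asarray(lam2)
    return (r[:, None] * G ** 2) / lam2[None, :]

def bootstrap_pvalue(c_boot, n):
    """Bootstrap p-value for one gene: the fraction of bootstrap
    contributions c_i.j that fall at or below the expected value
    c_e = 1/n, where n is the number of genes."""
    return float(np.mean(np.asarray(c_boot) <= 1.0 / n))
```

A small p-value means the gene's contribution rarely drops to the expected level across bootstrap samples, i.e. the gene consistently explains more than its share of the inertia on the leading axes.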
Methods

3.1 Correspondence Analysis

3.1.1 Introduction
Correspondence analysis (CA) is a statistical visualization method for analyzing simple two-way and multi-way tables containing some measure of correspondence between the rows and columns. It originated in France, and the name was translated from the French Analyse des Correspondances, where the term correspondence denotes a "system of associations" between the elements of two sets (Greenacre, 1984). As an exploratory and descriptive data analytic technique, CA remarkably simplifies complex data, identifies systematic relations between variables, and provides a detailed description of the information in the data. It was developed early on for contingency tables and later applied to general data matrices. In this chapter, we summarize the main features of correspondence analysis. Further details can be found in the references (Greenacre, 1984 and 1993).
In a contingency table, interest often centers on the association between rows and columns. The significance of the association is tested using the Chi-square test, but this test provides no information as to which individual associations between row-column pairs of the data matrix are significant. Correspondence analysis shows how the variables are related, not just that a relationship exists, by jointly displaying the rows and columns of a contingency table as points in a constructed orthogonal system of principal factors. The positions of the row and column points are consistent with their associations in the table, and the distances among the row points and among the column points are weighted Chi-squared distances. The principal factors or axes are constructed according to the information they represent and therefore are presented in decreasing order of variability. The goal is to have a global view of the data that is useful for interpretation. The dimension of a contingency table is m − 1, where m is the minimum of the number of rows and columns. The information included in a subspace of dimension p (p < m − 1) equals the sum of the information included in the p factors. The average percentage of the total information represented by one factor is 100/(m − 1), which serves as a guide in determining the relative importance of a given factor. In this system, proximity between observations or between variables is interpreted as strong association. For microarray data, the genes, as the variables, are in the rows and the treatments or replicates, as the observations, are in the columns. Correspondence analysis displays observations and variables simultaneously in the same space, making it easy to discover the salient information included in a given data table.
3.1.2 Basic Concepts and Definitions
The commonly used concepts in correspondence analysis are described below.
Primitive matrix: the original data matrix, N (I × J), or contingency table, is called the primitive matrix. The elements of this matrix are n_{ij}, with n_{ij} ≥ 0.

Correspondence matrix: the correspondence matrix P is defined as the primitive matrix N divided by the grand total n = \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij},

P = N/n,   (3.1)

which shows how one unit of mass is distributed across the cells.

Masses: let r denote the row mass, a vector of length I which contains the row sums of the correspondence matrix. Let c denote the column mass, a vector of length J containing the column sums:

r = P\mathbf{1} \quad \text{and} \quad c = P^T \mathbf{1},   (3.2)

where \mathbf{1} is a vector of 1's.
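These definitions can be illustrated numerically. A minimal NumPy sketch, with the entries of N invented for the example:

```python
import numpy as np

# a small primitive matrix N (values are made up for illustration)
N = np.array([[10.0, 20.0, 30.0],
              [ 5.0, 15.0, 20.0]])
n = N.sum()          # grand total
P = N / n            # correspondence matrix, Equation (3.1)
r = P.sum(axis=1)    # row masses,    r = P 1
c = P.sum(axis=0)    # column masses, c = P^T 1
```

By construction the row masses and column masses each sum to one, since P distributes one unit of mass across the cells.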
Profiles: let R and C denote the matrices of row and column profiles, respectively. Then

R \equiv D_r^{-1} P \equiv \begin{bmatrix} \tilde{r}_1^T \\ \vdots \\ \tilde{r}_I^T \end{bmatrix}, \qquad C \equiv D_c^{-1} P^T \equiv \begin{bmatrix} \tilde{c}_1^T \\ \vdots \\ \tilde{c}_J^T \end{bmatrix},   (3.3)

where D_r = diag(r) is the diagonal matrix of row masses, D_c = diag(c) is the diagonal matrix of column masses, \tilde{r}_i is the ith row of P divided by r_i, and \tilde{c}_j is the jth column of P divided by c_j.

Clouds of points N(I) and N(J): the row cloud of points N(I) is the set of I row profiles \tilde{r}_1, \ldots, \tilde{r}_I in J-dimensional space, whose masses are the elements of r. Similarly, the column cloud of points N(J) is the set of J column profiles \tilde{c}_1, \ldots, \tilde{c}_J in I-dimensional space, whose masses are the elements of c.
Weighted distances and Chi-square distances: in CA, a variant of the Euclidean distance, called the weighted Euclidean distance, is used to measure the distances between profile points. The squared distance between two profiles x and y is given by

d^2(x, y) = (x - y)^T D_q (x - y),   (3.4)

where D_q is the matrix of dimension weights: D_q = D_c^{-1} in the J-dimensional space defined by the columns, and D_q = D_r^{-1} in the I-dimensional space defined by the rows. For example, the squared distance between the profiles of rows i and i′ in J-dimensional space and the squared distance between the profiles of columns j and j′ in I-dimensional space, given in Equations (3.5) and (3.6), are called Chi-square distances:

d^2(\tilde{r}_i, \tilde{r}_{i'}) = (\tilde{r}_i - \tilde{r}_{i'})^T D_c^{-1} (\tilde{r}_i - \tilde{r}_{i'}),   (3.5)
d^2(\tilde{c}_j, \tilde{c}_{j'}) = (\tilde{c}_j - \tilde{c}_{j'})^T D_r^{-1} (\tilde{c}_j - \tilde{c}_{j'}).   (3.6)
As is well known, the Chi-square statistic, Equation (3.7), tests whether the deviations of the observed frequencies are significant departures from their expected frequencies assuming independence. It can be rewritten in the form of Chi-square distances in terms of row and column profiles, as given in Equations (3.8) and (3.9):

\chi^2 = \sum_{(i,j) \in I \times J} (o_{ij} - e_{ij})^2 / e_{ij} = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - n r_i c_j)^2}{n r_i c_j}   (3.7)

= \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{n r_i (\tilde{r}_{ij} - c_j)^2}{c_j} = n \sum_{i=1}^{I} r_i (\tilde{r}_i - c)^T D_c^{-1} (\tilde{r}_i - c)   (3.8)

= \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{n c_j (\tilde{c}_{ij} - r_i)^2}{r_i} = n \sum_{j=1}^{J} c_j (\tilde{c}_j - r)^T D_r^{-1} (\tilde{c}_j - r),   (3.9)

where o_{ij} = n_{ij} and e_{ij} = n r_i c_j are the observed and expected frequencies in the (i, j) cell of the matrix, respectively.
Total inertia: the quantity \chi^2/n is defined as the total inertia of the data matrix, which is a measure of the total variation of the matrix. From Equations (3.8) and (3.9), the total inertia is given as

\text{Total inertia} = \sum_{i=1}^{I} r_i (\tilde{r}_i - c)^T D_c^{-1} (\tilde{r}_i - c) = in(I)   (3.10)

= \sum_{j=1}^{J} c_j (\tilde{c}_j - r)^T D_r^{-1} (\tilde{c}_j - r) = in(J),   (3.11)

where in(I) and in(J) are the total inertias of the clouds of points N(I) and N(J), respectively. The total inertia can be interpreted as the weighted average of the squared \chi^2 distances between the profiles (rows or columns) and their centers.
3.1.3 Algebra of Correspondence Analysis
Correspondence analysis is a technique to find a low-dimensional approximation to the original data matrix that graphically represents the weighted \chi^2 distances between the row (or column) profiles and their average profiles. This low-dimensional subspace is determined by minimizing the sum of squared distances between the profile points and their projections onto the subspace. Singular value decomposition (SVD) gives the solution for the optimal subspace of any specified dimensionality. The SVD decomposes any real I × J matrix S of rank K as the product of three matrices:

S = U D_\alpha V^T,   (3.12)

i.e.,

S = \sum_{k=1}^{K} \alpha_k u_k v_k^T,   (3.13)

where U^T U = V^T V = I and \alpha_1 \ge \alpha_2 \ge \ldots \ge \alpha_K > 0. The elements \alpha_1, \alpha_2, \ldots, \alpha_K of the diagonal matrix D_\alpha are called the singular values of S. The columns of U (u_1, u_2, \ldots, u_K) are called the left singular vectors, and the columns of V (v_1, v_2, \ldots, v_K) are called the right singular vectors. The coordinates of the rows and columns of S with respect to the respective basis vectors in V and U are given by Equations (3.14) and (3.15), respectively:

F = U D_\alpha,   (3.14)
G = V D_\alpha.   (3.15)
The optimal K^*-dimensional projection S′ is defined by the first K^* terms of S, i.e.

S' = \sum_{k=1}^{K^*} \alpha_k u_k v_k^T;   (3.16)

then S′ minimizes

\| S - X \|^2 \equiv \sum_i \sum_j (s_{ij} - x_{ij})^2,   (3.17)

amongst all I × J matrices X of rank at most K^* (Greenacre, 1984).
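The optimality property of Equations (3.16)–(3.17) (the Eckart–Young result) can be checked numerically: the squared Frobenius error of the rank-K^* truncation equals the sum of the discarded squared singular values. A minimal NumPy sketch with an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.normal(size=(6, 4))                  # any real I x J matrix

U, alpha, Vt = np.linalg.svd(S, full_matrices=False)
K_star = 2
# rank-K* truncation, Equation (3.16): keep the first K* SVD terms
S_prime = (U[:, :K_star] * alpha[:K_star]) @ Vt[:K_star]

# squared approximation error, Equation (3.17)
err = np.sum((S - S_prime) ** 2)
```

Here `err` matches `np.sum(alpha[K_star:] ** 2)` to machine precision, confirming that the discarded singular values account for exactly the unexplained variation.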
In correspondence analysis, we project the row and/or column profiles of a matrix onto an optimal K^*-dimensional subspace using \chi^2 distances. The matrix Q = D_r^{-1/2}(P - rc^T)D_c^{-1/2}, with elements q_{ij} = (p_{ij} - r_i c_j)/\sqrt{r_i c_j}, called the matrix of standardized residuals, is decomposed. To perform this analysis, the matrix (P − rc^T), P centered by the row and column masses, is decomposed as

P - rc^T = A D_\alpha B^T,   (3.18)

using the generalized SVD, where A^T D_r^{-1} A = B^T D_c^{-1} B = I and the columns of A and B are called the principal axes of the column and row profiles, respectively. The difference between the SVD and the generalized SVD is that the SVD does not take account of masses and dimension weights while the generalized SVD does. Then the matrix D_r^{-1/2}(P - rc^T)D_c^{-1/2}
can be represented as in Equation (3.19). The principal coordinates of the row profiles centered by the column masses, R − 1c^T, and of the column profiles centered by the row masses, C − 1r^T, in terms of their own principal axes and their singular vectors, are given by Equations (3.20) and (3.21), respectively:

D_r^{-1/2}(P - rc^T)D_c^{-1/2} = U D_\alpha V^T,   (3.19)
F = D_r^{-1} A D_\alpha = D_r^{-1/2} U D_\alpha,   (3.20)
G = D_c^{-1} B D_\alpha = D_c^{-1/2} V D_\alpha,   (3.21)

where U = D_r^{-1/2} A and V = D_c^{-1/2} B. The first K^* < K columns of F and G provide the optimal solution in K^* dimensions, and the variation explained by the K^*-dimensional subspace is the sum of the first K^* squared singular values, called eigenvalues. Table 3.1 summarizes the algebra of correspondence analysis (Greenacre, 1984). The following are the proofs of Equations (3.20) and (3.21).
Proof. For the rows,

D_r^{-1/2}(P - rc^T)D_c^{-1/2} = D_r^{-1/2}(D_r R - rc^T)D_c^{-1/2} = D_r^{1/2}(R - 1c^T)D_c^{-1/2} = U D_\alpha V^T,

so

R - 1c^T = D_r^{-1/2} U D_\alpha V^T D_c^{1/2} = D_r^{-1/2} U D_\alpha B^T \quad (V = D_c^{-1/2} B),
F = D_r^{-1/2} U D_\alpha = D_r^{-1} A D_\alpha \quad (U = D_r^{-1/2} A).

Similarly, for the columns,

D_r^{-1/2}(P - rc^T)D_c^{-1/2} = D_r^{-1/2}(C^T D_c - rc^T)D_c^{-1/2} = D_r^{-1/2}(C^T - r1^T)D_c^{1/2} = U D_\alpha V^T,

so

(C - 1r^T)^T = D_r^{1/2} U D_\alpha V^T D_c^{-1/2},
C - 1r^T = D_c^{-1/2} V D_\alpha U^T D_r^{1/2} = D_c^{-1/2} V D_\alpha A^T \quad (U = D_r^{-1/2} A),
G = D_c^{-1/2} V D_\alpha = D_c^{-1} B D_\alpha \quad (V = D_c^{-1/2} B).
Table 3.1: Algebra of correspondence analysis.

                        Row                                                   Column
Masses:                 r = P1                                                c = P^T 1
Profiles:               R ≡ D_r^{-1} P ≡ [˜r_1^T, ..., ˜r_I^T]^T              C ≡ D_c^{-1} P^T ≡ [˜c_1^T, ..., ˜c_J^T]^T
Clouds of points:       the I row profiles ˜r_1 ... ˜r_I                      the J column profiles ˜c_1 ... ˜c_J
                        in J-dimensional space                                in I-dimensional space
Masses of clouds:       the I elements of r                                   the J elements of c
Distances:              (x_r − y_r)^T D_c^{-1} (x_r − y_r)                    (x_c − y_c)^T D_r^{-1} (x_c − y_c)
Centroids of clouds:    c = R^T r                                             r = C^T c
Total inertia:          in(I) = Σ_i r_i (˜r_i − c)^T D_c^{-1} (˜r_i − c)      in(J) = Σ_j c_j (˜c_j − r)^T D_r^{-1} (˜c_j − r)
                        = trace[D_r (R − 1c^T) D_c^{-1} (R − 1c^T)^T]         = trace[D_c (C − 1r^T) D_r^{-1} (C − 1r^T)^T]
                        and in(I) = in(J) = trace[D_r^{-1} (P − rc^T) D_c^{-1} (P − rc^T)^T]
Principal coordinates:  F = D_r^{-1} A D_α                                    G = D_c^{-1} B D_α
Basis coordinates:      F = D_r^{-1/2} U D_α                                  G = D_c^{-1/2} V D_α
3.1.4 Projection and Two-Dimensional Maps
Correspondence analysis is a dimension reduction method which identifies an optimal subspace in which the deviations from the expected values can be represented. It simultaneously analyzes the row profiles (row analysis) and the column profiles (column analysis) of the data matrix. The row or column analysis of a matrix is the projection of the row or column profiles from a multidimensional space onto an optimal low-dimensional subspace for interpretation of the inter-profile positions. Equations (3.10) and (3.11) show that the row analysis and the column analysis have the same total inertia, so that both analyses have the same decomposition of inertia into principal inertias along the principal axes.

With the matrices U, V, and D_\alpha, we can project the row profiles into the new space defined by V and the column profiles into the new space defined by U. In general, the projections may take place onto any low-dimensional subspace, but most CA displays are two-dimensional, with (\alpha_1^2 + \alpha_2^2) \big/ \sum_{k=1}^{K} \alpha_k^2 \times 100\% of the total inertia accounted for by the plane of the two axes.
There are two types of two-dimensional displays in correspondence analysis: symmetric maps and asymmetric maps. In an asymmetric map, also called a biplot, one of the sets of variables (e.g. the column variables) is displayed as the extreme vertices representing the principal axis directions of the corresponding variables, while the other (e.g. the row variables) is displayed as profile points. Each vertex can be viewed as an extreme profile concentrated entirely in the corresponding variable. The associations between the two sets of variables, as well as the variation amongst the profile points, are apparent in an asymmetric map. The coordinates of the row vertices and column vertices are given by Equations (3.22) and (3.23):

\Phi = D_r^{-1/2} U,   (3.22)
\Gamma = D_c^{-1/2} V.   (3.23)

If the display is of good quality, say more than 85% of the total inertia is captured by the display, then the closer a profile is to a vertex, the higher its profile is with respect to that variable.
In a symmetric map, the clouds of points N(I) and N(J) are overlaid in a joint display in principal coordinates. The two clouds of points should be interpreted separately, and the association between the rows and the columns cannot be judged graphically in a symmetric map. The distance between any two row (or column) profiles is the Chi-square distance, as mentioned previously. The magnitude of this distance between two points depends on how similar the two profiles are in their shapes rather than in their absolute values. The center of the map is the average profile, so the further a point is from the center, the bigger the difference between its profile and the average profile. By simultaneous inspection of a symmetric map and an asymmetric map, we can easily identify genes that are differentially expressed under a certain treatment.
To illustrate correspondence analysis, let us consider a simple example. The data in Table 3.2 are artificial frequencies of the smoking habits of different employees in a company. This is a 5 × 4 table, thus the dimension of the table is 3 and its row profiles and column profiles lie in three-dimensional spaces.

Table 3.2: Matrix of artificial data.

                         Smoking category
Staff group              None  Light  Medium  Heavy  Row totals
(SM) Senior managers        4      2       3      2          11
(JM) Junior managers        4      3       7      4          18
(SE) Senior employees      25     10      12      4          51
(JE) Junior employees      18     24      33     13          88
(SC) Secretaries           10      6       7      2          25
Column totals              61     45      62     25         193
Assume that we are interested in the number of staff in each group or in each smoking category out of the sample of 193; for example, the group JE has the largest sample size and has the most heavy smokers. However, these marginal views do not describe the patterns of the smoking habits. Table 3.3 lists the relative row frequencies, which are called row profiles.

Table 3.3: Relative frequencies of smoking categories within staff-groups.

                         Smoking category
Staff group              None    Light   Medium  Heavy   Row totals
(SM) Senior managers     0.3636  0.1818  0.2727  0.1818  1
(JM) Junior managers     0.2222  0.1667  0.3889  0.2222  1
(SE) Senior employees    0.4902  0.1961  0.2353  0.0784  1
(JE) Junior employees    0.2045  0.2727  0.3750  0.1477  1
(SC) Secretaries         0.4000  0.2400  0.2800  0.0800  1
Average                  0.3161  0.2332  0.3212  0.1295

From this table it is easy to compare the distributions of the staff groups across the smoking categories; for example, senior employees have the largest proportion of nonsmokers, both within this group and over all five groups, and senior managers and junior managers have relatively higher proportions of heavy smokers over all groups. If we visualize this data
Figure 3.1: Two-dimensional displays obtained by correspondence analysis of the smoking habit data. A: the asymmetric map; B: the symmetric map.
graphically, the row profiles can be displayed in a four-pointed tetrahedron, which can be observed in the three-dimensional space of the points. Figure 3.1 shows the two-dimensional displays obtained by correspondence analysis,
where (A) is the asymmetric map showing the projections of the row profiles (in principal coordinates) and the four column vertex points (in standard coordinates), and (B) is the symmetric map showing the projections of both the row profiles and the column profiles. The triangles denote the smoking categories and the dots correspond to the staff groups. The inertias explained by the first two principal axes are 0.0748 (87.76%) and 0.0100 (11.76%), respectively. The cumulative inertia in the plane is 0.0848, which is 99.5% of the total inertia.
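The reported inertias can be reproduced from Table 3.2 by applying Equations (3.19)–(3.21) directly. The following is an illustrative NumPy sketch (not the software used in the thesis):

```python
import numpy as np

# Table 3.2: the artificial smoking data
N = np.array([[ 4.0,  2,  3,  2],
              [ 4.0,  3,  7,  4],
              [25.0, 10, 12,  4],
              [18.0, 24, 33, 13],
              [10.0,  6,  7,  2]])

P = N / N.sum()                           # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)       # row and column masses
Dr_isqrt = np.diag(1.0 / np.sqrt(r))
Dc_isqrt = np.diag(1.0 / np.sqrt(c))

# matrix of standardized residuals, Equation (3.19), and its SVD
Q = Dr_isqrt @ (P - np.outer(r, c)) @ Dc_isqrt
U, alpha, Vt = np.linalg.svd(Q, full_matrices=False)

F = (Dr_isqrt @ U) * alpha                # row principal coordinates (3.20)
G = (Dc_isqrt @ Vt.T) * alpha             # column principal coordinates (3.21)

inertias = alpha ** 2                     # principal inertias (eigenvalues)
# first two principal inertias: about 0.0748 (87.76%) and 0.0100 (11.76%)
```

The first two principal inertias and their percentages of the total inertia agree with the values quoted for Figure 3.1, and the third (and the numerically zero fourth) singular value accounts for the remaining 0.5%.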
Looking at the positions of the profiles (dots) relative to the vertices (triangles) in Figure 3.1 (A), we see that three smoking categories are on the right and the nonsmoking category is on the left. This means that the right-to-left distinction is tantamount to smokers versus nonsmokers. The groups JE and JM have relatively more heavy smokers than SE does. The actual data are that 14.77% of JEs and 22.22% of JMs are heavy smokers whereas 7.84% of SEs are heavy smokers, which agrees with the interpretation of Figure 3.1 (A).
In Figure 3.1 (B), we notice that the coordinates of the points of the column profiles are scaled differently from those in (A), i.e. g_k = \alpha_k \gamma_k, where g_k is the kth column of G and \gamma_k is the kth column of \Gamma. Figure 3.1 (B) is a joint display in which the separate configurations of the row profiles and the column profiles are overlaid. The distance between the row points (or the column points) approximates the inter-row (or inter-column) \chi^2 distance, which is a measure of similarity between the row (or column) profiles. The group JM is near the group JE since they are relatively similar to each other in terms of their smoking habits. Distances between the points representing smoking categories are interpreted in the same way.
The center of Figure 3.1 (A) or (B) is the average profile. The deviations of the staff groups can be viewed as departures outwards from the average profile in different directions.
3.2 Algorithm for Identifying DE Genes and Outliers
As mentioned in the previous section, the center of a biplot is always the average profile, and the distances between the row (column) profile points and the center of a biplot measure the deviations of the rows (columns) from their average profile. One of our goals is to estimate a region in the biplot surrounding the center such that one can consider, with some level of probability, that the points which lie outside the region are differentially expressed genes or outliers. Figure 3.2 indicates graphically what is desired; it is the biplot for simulated microarray data consisting of three arrays (three treatments or three replicates of a condition). The ellipse describes a confidence region with content 0.95. When comparing treatments, the genes simulated with significantly higher/lower intensities in one array than the others lie on or outside the ellipse; the higher/lower the intensity, the further from the center in the direction or