Applications of Correspondence Analysis
in Microarray Data Analysis
by Ruixia Mu
B.Eng., Tianjin University, China, 1983
A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
MASTER OF SCIENCE
in the Department of Mathematics and Statistics
© Ruixia Mu, 2008, University of Victoria
All rights reserved. This thesis may not be reproduced in whole or in part, by photocopy or other means, without the permission of the author.
Supervisory Committee

Dr. Mary Lesperance, Supervisor
(Department of Mathematics and Statistics)
Dr. Bill Reed, Departmental Member
(Department of Mathematics and Statistics)
Dr. Julie Zhou, Departmental Member (Department of Mathematics and Statistics)
Dr. Caren C. Helbing, External Examiner (Department of Biochemistry and Microbiology)
ABSTRACT
Correspondence analysis is a descriptive and exploratory technique for the study of associations between variables. It is a visualization method for analyzing high-dimensional data via projections onto a low-dimensional subspace. In this thesis, we demonstrate the applicability of correspondence analysis to microarray data. We show that it can be used to identify important genes and treatment patterns by coordinating and projecting the genes and the experimental conditions. In addition, we estimate missing values in the gene expressions using the Expectation-Maximization (EM) algorithm and identify genes with large between-condition variability using the projections of the genes and the conditions. To demonstrate its application, correspondence analysis is applied to various simulated data and to microarray data from the EPA (Environmental Protection Agency) studies. We conclude that correspondence analysis is a useful tool for analyzing the associations between genes and experimental conditions, for identifying important genes, and for estimating missing values.
Table of Contents

Supervisory Committee
Abstract
Table of Contents
Acknowledgments
1 Introduction
1.1 Background on DNA Arrays
1.2 Data Sources
1.3 Summary of Thesis
2 Literature Review
2.1 Quality Control for DNA Microarrays with Replicates
2.2 Estimation of Missing Values
2.2.1 Notation
2.2.2 SVDImpute
2.2.3 KNNImpute
2.2.4 LSImpute
2.2.5 OLSImpute
2.2.6 PLSImpute
2.2.7 LLSImpute
2.3 Identification of Differentially Expressed Genes
2.4 Application of Correspondence Analysis to Microarray Data
3 Methods
3.1 Correspondence Analysis
3.1.1 Introduction
3.1.2 Basic Concepts and Definitions
3.1.3 Algebra of Correspondence Analysis
3.1.4 Projection and Two-Dimensional Maps
3.2 Algorithm for Identifying DE Genes and Outliers
3.3 EM Algorithm for Missing Value Estimation
4 Simulation Study
4.1 Determination of Simulation Models
4.2 Simulation Experiments
4.2.1 Experiment #1
4.2.2 Experiment #2
4.2.3 Experiment #3
4.2.4 Experiment #4
5 Application to Real Microarray Data
5.1 Datasets
5.2 Missing Value Estimation
5.3 Identification of Differentially Expressed Genes
6 Discussion and Conclusion
6.1 Application to Elucidating the Association between Genes and Treatments
6.2 Application to Imputing Missing Values
6.3 Application to Identifying DE Genes or Outliers
Bibliography
Appendix
A Results of Simulations
A.1 Graphs of Correspondence Analysis for Experiment #1
A.2 Results of Simulation Studies
B Script Files
B.1 Computational Program of Correspondence Analysis
B.1.1 Original CA computation
B.1.2 Modified CA computation
B.1.3 Computational program of DE genes identification
B.2 Programs of Missing Value Imputation
B.2.1 CAEMImpute
B.2.2 KNNImpute
B.2.3 OLSImpute
B.2.4 PLSImpute
B.3 Programs of Data Simulation
B.3.1 Function for simulating ratio data
B.3.2 Function for simulating intensity data
B.3.3 Function for simulating DE data
B.4 Script Files of Simulation Experiments
B.4.1 Experiment 1
B.4.2 Experiment 2
B.4.3 Experiments 3 and 4
List of Tables

3.1 Algebra of correspondence analysis.
3.2 Matrix of artificial data.
3.3 Relative frequencies of smoking categories within staff groups.
4.1 Numbers of distinct genes in each of the data sets to be examined.
4.2 Cut points and percentages of genes having high intensities in each of the treatments in the EPAA study.
4.3 Estimates of parameters for log-normal and uniform distributions for each data set, and averages over the data sets.
4.4 Numbers of genes at each of the time points.
4.5 Estimates of parameters for normal and exponential distributions for each data set, and averages over the data sets.
4.6 Factors and levels for simulation Experiment #1.
4.7 Proportions of inertia displayed in the first two dimensions of CA.
4.8 Factors and levels for simulation Experiment #2.
4.9 Ratios of the ExtraGenes.
5.1 Summary of some characteristics of the EPAA and EPAB intensity data from brain tissues at 24 hours.
5.2 RMSEs from CAEMImpute and KNNImpute for the intensity data with different percentages of missing values.
A.1 Effect of the sample size on the TPR and FPR of the DE genes identified.
A.2 Effect of the proportion of DE genes on the TPR and FPR of the DE genes identified.
A.3 Effect of the number of treatments in which a DE gene is multiply differentially expressed on the TPR and FPR of the DE genes identified.
A.4 Effect of the proportion of multiply expressed genes on the TPR and FPR of the DE genes identified.
A.5 Effect of the proportion of down-regulated genes on the TPR and FPR of the DE genes identified.
A.6 TPR and FPR for Experiment #3.
A.7 TPR and FPR for Experiment #4-1.
A.8 RMSE for Experiment #4-2.
C.1 Missing value imputations for the EPAA study.
C.2 Missing value imputations for the EPAB T3 study.
C.3 Missing value imputations for the EPAB T4 study.
C.4 Names and ratios of DE genes identified from the EPAA data.
C.5 Names and ratios of DE genes identified from the EPAB T3 data.
List of Figures

3.1 Two-dimensional displays obtained by correspondence analysis of the smoking habit data. A: the asymmetric map; B: the symmetric map.
3.2 Two-dimensional displays with confidence ellipses for simulated microarray experimental data.
4.1 Histograms for the EPAA study data at t = 24 hours.
4.2 Histograms for the EPAA study data at t = 48 hours.
4.3 Histograms for the EPAA study data at t = 96 hours.
4.4 Plots for the EPAA study, the control at t = 24 hours.
4.5 Plots for the EPAA study, Methimazole at t = 24 hours.
4.6 Plots for the EPAA study, Propylthiouracil at t = 24 hours.
4.7 Plots for the EPAA study, Perchlorate at t = 24 hours.
4.8 Plots for the EPAA study, the control at t = 48 hours.
4.9 Plots for the EPAA study, Methimazole at t = 48 hours.
4.10 Plots for the EPAA study, Propylthiouracil at t = 48 hours.
4.11 Plots for the EPAA study, Perchlorate at t = 48 hours.
4.12 Plots for the EPAA study, the control at t = 96 hours.
4.13 Plots for the EPAA study, Methimazole at t = 96 hours.
4.14 Plots for the EPAA study, Propylthiouracil at t = 96 hours.
4.15 Plots for the EPAA study, Perchlorate at t = 96 hours.
4.16 Histograms and normal Q-Q plots for the LPS log2(ratio) data.
4.17 Plots for non-regulated genes of the LPS log2(ratio) data.
4.18 Plots for down-regulated genes of the LPS log2(ratio) data.
4.19 Plots for up-regulated genes of the LPS time-course log2(ratio) data.
4.20 Plots of ATPR against the proportions of DE genes.
4.21 Plots of AFPR against the proportions of DE genes.
4.22 Plots of ATPR against the number of treatments.
4.23 Plots of AFPR against the number of treatments.
4.24 Plots of ATPR against the proportions of MDE genes.
4.25 Plots of AFPR against the proportions of MDE genes.
4.26 Plots of ATPR against the proportions of down-regulated genes.
4.27 Plots of AFPR against the proportions of down-regulated genes.
4.28 Plots of (a) ATPR and (b) AFPR in Experiment #3 against the number of replicates.
4.29 Plots of (a) ATPR and (b) AFPR in Experiment #4 against the number of replicates.
4.30 Plots of RMSE against the number of treatments: (a) the KNNImpute method; (b) the OLSImpute method.
4.31 Plots of RMSE against the number of treatments for four imputation methods.
5.1 CA plots for the EPAA study in the first two dimensions.
5.2 CA plots for the EPAB T3 study in the first two dimensions.
5.3 CA plots for the EPAB T4 study in the first two dimensions.
A.1 CA plots for N = 100, M = 3, and PDE = 0.05.
A.2 CA plots for N = 100, M = 3, and PDE = 0.10.
A.4 CA plots for N = 100, M = 4, and PDE = 0.05.
A.5 CA plots for N = 100, M = 4, and PDE = 0.10.
A.6 CA plots for N = 100, M = 4, and PDE = 0.15.
A.7 CA plots for N = 100, M = 5, and PDE = 0.05.
A.8 CA plots for N = 100, M = 5, and PDE = 0.10.
A.9 CA plots for N = 100, M = 5, and PDE = 0.15.
A.10 CA plots for N = 500, M = 3, and PDE = 0.05.
A.11 CA plots for N = 500, M = 3, and PDE = 0.10.
A.12 CA plots for N = 500, M = 3, and PDE = 0.15.
A.13 CA plots for N = 500, M = 4, and PDE = 0.05.
A.14 CA plots for N = 500, M = 4, and PDE = 0.10.
A.15 CA plots for N = 500, M = 4, and PDE = 0.15.
A.16 CA plots for N = 500, M = 5, and PDE = 0.05.
A.17 CA plots for N = 500, M = 5, and PDE = 0.10.
A.18 CA plots for N = 500, M = 5, and PDE = 0.15.
A.19 CA plots for N = 1000, M = 3, and PDE = 0.05.
A.20 CA plots for N = 1000, M = 3, and PDE = 0.10.
A.21 CA plots for N = 1000, M = 3, and PDE = 0.15.
A.22 CA plots for N = 1000, M = 4, and PDE = 0.05.
A.23 CA plots for N = 1000, M = 4, and PDE = 0.10.
A.24 CA plots for N = 1000, M = 4, and PDE = 0.15.
A.25 CA plots for N = 1000, M = 5, and PDE = 0.05.
A.26 CA plots for N = 1000, M = 5, and PDE = 0.10.
Acknowledgments
I would like to begin by thanking my supervisor Dr. Mary Lesperance for her patience, support, and guidance in countless ways. I feel very fortunate to have had Dr. Lesperance as my supervisor. I also would like to thank Dr. Caren C. Helbing and all members of our research team for their support and friendship.
Many thanks go to the faculty and staff of the Department of Mathematics and Statistics of the University of Victoria for providing financial support, a stimulating work environment and inspiration.
Finally and most importantly, I would like to thank my husband and daughter for their love, help, and support.
Chapter 1

Introduction
DNA array technology was developed to study the expression levels of thousands of genes in a biological sample simultaneously. It has been applied in numerous studies over a broad range, from the study of gene expression in yeast under different environmental conditions to the comparison of gene expression profiles for tumors from cancer patients (Dudoit et al., 2002). In addition to the potential to help advance biological knowledge at a genomic scale, microarrays have very important applications in pharmaceutical and clinical research (Draghici, 2003). However, one remaining challenge is how to analyze and interpret the large amounts of data generated by microarray experiments. A common task in analyzing microarray data is to identify genes with altered expression in samples drawn from two different tissues, time points, or conditions. As biological systems are usually dynamic and interactive, it is better to use a distribution-free method to capture and present the diverse expression patterns.
Correspondence analysis (CA) is a descriptive/exploratory technique designed for investigating the association between row and column variables by graphically displaying the patterns and structures existing in the data. It has been widely applied to categorical data in the social sciences, environmental science, and marketing research (Blasius et al., 1998). The method has recently been introduced to the analysis of microarray data (Kishino and Waddell, 2000; Fellenberg et al., 2001; Tan et al., 2004; Yano et al., 2006) to uncover the associations between genes and tissues/treatments based on gene expressions. One major requirement of CA is that the data matrix have no missing values. In this study, we explore the feasibility of applying the CA technique to the identification of differentially expressed (DE) genes, the estimation of missing values, and the examination of the quality of replicate microarrays.
1.1 Background on DNA Arrays
A new area of genome research called functional genomics has developed as researchers continually increase the rate at which genomes are sequenced. Functional genomics aims to discover the biological function of DNA sequences. As the complete DNA sequences of many genomes are already known, an important and difficult task is to define the role of each gene and understand how the genome functions as a whole. DNA array technology was developed to exploit DNA sequence data and yield information about gene expression levels for entire genomes (Draghici, 2003).
DNA consists of two strands of linked nucleotides, each carrying one of four bases: adenine (A), thymine (T), guanine (G), and cytosine (C). These four bases form the genetic alphabet. The genetic information is encoded in strings of variable length formed with letters from this alphabet and is stored in various very long strings of DNA. The two strands of the DNA molecule are held together by hydrogen bonds according to the base-pairing rules: G pairs with C and A pairs with T. For example, if one strand has the sequence A-G-C-C-G-T-A-T-C, the other strand will have the complementary sequence T-C-G-G-C-A-T-A-G. Various substrings of DNA constitute functional units called genes, which contain the information necessary to construct proteins, the ultimate expression of the genetic information. The expression of the genetic information stored in the DNA molecule occurs in two stages: transcription and translation. During transcription, the genetic information contained in a gene from the DNA is converted into messenger ribonucleic acid (mRNA). mRNA is a single-stranded complementary copy of the base sequence in the DNA molecule, with the base uracil (U) replacing thymine (T) (e.g. U-C-G-G-C-A-U-A-G). Translation is the process of converting the message from the 4-letter RNA alphabet to the 20-letter alphabet of amino acids used to build proteins (Berrar et al., 2003), according to a code in which each amino acid is represented by three bases in mRNA. The types and amounts of protein generated determine the function of the cell, and changes in gene expression levels can give rise to major changes at the organismic level and induce illnesses such as cancer. Therefore, comparing the expression levels of various genes between different conditions is of extreme interest to life scientists.
Microarray technology has been developed to investigate the properties of gene expression at the transcription or translation levels, as well as the subcellular localization of gene products. Commonly, attention has focused primarily on expression at the transcription stage, i.e. on mRNA levels, because mRNA levels sensitively reflect the type and state of the cell. Microarrays make use of a key property of the DNA duplex, the sequence complementarity of the two strands, which makes hybridization, a chemical reaction, possible. During hybridization, two complementary nucleic acid strands, for example A-G-C-C-G-T-A-T-C and T-C-G-G-C-A-T-A-G, combine to form a double-stranded molecule according to the base-pairing rules. To utilize the hybridization property of DNA, complementary DNA (cDNA) (e.g. A-G-C-C-G-T-A-T-C) is obtained from mRNA (e.g. A-G-C-C-G-U-A-U-C) by reverse transcription.
There are two main types of cDNA microarrays: nylon membrane-based filter arrays and chemically coated glass-based arrays (Berrar, 2003). The nylon membrane arrays are hybridized with radioactively labeled cDNA targets, and the glass arrays are hybridized with fluorescent dye-labeled targets. In a microarray experiment, samples of DNA clones with known sequence content are spotted and immobilized onto a glass slide or nylon filter. The mRNA extracted from the tissue cell under study is purified, reverse transcribed into cDNA, and labeled, for example with fluorescent dyes (e.g. the red-fluorescent dye Cy5 and the green-fluorescent dye Cy3). Labeled cDNA hybridizes to the spots containing complementary sequences on the array. After hybridization, the radioactive or fluorescent signal intensities are measured using a phosphorimager or laser scanner, respectively. One intensity is measured on each spot for the radiation-labeled array (one-channel array), while two intensities are measured on each spot for the fluorescence dye-labeled array (two-channel array). In both cases, the intensities are used as the expression levels of genes in the sample under study (Draghici, 2003). In this study, we primarily use data from one-channel arrays, but the results are applicable to two-channel arrays.
1.2 Data Sources
In this thesis, two kinds of data, real and simulated microarray experimental data, are used to demonstrate the applications of CA. The real microarray data are three of a series of data sets collected for the US Environmental Protection Agency (EPA) by Dr. Caren Helbing's lab. The EPA series consists of two sets of experiments, EPAA and EPAB, using Xenopus laevis frog tadpole brain, leg, and tail tissues. In the EPAA studies, these tissues were exposed to three different thyroid hormone (TH) inhibitors, methimazole (Meth), propylthiouracil (PTU), and perchlorate (Per), along with a control condition (Con), for 24, 48, and 96 hours. The EPAB data are obtained from the same tissues exposed to the thyroid hormones thyroxine (T4) and 3,5,3′-triiodothyronine (T3), along with their corresponding control conditions, for 24, 48, and 96 hours (Helbing et al., 2007). The three data sets used in our study are the brain tissue data of the EPAA, EPAB (T3), and EPAB (T4) studies at 24 hours. More details about the experimental procedures and data processing are given in Chapter 5.

The simulated data are generated according to the distributions and the factors encountered in real biological microarray experiments. The main factors are the numbers of genes and treatments, as well as the proportion of DE genes. More details of the simulation study are given in Chapter 4.
1.3 Summary of Thesis
This thesis is organized into six main chapters. Chapter 1 provides a brief scientific background on microarray technology and a summary of the data sets used in this study.

Chapter 2 provides a literature review of different statistical methods used to identify differentially expressed genes, estimate missing values, and investigate the quality of replicated microarray data, as well as of applications of correspondence analysis to microarray data.

Theoretical details of correspondence analysis are described in Chapter 3. The algorithm used for missing value estimation, the computational methods proposed for identifying DE genes, and the quality assessment of replicated arrays are also covered in this chapter.

Chapters 4 and 5 give details about the implementation of each method listed in Chapter 3. The simulation studies together with their analysis are presented in Chapter 4, and the application to real microarray data is provided in Chapter 5. Chapter 6 concludes with a discussion.
Chapter 2
Literature Review
2.1 Quality Control for DNA Microarrays with Replicates
The most important task before analyzing microarray experimental data is to review the quality of the data obtained from each microarray experiment, on a global array basis and also on an individual spot basis. Since a microarray experiment involves a large number of steps, from experimental setup, including array preparation and RNA sample preparation, through hybridization, to image analysis, cDNA microarray data often contain unusual observations. Quality-assured data are required for any reliable high-level analysis. In addition, quality control allows the experimenter to monitor and improve the laboratory procedures.
Dumur et al. (2004) studied quality control criteria used during sample preparation,
e.g. RNA sample isolation and cDNA synthesis products. Many methods, such as
image quantification, data filtering, transformation, and normalization of arrays, have
been proposed for improving data quality on the global array basis (Smyth et al.,
2002; Berrar et al., 2003; Draghici, 2003). Data filtering methods assign pass/fail
criteria to the quality control measures computed, which are used to filter out unusual
spots, such as overlapping spots, contamination, saturated measurements, and to detect
poor quality spots. Some simple filtering methods are applied for quality control on
replicated expression levels of a particular gene. For example, some researchers set a
cut-off value for the standard deviation for each gene between replicate arrays to filter
out unreliable genes (Helbing et al., 2007; Mookherjee et al., 2006).
Gottardo et al. (2006) proposed a way to filter and flag gene and array outliers based on the parameter estimates from a Bayesian hierarchical model. The Bayesian hierarchical model is developed to estimate the expression levels of genes in a robust way, in which outliers are modeled using a t-distribution; the non-constant variances and design effects are used to measure the quality of replicates. They show that this model works well with four replicates of microarrays, but it requires considerable computing time since the estimation of parameters in the model is performed using Markov chain Monte Carlo methods.
2.2 Estimation of Missing Values
Gene expression data from microarray experiments usually have missing values for various reasons, e.g. image corruption, insufficient resolution, or simple dust or scratches on the slide (Troyanskaya et al., 2001). However, many multivariate statistical analysis methods, such as correspondence analysis (CA), principal components analysis (PCA), and singular value decomposition (SVD) (Alter et al., 2002), cannot be applied to data with missing values. Missing value estimation is therefore an important preprocessing step (Berrar et al., 2003) in microarray data analysis. Because of the cost and time required to repeat a microarray experiment, several simple methods exist for dealing with missing values, such as ignoring the entries containing missing values, replacing missing values by zero, or imputing missing values using expression averages or medians across all arrays (Berrar, 2003). Current research shows that missing value estimation can be significantly improved by considering the correlation between data. Several methods, including the K-nearest neighbors method (KNNImpute), the SVD method (SVDImpute) (Troyanskaya et al., 2001), least squares imputation (LSImpute) (Bo et al., 2004), ordinary least squares (OLSImpute), partial least squares (PLSImpute) (Nguyen et al., 2004), and local least squares imputation (LLSImpute) (Kim et al., 2005), have been proposed.
2.2.1 Notation
Here, we introduce some notation commonly used with imputation algorithms.

Let $A$ be an $n \times m$ matrix representing microarray data with $n$ genes and $m$ arrays, and let $a_{kj}$ be the expression of gene $k$ under array $j$. Denote the expression matrix without gene $i$ as $A_{-i}$ and the expression matrix without array $j$ as $A_{(-j)}$.

A particular gene, say gene $i$, with missing values to be imputed is called the target gene, and its expression is denoted by a vector $a_i$. Let a length-$l$ vector $v$ denote the set of arrays in which the target gene has missing values; thus, the available expression values and the missing values of the target gene $i$ are expressed as $a_{i(-v)}$ and $a_{i(v)}$, respectively (if gene $i$ has no missing values, $a_{i(v)}$ is an empty vector).

All genes with available values in an array of $v$ corresponding to a missing value $a_{iv}$ make up the set of candidate genes for estimating the missing value $a_{iv}$, denoted by $X$. Let $G$ represent the matrix of the original expressions of $K$ significant candidate genes for a particular target gene $i$, let $G_{(-v)}$ denote the matrix $G$ without the arrays $v$, and let $g_{(v)}$ denote the candidate genes' expressions under array $v$, given by (2.1), (2.2), and (2.3), respectively:

$$ G = \begin{pmatrix} g_{1,1} & \cdots & g_{1,m} \\ \vdots & \ddots & \vdots \\ g_{K,1} & \cdots & g_{K,m} \end{pmatrix}_{K \times m}, \qquad (2.1) $$

$$ G_{(-v)} = \begin{pmatrix} g_{1,1} & \cdots & g_{1,v_j-1} & g_{1,v_j+1} & \cdots & g_{1,m} \\ \vdots & \ddots & \vdots & \vdots & \ddots & \vdots \\ g_{K,1} & \cdots & g_{K,v_j-1} & g_{K,v_j+1} & \cdots & g_{K,m} \end{pmatrix}_{K \times (m-l)}, \qquad (2.2) $$

$$ g_{(v)} = (g_{1,v} \; g_{2,v} \; \cdots \; g_{K,v})^T, \qquad (2.3) $$

for $1 \le j \le l$.
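As a concrete illustration of this notation, the sketch below builds $A$, $v$, $a_{i(-v)}$, $G$, $G_{(-v)}$, and $g_{(v)}$ for a small hypothetical expression matrix in numpy. This is not the thesis's own appendix code; the data, variable names, and the choice to take all fully observed genes as candidates are our assumptions.

```python
import numpy as np

# Hypothetical 5-gene x 4-array expression matrix; np.nan marks missing entries.
A = np.array([
    [1.0, 2.0, np.nan, 4.0],   # target gene i = 0, missing in array v = [2]
    [1.1, 2.1, 3.1, 4.1],
    [0.9, 1.9, 2.9, 3.9],
    [2.0, 1.0, 0.5, 0.2],
    [1.2, 2.2, 3.2, 4.2],
])

i = 0                                     # index of the target gene
v = np.where(np.isnan(A[i]))[0]           # arrays where gene i is missing
obs = np.setdiff1d(np.arange(A.shape[1]), v)

a_i_obs = A[i, obs]                       # a_{i(-v)}: available target values
G = np.delete(A, i, axis=0)               # drop the target gene itself
G = G[~np.isnan(G).any(axis=1)]           # candidate genes must be fully observed
G_minus_v = G[:, obs]                     # G_{(-v)}: candidates without arrays v
g_v = G[:, v]                             # g_{(v)}: candidate values in arrays v
```

With this toy matrix, $G$ is $4 \times 4$, $G_{(-v)}$ is $4 \times 3$, and $g_{(v)}$ holds the candidates' values in the single missing array.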
2.2.2 SVDImpute
SVDImpute (Troyanskaya et al., 2001) applies singular value decomposition to obtain a set of $K$ significant genes, called eigengenes. The expressions of these eigengenes can be linearly combined to approximate the expression of all genes in the data set. To perform SVD on the incomplete matrix, replace all missing values in the original matrix $A$ with the corresponding row averages, denote the result as $A'$, and then perform SVD on $A'$ to obtain

$$ A'_{n \times m} = U_{n \times n} \Lambda_{n \times m} V^T_{m \times m}, \qquad (2.4) $$

where $U^T U = I_{n \times n}$ and $V^T V = I_{m \times m}$, i.e. $U$ and $V$ are orthogonal. The columns of $U$ are the left singular vectors (gene coefficient vectors); $\Lambda$ is a diagonal matrix containing the singular values; and the rows of $V^T$ are the right singular vectors (expression level vectors). By sorting the eigengenes based on their corresponding singular values in descending order, the first $K$ significant genes are selected, with $K < m - l$, to form the matrix $G$.

A linear regression model with the target gene as the response variable and the expressions of $G_{(-v)}$ as predictors is fitted, as in Equation (2.5). The missing value $a_{iv}$ is then predicted using the fitted regression model, as in Equation (2.6):

$$ a^T_{i(-v)} = \left( \mathbf{1} \;\; G^T_{(-v)} \right) \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_K \end{pmatrix}, \qquad (2.5) $$

$$ \hat{a}_{iv} = \left( 1 \;\; g^T_{(v)} \right) \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_K \end{pmatrix}. \qquad (2.6) $$

This procedure is repeated on the new matrix $A'$, in which all imputed missing values are replaced by the new estimates, until the total change of the matrix is not greater than an empirical threshold of 0.01, i.e.

$$ \sum_{i=1}^{n} \sum_{j=1}^{m} \left| a'^{(n_{it}+1)}_{i,j} - a'^{(n_{it})}_{i,j} \right| \le 0.01, $$

where $n_{it}$ is the number of iterations.
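The iteration described above can be sketched in numpy as follows. This is an illustrative simplification under our own naming, not the thesis's appendix code: it fills missing cells from a rank-$K$ truncated SVD at each step, which plays the role of the eigengene regression of Equations (2.5)-(2.6), and stops on the total-change criterion.

```python
import numpy as np

def svd_impute(A, K=2, tol=0.01, max_iter=100):
    """Iterative SVD-based imputation sketch: refill missing entries from a
    rank-K SVD reconstruction until the total change is at most `tol`."""
    A = np.asarray(A, dtype=float)
    miss = np.isnan(A)
    # Initialize missing values with the corresponding row averages.
    filled = np.where(miss, np.nanmean(A, axis=1, keepdims=True), A)
    for _ in range(max_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :K] * s[:K]) @ Vt[:K, :]    # rank-K reconstruction
        new = np.where(miss, approx, filled)       # update missing cells only
        if np.abs(new - filled).sum() <= tol:      # total-change stopping rule
            return new
        filled = new
    return filled

# Toy usage: a rank-1 matrix with one entry knocked out.
u, w = np.array([1., 2., 3., 4.]), np.array([1., 2., 4.])
M = np.outer(u, w)
M[0, 1] = np.nan          # true value is 2.0
B = svd_impute(M, K=1)
```

On noiseless low-rank data like this, the iteration recovers the deleted entry closely while leaving observed entries untouched.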
2.2.3 KNNImpute
KNNImpute (Troyanskaya et al., 2001) is proposed as a robust and sensitive method for missing value estimation. This method aims to find the $K$ genes from the candidate gene set $X_{(-v)}$ whose expression levels are most similar to $a_{i(-v)}$. The similarity is measured by the Euclidean distance defined as

$$ d^2_{iz} = \sum_{j=1}^{m-l} (a_{ij} - x_{zj})^2, \qquad (2.7) $$

where $m - l$ is the number of available expression levels of the target gene $i$, and $a_{ij}$ and $x_{zj}$ are the expression levels of gene $i$ and gene $z$ in array $j$, for $z = 1, \cdots, Z$ and $z \ne i$, with $Z$ denoting the number of candidate genes. The estimate for a missing value $a_{iv}$ is the weighted average of the expression values of these $K$ closest genes in array $v$,

$$ \hat{a}_{iv} = \sum_{k=1}^{K} w_k g_{kv}, \qquad (2.8) $$

where $w_k = 1/(d_{ik} C)$ are the weights and $C = \sum_k d_{ik}^{-1}$ is the normalizing constant, for $k = 1, \ldots, K$. The weights are proportional to the inverse of the distances, thus giving higher weights to the expression values of candidate genes more similar to the target gene $i$.

Troyanskaya et al. (2001) found that in most cases, KNNImpute, with $K$ in the range from 10 to 20, performs better and is more robust on non-time series data or on time-series data with no obvious global correlation structures. On the other hand, SVDImpute works well on time series data with low noise levels and with strong global correlation structures.
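Equations (2.7) and (2.8) can be sketched for a single missing entry as below; this is an illustrative numpy implementation under our own naming, not the thesis's appendix KNNImpute script, and the small guard against zero distances is our addition.

```python
import numpy as np

def knn_impute_one(a_i, X, v, K=2):
    """KNNImpute sketch for one missing entry a_{iv}: average the K candidate
    genes closest to the target (Euclidean distance over observed arrays),
    weighting each by the inverse of its distance, as in Eqs. (2.7)-(2.8)."""
    obs = np.setdiff1d(np.arange(len(a_i)), [v])
    d = np.sqrt(((X[:, obs] - a_i[obs]) ** 2).sum(axis=1))   # Eq. (2.7)
    d = np.maximum(d, 1e-12)          # guard: avoid division by a zero distance
    nearest = np.argsort(d)[:K]       # indices of the K closest candidates
    w = 1.0 / d[nearest]
    w = w / w.sum()                   # w_k = 1/(d_ik C) with C = sum(1/d_ik)
    return float(w @ X[nearest, v])   # Eq. (2.8)

# Toy usage: two near neighbors and one distant gene.
a_i = np.array([1.0, 2.0, np.nan, 4.0])
X = np.array([[1.1, 2.1, 3.0, 4.1],
              [0.9, 1.9, 3.2, 3.9],
              [10., 10., 10., 10.]])
est = knn_impute_one(a_i, X, v=2, K=2)
```

Here the two neighbors sit at (essentially) equal distances from the target, so the estimate is close to the plain average of their values in the missing array, 3.1.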
2.2.4 LSImpute
LSImpute (Bo et al., 2004) tries to adaptively exploit the correlations between genes and between arrays to estimate the missing values based on least squares regression models. The algorithm first uses two basic methods, LSImpute_gene and LSImpute_array, for imputation of missing values. It then produces weighted averages of the estimates from the LSImpute_gene and LSImpute_array methods.

LSImpute_gene considers the correlations between the candidate genes and the target gene $i$, in which the arrays are the observations. The first step in the LSImpute_gene method is to select, from the set of candidate genes, the $K$ genes most correlated with the target gene $i$ based on the absolute correlation values, to form the expression matrix $G$. The next step is to fit $K$ simple linear regression models, one for each of the $K$ selected genes against the available expression levels of the target gene $i$, as given in Equation (2.9):

$$ a^T_{i(-v)} = \left( \mathbf{1} \;\; g^T_{k(-v)} \right) \begin{pmatrix} \alpha_k \\ \beta_k \end{pmatrix}, \qquad (2.9) $$

where $g_{k(-v)}$ is the $k$th most correlated gene excluding the arrays $v$ in which the target gene $i$ has missing values, for $k = 1, \ldots, K$. From the $K$ models, $K$ estimates of the missing value $a_{iv}$ are obtained using Equation (2.10), and the final estimate is computed as a weighted average of the estimates using Equation (2.11):

$$ \hat{a}_{ivk} = \hat{\alpha}_k + \hat{\beta}_k g_{kv}, \qquad (2.10) $$

$$ \hat{a}_{iv} = \sum_{k=1}^{K} w_k \hat{a}_{ivk}, \qquad (2.11) $$

with

$$ w_k = \left( \frac{r^2_{a_i g_k}}{1 - r^2_{a_i g_k} + \epsilon} \right)^2, \qquad (2.12) $$

where $w_k$ is the weight of the gene $g_k$, $r_{a_i g_k}$ is the estimated correlation between the target gene $i$ and the $k$th gene in the set of $K$ selected genes, and $\epsilon = 10^{-6}$ guards against the case $r^2_{a_i g_k} = 1$.
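Equations (2.9)-(2.12) can be sketched as follows. This is an illustrative numpy version under our own naming, not Bo et al.'s implementation or the thesis's appendix code; normalizing the weights of Eq. (2.12) to sum to one is our choice.

```python
import numpy as np

def ls_gene_impute(a_i, X, v, K=3, eps=1e-6):
    """LSImpute_gene sketch: regress the target on each of the K most
    correlated candidate genes (Eq. 2.9), predict a_{iv} from each fit
    (Eq. 2.10), and combine with the correlation-based weights of Eq. (2.12)."""
    obs = np.setdiff1d(np.arange(len(a_i)), [v])
    y = a_i[obs]
    r = np.array([np.corrcoef(x[obs], y)[0, 1] for x in X])
    top = np.argsort(-np.abs(r))[:K]          # K most correlated candidates
    preds, wts = [], []
    for k in top:
        x = X[k, obs]
        beta = np.cov(x, y)[0, 1] / np.var(x, ddof=1)   # simple OLS slope
        alpha = y.mean() - beta * x.mean()
        preds.append(alpha + beta * X[k, v])            # Eq. (2.10)
        r2 = r[k] ** 2
        wts.append((r2 / (1.0 - r2 + eps)) ** 2)        # Eq. (2.12)
    wts = np.array(wts) / np.sum(wts)                   # normalized weights
    return float(np.dot(wts, preds))                    # Eq. (2.11)

# Toy usage: two perfectly correlated candidates plus one noise gene.
a_i = np.array([1., 2., np.nan, 4.])
X = np.array([[2., 4., 6., 8.],        # exactly 2 * target
              [1.5, 2.5, 3.5, 4.5],    # target + 0.5
              [5., 1., 7., 2.]])       # weakly related noise
est = ls_gene_impute(a_i, X, v=2, K=3)
```

The two perfectly correlated genes receive enormous weights under Eq. (2.12), so the noise gene's prediction is effectively ignored and the estimate lands at the true value 3.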
In the LSImpute_array method, multiple regression is applied based on the correlation between arrays, and now the genes are the observations. Assume that the target gene $i$ has $l$ missing values in the first $l$ arrays, $a_{i(v)} = [a_{i,1}, \ldots, a_{i,l}]$, and $m - l$ non-missing values, $a_{i(-v)} = [a_{i,l+1}, \ldots, a_{i,m}]$, where $m$ is the total number of arrays. To implement the LSImpute_array method, the missing values in the data set $A$ are substituted by the estimates from the LSImpute_gene method; the resulting matrix is denoted as $A'$, and the imputed expression vector $a_{i(v)}$ is denoted as $a'_{i(v)}$. Let $\bar{a}'_{(v)} = [\bar{a}_1, \cdots, \bar{a}_l]$ and $\bar{a}'_{(-v)} = [\bar{a}_{l+1}, \cdots, \bar{a}_m]$ be the vectors of column means of $A'$, where $\bar{a}_j$ is the mean expression of array $j$. Then the least squares estimate of $a_{i(v)}$ given $a_{i(-v)}$ is

$$ \hat{a}^T_{i(v)} = \bar{a}'^T_{(v)} + S_{a'_{i(v)} a_{i(-v)}} S^{-1}_{a_{i(-v)} a_{i(-v)}} \left( a^T_{i(-v)} - \bar{a}'^T_{(-v)} \right), \qquad (2.13) $$

with

$$ S_{a'_{i(v)} a_{i(-v)}} = \begin{pmatrix} s_{a'_{i,1} a_{i,l+1}} & \cdots & s_{a'_{i,1} a_{i,m}} \\ \vdots & \ddots & \vdots \\ s_{a'_{i,l} a_{i,l+1}} & \cdots & s_{a'_{i,l} a_{i,m}} \end{pmatrix}, \qquad (2.14) $$

$$ S_{a_{i(-v)} a_{i(-v)}} = \begin{pmatrix} s_{a_{i,l+1} a_{i,l+1}} & \cdots & s_{a_{i,l+1} a_{i,m}} \\ \vdots & \ddots & \vdots \\ s_{a_{i,m} a_{i,l+1}} & \cdots & s_{a_{i,m} a_{i,m}} \end{pmatrix}, \qquad (2.15) $$

and

$$ s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}). \qquad (2.16) $$

In the above equations, the elements $s_{xy}$ of $S_{a'_{i(v)} a_{i(-v)}}$ and $S_{a_{i(-v)} a_{i(-v)}}$ are the empirical covariances between array $x$ and array $y$. Finally, a missing value $a_{iv}$ is estimated using the LSImpute_gene estimate $\hat{a}_{iv,\mathrm{gene}}$ and the LSImpute_array estimate $\hat{a}_{iv,\mathrm{array}}$ by

$$ \hat{a}_{iv} = p\, \hat{a}_{iv,\mathrm{gene}} + (1 - p)\, \hat{a}_{iv,\mathrm{array}}, \qquad (2.17) $$

where $p \in [0, 1]$ is a mixing coefficient. The value of $p$ is determined by minimizing the sum of squared errors obtained by re-estimating 5% of the known values while treating them as missing:

$$ \sum e_c^2 = \sum \left[ p\, e_{\mathrm{gene}} + (1 - p)\, e_{\mathrm{array}} \right]^2, \qquad (2.18) $$

where $e_{\mathrm{gene}} = \hat{a}_{iv,\mathrm{gene}} - a_{iv}$ and $e_{\mathrm{array}} = \hat{a}_{iv,\mathrm{array}} - a_{iv}$.
Here $a_{iv}$ is the observed value. This algorithm compares favorably with the KNNImpute method with respect to running time as the number of missing values increases (Bo et al., 2004).
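The mixing coefficient $p$ of Equation (2.17) admits a closed-form minimizer of Equation (2.18): setting the derivative in $p$ to zero gives $p = \sum e_{\mathrm{array}}(e_{\mathrm{array}} - e_{\mathrm{gene}}) / \sum (e_{\mathrm{array}} - e_{\mathrm{gene}})^2$, clipped to $[0, 1]$. The closed form is our derivation, not spelled out in the text; a sketch:

```python
import numpy as np

def mixing_coefficient(e_gene, e_array):
    """Minimize sum [p*e_gene + (1-p)*e_array]^2 over p in [0, 1] (Eq. 2.18),
    given the errors of both methods on known values treated as missing."""
    e_gene, e_array = np.asarray(e_gene, float), np.asarray(e_array, float)
    diff = e_array - e_gene
    denom = np.sum(diff ** 2)
    if denom == 0:                 # both methods err identically: any p works
        return 0.5
    p = np.sum(e_array * diff) / denom
    return float(np.clip(p, 0.0, 1.0))
```

As a sanity check, when the gene-based estimates are error-free the optimum puts all weight on them ($p = 1$), and symmetrically $p = 0$ when the array-based estimates are error-free.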
2.2.5 OLSImpute
OLSImpute (Nguyen et al., 2004) is similar to the regression method LSImpute_gene of LSImpute (Bo et al., 2004). Rather than using the absolute correlation as the criterion to select K significant candidate genes, OLSImpute selects the K genes based on Euclidean distance. An ordinary least squares regression is fitted for each of the K selected genes, in the same way as Equation (2.9) in the LSImpute_gene method. Each of the K estimates of a missing value a_{iv} is predicted using Equation (2.10). The final estimate of a_{iv} is the weighted average of the K separate estimates,

\hat{a}_{iv} = \sum_{k=1}^{K} w_k \hat{a}_{iv,k}.   (2.19)

The weights can be based on the distances used to select the K candidate genes, w_k = (1/d_k) \big/ \sum_{k=1}^{K} (1/d_k). If equal weight is desired for each of the K separate estimates, then w_k = 1/K.
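The two weighting schemes above can be sketched as follows. This is a minimal NumPy illustration with invented function names, not Nguyen et al.'s implementation.

```python
import numpy as np

def distance_weights(d):
    """Distance-based weights w_k = (1/d_k) / sum_k(1/d_k): closer
    candidate genes get larger weights, and the weights sum to one."""
    inv = 1.0 / np.asarray(d, dtype=float)
    return inv / inv.sum()

def ols_impute_combine(estimates, d=None):
    """Weighted average of the K separate OLS estimates, Equation (2.19).
    With d=None, every estimate receives equal weight 1/K."""
    estimates = np.asarray(estimates, dtype=float)
    if d is None:
        w = np.full(len(estimates), 1.0 / len(estimates))
    else:
        w = distance_weights(d)
    return float(w @ estimates)
```

For example, with estimates 2 and 4 from genes at distances 1 and 3, the closer gene receives weight 3/4 and the combined estimate is 2.5 rather than the unweighted mean 3.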
2.2.6 PLSImpute
Partial least squares (PLS) imputation (PLSImpute) (Nguyen et al., 2004) is a regression method with a sequence of gene components as predictors and the target gene as the response variable. The gene components are constructed using the expression levels of the candidate genes, and the missing values are predicted using the fitted regression model. As defined before, a_{i(−v)} is the vector of available values of target gene i, and a_{i(v)} is the vector of missing entries to be imputed. The expression matrix of the candidate genes is denoted X, which can be partitioned according to the available and missing values of gene i as follows,

a_i = \begin{bmatrix} a_{i(-v)} \\ a_{i(v)} \end{bmatrix}, \qquad X = \begin{bmatrix} X_{(-v)} & X_{(v)} \end{bmatrix},

where X_{(−v)} is the matrix of available values corresponding to a_{i(−v)} and X_{(v)} consists of the available values corresponding to the missing values a_{i(v)} of target gene i. The pair (X_{(−v)}^T, a_{i(−v)}^T) is the training data and X_{(v)} is the test data used to predict the missing values a_{i(v)}.
Since the number of samples (arrays) is much smaller than the number of available genes in the training data, PLS reduces dimensions by extracting the gene components sequentially. The gene components are linear combinations of the set of candidate genes with maximum covariance with the target gene i. The kth gene component for the target gene i is given by t_{ik(-v)} = X_{(-v)}^T w_{ik}, where w_{ik} is a weight vector satisfying the following criterion,

w_{ik} = \arg\max_{w} \operatorname{cov}^2\!\left( X_{(-v)}^T w,\; a_{i(-v)}^T \right),   (2.20)

where w_{ik}^T w_{ik} = 1. In addition, w_{ik} is subject to the orthogonality constraints

w_{ik}^T S w_{id} = 0, \quad \text{for all } 1 \le d < k.   (2.21)
A linear regression model is fitted with the available expression levels of target gene i as the response and the K PLS gene components as predictors,

a_{i(-v)}^T = \begin{bmatrix} \mathbf{1} & t_{i1(-v)} & \cdots & t_{iK(-v)} \end{bmatrix} \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_K \end{bmatrix},   (2.22)

where t_{ik(-v)} is the kth gene component and the \beta_k's are the least squares regression coefficients for target gene i. The weights and the test data X_{(v)} are used to construct the test PLS components that impute the missing values a_{i(v)}. The kth test PLS component is given by t_{ik(v)} = X_{(v)}^T w_{ik}. The missing values are predicted using the fitted model,

\hat{a}_{i(v)}^T = \begin{bmatrix} \mathbf{1} & t_{i1(v)} & \cdots & t_{iK(v)} \end{bmatrix} \begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_K \end{bmatrix}.   (2.23)

In this method, missing values in the matrix X_{(−v)} are allowed; they are replaced by the estimates from KNNImpute before the PLS gene components are constructed.
Nguyen et al. (2004) provided the algorithms of KNNImpute, OLSImpute, and PLSImpute and evaluated their performances as measured by the normalized relative estimation error (RAE), \sum_i (|y - \hat{y}|)/y, where y is the vector of true expression values. They found that the KNNImpute method performs well on average for the microarray data sets they used, and that good accuracy in imputing missing values can be achieved using 3 or 4 PLS components in the PLSImpute method.
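The PLS regression step of Equations (2.20)–(2.23) can be sketched with a minimal univariate PLS (PLS1) implementation via the NIPALS algorithm. This is an illustrative NumPy sketch under the usual PLS1 conventions, not Nguyen et al.'s code; the function names are invented for the example.

```python
import numpy as np

def pls1_fit(X, y, K):
    """Univariate PLS (PLS1) via the NIPALS algorithm.
    X: (arrays, genes) training predictors X_(-v)^T; y: available values
    a_i(-v) of the target gene.  Extracts K gene components t = X w with
    unit-norm weight vectors w, deflating X and y after each component."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(K):
        w = Xk.T @ yk                   # direction of maximal covariance
        w = w / np.linalg.norm(w)       # w^T w = 1, cf. Equation (2.20)
        t = Xk @ w                      # the gene component (score)
        tt = t @ t
        p = Xk.T @ t / tt               # predictor loadings
        qk = (yk @ t) / tt              # regression of y on the component
        Xk = Xk - np.outer(t, p)        # deflation enforces orthogonality
        yk = yk - qk * t
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    B = W @ np.linalg.solve(P.T @ W, q)  # coefficients in terms of X
    return x_mean, y_mean, B

def pls1_predict(model, X_new):
    """Predict missing values from the test data X_(v)^T, cf. (2.23)."""
    x_mean, y_mean, B = model
    return (X_new - x_mean) @ B + y_mean
```

With as many components as predictors, PLS1 reduces to ordinary least squares, so a target gene that is exactly a linear combination of the candidate genes is recovered exactly on the training arrays.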
2.2.7 LLSImpute
Kim et al. (2005) developed a local least squares (LLSImpute) method based on the KNNImpute algorithm and the LSImpute method. In this method, the K genes most correlated with a target gene i are selected from the set of candidate genes based on Euclidean distance or the Pearson correlation coefficient. Once the K genes are identified, the known expression values a_{i(−v)} of the target gene i are represented as a linear combination of the K genes,

a_{i(-v)}^T = G_{(-v)}^T b,   (2.24)

where G_{(−v)}^T is the expression matrix of the K genes excluding the arrays v in which the target gene i has missing values, and b is the vector of coefficients of the linear combination
obtained by least squares. The missing values a_{i(v)} can be estimated by

\hat{a}_{i(v)}^T = G_{(v)}^T \hat{b}.   (2.25)

They demonstrated that LLSImpute showed the best performance when strong local correlation structure is present in the data.
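Equations (2.24)–(2.25) amount to one least squares solve followed by a matrix-vector product. A minimal NumPy sketch (the function name and array layout are assumptions for the example, not Kim et al.'s code):

```python
import numpy as np

def lls_impute(target_avail, G_avail, G_miss):
    """Local least squares imputation, Equations (2.24)-(2.25).
    target_avail: known values a_i(-v) of the target gene,
    G_avail:      (K, m-l) expressions of the K similar genes on the
                  arrays where the target gene is observed,
    G_miss:       (K, l) expressions of the same genes on the arrays
                  where the target gene is missing.
    Solves a_i(-v)^T ~ G_(-v)^T b by least squares, then returns
    G_(v)^T b_hat as the estimate of the missing values."""
    b_hat, *_ = np.linalg.lstsq(G_avail.T, target_avail, rcond=None)
    return G_miss.T @ b_hat
```

If the target gene is an exact linear combination of the K selected genes, the missing entries are recovered exactly; otherwise the least squares fit gives the closest local linear reconstruction.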
2.3 Identification of Differentially Expressed Genes
One of the main goals of microarray data analysis is to compare the expression levels of genes in two different samples, where one sample is considered the reference or control and the other is considered the treatment. Obvious examples include comparing healthy versus diseased tissues or treated versus untreated tissues. With one-color systems, the comparisons occur between gene transcript levels from two or more sets of arrays. With two-color cDNA arrays, the ratio of the two fluorescence intensities (Red, Green) for each spot is used to make comparisons (Speed, 2003).
A number of methods have been proposed for detecting DE genes in microarray experiments since the microarray technique was developed. A straightforward method is to use the traditional F/t-statistics (Dudoit et al., 2002; Draghici, 2003) to identify genes with large ratios of between-group mean differences to variances in transcript levels. However, in a typical microarray data set, the number of genes is large (> 1000) whereas the number of replicate arrays is small. The performance of the F/t-statistics depends on the sample size and on whether transcript levels are normally distributed. As the number of replicates within treatments is often small for microarray experiments and the data may not be normally distributed, the performance of the F/t-statistics is usually poor. Several modeling approaches have been suggested to detect significantly expressed genes based on various distributional assumptions, such as a likelihood ratio test approach (Wang and Ethier, 2004) and a mixture model approach (Pan et al., 2001). In addition, some researchers have proposed nonparametric methods like the Mann-Whitney U-test and the Wilcoxon rank test, but the power of these tests is lower than that of the F/t-statistics because of the small sample sizes in microarray studies (Wang and Ethier, 2004).
Recently, Yano et al. (2006) proposed a gene discovery method based on correspondence analysis with a new index for expression ratios and three artificial marker genes. This method uses the scores (coordinates) of the first two or three principal components to identify genes related to phenotypes or treatments of interest in a 2-dimensional or 3-dimensional subspace. The new index for gene expression ratios is the arctangent of the reciprocal intensity ratio, arctan(1/ratio), and ranges from 0 to 90. When a ratio is equal to one, the index is 45. As the gene transcript is increased or decreased, this index increases or decreases from 45, respectively. To identify differentially expressed gene transcripts, three artificial genes, ExtraGene1, ExtraGene2, and ExtraGene3, are added to the dataset from a quantitative trait with two phenotypes (A and B). The expression ratio of ExtraGene1 is zero in the samples of phenotype A, and the maximum expression ratio, for example 100, is given to the B samples. ExtraGene2 has the inverse gene expression pattern to ExtraGene1. ExtraGene3 shows the same ratio (1.0) in all samples. ExtraGene1 and ExtraGene2 assist in the discovery of genes differentially expressed in the two phenotypes, and ExtraGene3 aids in the identification of housekeeping genes.
In a CA biplot obtained using the new index and the three ExtraGenes, up- and down-regulated genes have negative and positive scores on Factor1, respectively, and lie on a line segment between ExtraGene1 and ExtraGene2. This line segment is called the UDL (up/down line). All housekeeping genes and ExtraGene3 lie at the center of the UDL, the origin of the subspace. A distance from an ExtraGene, called the significant distance, is estimated and used to identify up-/down-regulated and housekeeping genes; it is calculated by the following formula:
d = \sqrt{ \chi^2_{\nu,\alpha} \Big/ \sum_{i=1}^{n} f_i },   (2.26)

where \nu is the degrees of freedom or the subspace dimension, \alpha is a predefined significance level, n is the number of samples, and f_i is the new index, arctan(1/ratio), for the ExtraGene in the ith sample. For example, Yano et al. applied the confidence areas to a space of seven dimensions for the breast cancer data, which had 62 samples in each of the two phenotypes. So the significant distance at the 5% significance level from ExtraGene3 becomes

d = \sqrt{14.0671/(45 \times 124)} = 0.0502.

The genes located outside the confidence area are identified as differentially expressed genes.
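The worked example above can be checked directly. The sketch below uses the chi-square critical value 14.0671 quoted in the text for \nu = 7 and \alpha = 0.05 (it could equally be computed, e.g. with scipy.stats.chi2.ppf(0.95, 7)); the function name is invented for the illustration.

```python
import math

def significant_distance(chi2_crit, f):
    """Significant distance d = sqrt(chi2_{nu,alpha} / sum_i f_i),
    Equation (2.26); f holds the index values of an ExtraGene across
    the samples."""
    return math.sqrt(chi2_crit / sum(f))

# Breast-cancer example: 124 samples, and ExtraGene3 (a housekeeping
# marker) has index arctan(1/1) = 45 in every sample.
d = significant_distance(14.0671, [45.0] * 124)
print(round(d, 4))  # 0.0502
```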
2.4 Application of Correspondence Analysis to Microarray Data
Correspondence analysis is an algebraic technique which allows a simultaneous representation of two data sets that comprise the rows and columns of a data matrix with non-negative entries (Greenacre, 1984). It treats the rows and columns of the data matrix in a symmetrical fashion and creates a low-dimensional projection from an originally high-dimensional data set, revealing interdependencies between the two sets of variables. The biplot is commonly used to display the relationships both between row variables and between column variables by plotting their projections on two independent dimensions.
CA has been applied over the years in psychometric and economic studies. However, its application to microarrays is relatively new. Kishino and Waddell (2000) applied CA to data consisting of 6500 genes expressed in 40 tumor and 22 normal colon tissue samples by Alon et al. (1999) to visualize the relationship between genes and tissues. Correspondence analysis gives most weight to over-expressed genes, and these are best seen associating with the tissues that over-express them. The genes that are expressed at similar levels in most tissues are located around the center of the CA biplots, whereas genes that have a high expression level in one tissue (e.g. tissue A) and a low average expression level in the other tissues are drawn to that tissue (e.g. tissue A). Kishino and Waddell also observed that some tissues from different tissue types are located close together, whereas some tissues from the same tissue types are located far from each other. Exploring the association among tissues based on gene expression with correspondence analysis, they indicated that the distances between two tissues in a CA biplot are determined by the correlation between the two tissues rather than by their types.
Fellenberg et al. (2001) demonstrated the applicability of correspondence analysis to both time series and non-time-series data sets. The time series data was the well known Saccharomyces cerevisiae cell-cycle synchronization data by Spellman et al. (1998), which was composed of 800 genes and 73 hybridizations (four synchronization methods at various time points). The data set was translated to the logarithmic ratios of the intensities of the two-channel fluorescence signals. To implement CA, Fellenberg et al. first found the minimum logarithmic ratio in the data, then shifted the whole data set to the positive range by adding |minimum| + 1. The missing values were treated as unchanged, i.e. ratio = 1. The non-time-series data originated from the authors' lab and investigated the transition of CDC14 from mitosis to G1, for two of four cell cycle stages. The data consisted of 1400 genes and two paired experimental conditions. Fellenberg et al. showed that CA biplots "provide an informative and concise means of visualizing [microarray data and] are able to discover relationships both among either genes or hybridizations and between genes and hybridizations" (Fellenberg et al., 2001). For example, in the biplot obtained by applying CA to the cell-cycle data, the hybridizations were clearly separated, genes were located near the hybridizations in which they had strong expression, and genes with similar expression patterns were clustered together. In addition, CA can be applied generally to microarray data, either as intensities or as log ratios.
Tan et al. (2004) extended and applied CA to high dimensional microarray time-course data in case-control studies. The features of CA were used to explore the various time-course gene expression profiles existing in the data, e.g. to identify the differential time-course expression profiles conditional on the disease status and/or to find shared time-course gene expression patterns in both the patients and the normal controls. By coordinating and examining the biplot produced by applying CA to diabetes data consisting of two patients and two normal controls, they identified differential and similar time-dependent gene expression profiles in response to insulin stimulation in a given disease status and found that the second axis characterized the time-course patterns regardless of the disease status. With the sample replicates, the bootstrap technique was applied to CA to assess the significance of the contribution by each gene and by each time-course experiment. The contribution is the proportion of the inertia of the lth leading axis explained by the ith gene, defined as

c_{il} = r_{i\cdot}\, g_{il}^2 / \lambda_l^2,   (2.27)

where r_{i\cdot} is the sum of row i of the matrix A divided by the total of A, g_{il} is the coordinate of gene i on the lth axis (see the formula for G in Chapter 3), and \lambda_l^2 is the eigenvalue of the lth dimension. The expected contribution for each gene is 1/n for
all the dimensions, where n is the number of genes. They generated 30,000 bootstrap samples and evaluated the significance of the contribution on the leading axes, c_{i\cdot} = \sum_l c_{il}, for each gene using the bootstrap p-value p_i \equiv \sum_{j=1}^{30{,}000} I(c_{i\cdot j} \le c_e)/30{,}000, where I(\cdot) is the indicator function, c_{i\cdot j} is the contribution estimated for gene i in bootstrap sample j, and c_e = 1/n is the expected contribution. A bootstrap sample matrix was obtained by taking the average gene expression of two bootstrap matrices, which were each formed by randomly picking rows with replacement from two control replicates and two diabetic replicates. Genes were ranked by the significance levels of their contributions on each of the first three axes and the top genes were selected.
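The contribution measure of Equation (2.27) and the bootstrap p-value can be sketched as follows. This is an illustrative NumPy sketch with invented function names, not Tan et al.'s code; it assumes the row masses, principal coordinates, and principal inertias have already been obtained from a CA fit.

```python
import numpy as np

def contributions(r, G, lam2):
    """Contribution of each gene to each leading axis, Equation (2.27):
    c_il = r_i * g_il^2 / lambda_l^2, with row masses r, principal
    coordinates G (n genes x L axes), and principal inertias lam2.
    For a consistent CA solution the contributions on each axis sum to 1."""
    r, G, lam2 = np.asarray(r), np.asarray(G), np.asarray(lam2)
    return (r[:, None] * G ** 2) / lam2[None, :]

def bootstrap_pvalue(c_boot, n):
    """Bootstrap p-value for one gene: the fraction of bootstrap
    contributions c_i.j that fall at or below the expected value
    c_e = 1/n, where n is the number of genes."""
    return float(np.mean(np.asarray(c_boot) <= 1.0 / n))
```

A small p-value means the gene's contribution rarely drops to the expected level across bootstrap samples, i.e. the gene consistently explains more than its share of the inertia on the leading axes.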
Methods

3.1 Correspondence Analysis

3.1.1 Introduction
Correspondence analysis (CA) is a statistical visualization method for analyzing simple two-way and multi-way tables containing some measure of correspondence between the rows and columns. It originated in France, and the name was translated from the French Analyse des Correspondances, where the term correspondence denotes a "system of associations" between the elements of two sets (Greenacre, 1984). As an exploratory and descriptive data analytic technique, CA remarkably simplifies complex data, identifies systematic relations between variables, and provides a detailed description of the information in the data. It was developed early on for contingency tables and later applied to general data matrices. In this chapter, we summarize the main features of correspondence analysis. Further details can be found in the references (Greenacre, 1984 and 1993).
In a contingency table, interest often centers on the association between rows and columns. The significance of the association is tested using the Chi-square test, but this test provides no information as to which individual associations between row-column pairs of the data matrix are significant. Correspondence analysis shows how the variables are related, not just that a relationship exists, by jointly displaying the rows and columns of a contingency table as points in a constructed orthogonal system of principal factors. The positions of the row and column points are consistent with their associations in the table, and the distances among the row points and among the column points are weighted Chi-squared distances. The principal factors or axes are constructed according to the information they represent and therefore are presented in decreasing order of variability. The goal is to have a global view of the data that is useful for interpretation. The dimension of a contingency table is m − 1, where m is the minimum of the number of rows and columns. The information included in a subspace of dimension p (p < m − 1) equals the sum of the information included in the p factors. The average percentage of the total information represented by one factor is 100/(m − 1), which serves as a guide in determining the relative importance of a given factor. In this system, proximity between observations or between variables is interpreted as strong association. For microarray data, the genes, as the variables, are in the rows and the treatments or replicates, as the observations, are in the columns. Correspondence analysis displays observations and variables simultaneously in the same space, making it easy to discover the salient information included in a given data table.
3.1.2 Basic Concepts and Definitions
The commonly used concepts in correspondence analysis are described below.
Primitive matrix: the original data matrix, N (I × J), or contingency table, is called the primitive matrix. The elements of this matrix are n_{ij}, with n_{ij} ≥ 0.

Correspondence matrix: the correspondence matrix P is defined as the primitive matrix N divided by the grand total n = \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij},

P = N/n,   (3.1)

which shows how one unit of mass is distributed across the cells.

Masses: let r denote the row mass, a vector of length I which contains the row sums of the correspondence matrix. Let c denote the column mass, a vector of length J containing the column sums:

r = P\mathbf{1} \quad \text{and} \quad c = P^T \mathbf{1},   (3.2)

where \mathbf{1} is a vector of 1's.
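These definitions can be illustrated numerically. A minimal NumPy sketch, with the entries of N invented for the example:

```python
import numpy as np

# a small primitive matrix N (values are made up for illustration)
N = np.array([[10.0, 20.0, 30.0],
              [ 5.0, 15.0, 20.0]])
n = N.sum()          # grand total
P = N / n            # correspondence matrix, Equation (3.1)
r = P.sum(axis=1)    # row masses,    r = P 1
c = P.sum(axis=0)    # column masses, c = P^T 1
```

By construction the row masses and column masses each sum to one, since P distributes one unit of mass across the cells.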
Profiles: let R and C denote the matrices of row and column profiles, respectively. Then

R \equiv D_r^{-1} P \equiv \begin{bmatrix} \tilde{r}_1^T \\ \vdots \\ \tilde{r}_I^T \end{bmatrix}, \qquad C \equiv D_c^{-1} P^T \equiv \begin{bmatrix} \tilde{c}_1^T \\ \vdots \\ \tilde{c}_J^T \end{bmatrix},   (3.3)

where D_r = diag(r) is the diagonal matrix of row masses, D_c = diag(c) is the diagonal matrix of column masses, \tilde{r}_i is the ith row of P divided by r_i, and \tilde{c}_j is the jth column of P divided by c_j.

Clouds of points N(I) and N(J): the row cloud of points N(I) is the set of I row profiles \tilde{r}_1, \ldots, \tilde{r}_I in J-dimensional space, whose masses are the elements of r. Similarly, the column cloud of points N(J) is the set of J column profiles \tilde{c}_1, \ldots, \tilde{c}_J in I-dimensional space, whose masses are the elements of c.
Weighted distances and Chi-square distances: in CA, a variant of the Euclidean distance, called the weighted Euclidean distance, is used to measure the distances between profile points. The squared distance between two profiles x and y is given by

d^2(x, y) = (x - y)^T D_q (x - y),   (3.4)

where D_q is the matrix of dimension weights: D_q = D_c^{-1} in the J-dimensional space defined by the columns, and D_q = D_r^{-1} in the I-dimensional space defined by the rows. For example, the squared distance between the profiles of rows i and i′ in J-dimensional space and the squared distance between the profiles of columns j and j′ in I-dimensional space, given in Equations (3.5) and (3.6), are called Chi-square distances:

d^2(\tilde{r}_i, \tilde{r}_{i'}) = (\tilde{r}_i - \tilde{r}_{i'})^T D_c^{-1} (\tilde{r}_i - \tilde{r}_{i'}),   (3.5)
d^2(\tilde{c}_j, \tilde{c}_{j'}) = (\tilde{c}_j - \tilde{c}_{j'})^T D_r^{-1} (\tilde{c}_j - \tilde{c}_{j'}).   (3.6)
As is well known, the Chi-square statistic, Equation (3.7), tests whether the deviations of the observed frequencies are significant departures from their expected frequencies assuming independence. It can be rewritten in the form of Chi-square distances in terms of row and column profiles, as given in Equations (3.8) and (3.9):

\chi^2 = \sum_{(i,j) \in I \times J} (o_{ij} - e_{ij})^2 / e_{ij} = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - n r_i c_j)^2}{n r_i c_j}   (3.7)

= \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{n r_i (\tilde{r}_{ij} - c_j)^2}{c_j} = n \sum_{i=1}^{I} r_i (\tilde{r}_i - c)^T D_c^{-1} (\tilde{r}_i - c)   (3.8)

= \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{n c_j (\tilde{c}_{ij} - r_i)^2}{r_i} = n \sum_{j=1}^{J} c_j (\tilde{c}_j - r)^T D_r^{-1} (\tilde{c}_j - r),   (3.9)

where o_{ij} = n_{ij} and e_{ij} = n r_i c_j are the observed and expected frequencies in the (i, j) cell of the matrix, respectively.
Total inertia: the quantity \chi^2/n is defined as the total inertia of the data matrix, which is a measure of the total variation of the matrix. From Equations (3.8) and (3.9), the total inertia is given as

\text{Total inertia} = \sum_{i=1}^{I} r_i (\tilde{r}_i - c)^T D_c^{-1} (\tilde{r}_i - c) = in(I)   (3.10)

= \sum_{j=1}^{J} c_j (\tilde{c}_j - r)^T D_r^{-1} (\tilde{c}_j - r) = in(J),   (3.11)

where in(I) and in(J) are the total inertias of the clouds of points N(I) and N(J), respectively. The total inertia can be interpreted as the weighted average of the squared \chi^2 distances between the profiles (rows or columns) and their centers.
3.1.3 Algebra of Correspondence Analysis
Correspondence analysis is a technique to find a low-dimensional approximation to the original data matrix that graphically represents the weighted \chi^2 distances between the row (or column) profiles and their average profiles. This low-dimensional subspace is determined by minimizing the sum of squared distances between the profile points and their projections onto the subspace. Singular value decomposition (SVD) gives the solution for the optimal subspace of any specified dimensionality. The SVD decomposes any real I × J matrix S of rank K as the product of three matrices:

S = U D_\alpha V^T,   (3.12)

i.e.,

S = \sum_{k=1}^{K} \alpha_k u_k v_k^T,   (3.13)

where U^T U = V^T V = I and \alpha_1 \ge \alpha_2 \ge \ldots \ge \alpha_K > 0. The elements \alpha_1, \alpha_2, \ldots, \alpha_K of the diagonal matrix D_\alpha are called the singular values of S. The columns of U (u_1, u_2, \ldots, u_K) are called the left singular vectors, and the columns of V (v_1, v_2, \ldots, v_K) are called the right singular vectors. The coordinates of the rows and columns of S with respect to the respective basis vectors in V and U are given by Equations (3.14) and (3.15), respectively:

F = U D_\alpha,   (3.14)
G = V D_\alpha.   (3.15)
The optimal K^*-dimensional projection S′ is defined by the first K^* terms of S, i.e.

S' = \sum_{k=1}^{K^*} \alpha_k u_k v_k^T;   (3.16)

then S′ minimizes

\| S - X \|^2 \equiv \sum_i \sum_j (s_{ij} - x_{ij})^2,   (3.17)

amongst all I × J matrices X of rank at most K^* (Greenacre, 1984).
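The optimality property of Equations (3.16)–(3.17) (the Eckart–Young result) can be checked numerically: the squared Frobenius error of the rank-K^* truncation equals the sum of the discarded squared singular values. A minimal NumPy sketch with an arbitrary random matrix:

```python
import numpy as np

rng = np.random.default_rng(1)
S = rng.normal(size=(6, 4))                  # any real I x J matrix

U, alpha, Vt = np.linalg.svd(S, full_matrices=False)
K_star = 2
# rank-K* truncation, Equation (3.16): keep the first K* SVD terms
S_prime = (U[:, :K_star] * alpha[:K_star]) @ Vt[:K_star]

# squared approximation error, Equation (3.17)
err = np.sum((S - S_prime) ** 2)
```

Here `err` matches `np.sum(alpha[K_star:] ** 2)` to machine precision, confirming that the discarded singular values account for exactly the unexplained variation.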
In correspondence analysis, we project the row and/or column profiles of a matrix onto an optimal K^*-dimensional subspace using \chi^2 distances. The matrix Q = D_r^{-1/2}(P - rc^T)D_c^{-1/2}, with elements q_{ij} = (p_{ij} - r_i c_j)/\sqrt{r_i c_j}, called the matrix of standardized residuals, is decomposed. To perform this analysis, the matrix (P − rc^T), P centered by the row and column masses, is decomposed as

P - rc^T = A D_\alpha B^T,   (3.18)

using the generalized SVD, where A^T D_r^{-1} A = B^T D_c^{-1} B = I and the columns of A and B are called the principal axes of the column and row profiles, respectively. The difference between the SVD and the generalized SVD is that the SVD does not take account of masses and dimension weights while the generalized SVD does. Then the matrix D_r^{-1/2}(P - rc^T)D_c^{-1/2}
can be represented as in Equation (3.19). The principal coordinates of the row profiles centered by the column masses, R − 1c^T, and of the column profiles centered by the row masses, C − 1r^T, in terms of their own principal axes and their singular vectors, are given by Equations (3.20) and (3.21), respectively:

D_r^{-1/2}(P - rc^T)D_c^{-1/2} = U D_\alpha V^T,   (3.19)
F = D_r^{-1} A D_\alpha = D_r^{-1/2} U D_\alpha,   (3.20)
G = D_c^{-1} B D_\alpha = D_c^{-1/2} V D_\alpha,   (3.21)

where U = D_r^{-1/2} A and V = D_c^{-1/2} B. The first K^* < K columns of F and G provide the optimal solution in K^* dimensions, and the variation explained by the K^*-dimensional subspace is the sum of the first K^* squared singular values, called eigenvalues. Table 3.1 summarizes the algebra of correspondence analysis (Greenacre, 1984). The following are the proofs of Equations (3.20) and (3.21).
Proof. For the rows,

D_r^{-1/2}(P - rc^T)D_c^{-1/2} = D_r^{-1/2}(D_r R - rc^T)D_c^{-1/2} = D_r^{1/2}(R - 1c^T)D_c^{-1/2} = U D_\alpha V^T,

so

R - 1c^T = D_r^{-1/2} U D_\alpha V^T D_c^{1/2} = D_r^{-1/2} U D_\alpha B^T \quad (V = D_c^{-1/2} B),
F = D_r^{-1/2} U D_\alpha = D_r^{-1} A D_\alpha \quad (U = D_r^{-1/2} A).

Similarly, for the columns,

D_r^{-1/2}(P - rc^T)D_c^{-1/2} = D_r^{-1/2}(C^T D_c - rc^T)D_c^{-1/2} = D_r^{-1/2}(C^T - r1^T)D_c^{1/2} = U D_\alpha V^T,

so

(C - 1r^T)^T = D_r^{1/2} U D_\alpha V^T D_c^{-1/2},
C - 1r^T = D_c^{-1/2} V D_\alpha U^T D_r^{1/2} = D_c^{-1/2} V D_\alpha A^T \quad (U = D_r^{-1/2} A),
G = D_c^{-1/2} V D_\alpha = D_c^{-1} B D_\alpha \quad (V = D_c^{-1/2} B).
Table 3.1: Algebra of correspondence analysis.

                        Row                                                   Column
Masses:                 r = P1                                                c = P^T 1
Profiles:               R ≡ D_r^{-1} P ≡ [˜r_1^T, ..., ˜r_I^T]^T              C ≡ D_c^{-1} P^T ≡ [˜c_1^T, ..., ˜c_J^T]^T
Clouds of points:       the I row profiles ˜r_1 ... ˜r_I                      the J column profiles ˜c_1 ... ˜c_J
                        in J-dimensional space                                in I-dimensional space
Masses of clouds:       the I elements of r                                   the J elements of c
Distances:              (x_r − y_r)^T D_c^{-1} (x_r − y_r)                    (x_c − y_c)^T D_r^{-1} (x_c − y_c)
Centroids of clouds:    c = R^T r                                             r = C^T c
Total inertia:          in(I) = Σ_i r_i (˜r_i − c)^T D_c^{-1} (˜r_i − c)      in(J) = Σ_j c_j (˜c_j − r)^T D_r^{-1} (˜c_j − r)
                        = trace[D_r (R − 1c^T) D_c^{-1} (R − 1c^T)^T]         = trace[D_c (C − 1r^T) D_r^{-1} (C − 1r^T)^T]
                        and in(I) = in(J) = trace[D_r^{-1} (P − rc^T) D_c^{-1} (P − rc^T)^T]
Principal coordinates:  F = D_r^{-1} A D_α                                    G = D_c^{-1} B D_α
Basis coordinates:      F = D_r^{-1/2} U D_α                                  G = D_c^{-1/2} V D_α
3.1.4 Projection and Two-Dimensional Maps
Correspondence analysis is a dimension reduction method which identifies an optimal subspace in which the deviations from the expected values can be represented. It simultaneously analyzes the row profiles (row analysis) and the column profiles (column analysis) of the data matrix. The row or column analysis of a matrix is the projection of the row or column profiles from a multidimensional space onto an optimal low-dimensional subspace for interpretation of the inter-profile positions. Equations (3.10) and (3.11) show that the row analysis and the column analysis have the same total inertia, so that both analyses have the same decomposition of inertia into principal inertias along the principal axes.

With the matrices U, V, and D_\alpha, we can project the row profiles into the new space defined by V and the column profiles into the new space defined by U. In general, the projections may take place onto any low-dimensional subspace, but most CA displays are two-dimensional, with (\alpha_1^2 + \alpha_2^2) \big/ \sum_{k=1}^{K} \alpha_k^2 \times 100\% of the total inertia accounted for by the plane of the two axes.
There are two types of two-dimensional displays in correspondence analysis: symmetric maps and asymmetric maps. In an asymmetric map, also called a biplot, one of the sets of variables (e.g. the column variables) is displayed as the extreme vertices representing the principal axis directions of the corresponding variables, while the other (e.g. the row variables) is displayed as profile points. Each vertex can be viewed as an extreme profile concentrated entirely in the corresponding variable. The associations between the two sets of variables, as well as the variation amongst the profile points, are apparent in an asymmetric map. The coordinates of the row vertices and column vertices are given by Equations (3.22) and (3.23):

\Phi = D_r^{-1/2} U,   (3.22)
\Gamma = D_c^{-1/2} V.   (3.23)

If the display is of good quality, say more than 85% of the total inertia is captured by the display, then the closer a profile is to a vertex, the higher its profile is with respect to that variable.
In a symmetric map, the clouds of points N(I) and N(J) are overlaid in a joint display in principal coordinates. The two clouds of points should be interpreted separately, and the association between the rows and the columns cannot be judged graphically in a symmetric map. The distance between any two row (or column) profiles is the Chi-square distance, as mentioned previously. The magnitude of this distance between two points depends on how similar the two profiles are in their shapes rather than in their absolute values. The center of the map is the average profile, so the further a point is from the center, the bigger the difference between its profile and the average profile. By simultaneous inspection of a symmetric map and an asymmetric map, we can easily identify genes that are differentially expressed under a certain treatment.
To illustrate correspondence analysis, let us consider a simple example. The data in Table 3.2 are artificial frequencies of the smoking habits of different employees in a company. This is a 5 × 4 table, thus the dimension of the table is 3 and its row profiles and column profiles lie in three-dimensional spaces.

Table 3.2: Matrix of artificial data.

                         Smoking category
Staff group              None  Light  Medium  Heavy  Row totals
(SM) Senior managers        4      2       3      2          11
(JM) Junior managers        4      3       7      4          18
(SE) Senior employees      25     10      12      4          51
(JE) Junior employees      18     24      33     13          88
(SC) Secretaries           10      6       7      2          25
Column totals              61     45      62     25         193
Assume that we are interested in the number of staff in each group or in each smoking category out of the sample of 193; for example, the group JE has the largest sample size and has the most heavy smokers. However, these marginal views do not describe the patterns of the smoking habits. Table 3.3 lists the relative row frequencies, which are called row profiles.

Table 3.3: Relative frequencies of smoking categories within staff-groups.

                         Smoking category
Staff group              None    Light   Medium  Heavy   Row totals
(SM) Senior managers     0.3636  0.1818  0.2727  0.1818  1
(JM) Junior managers     0.2222  0.1667  0.3889  0.2222  1
(SE) Senior employees    0.4902  0.1961  0.2353  0.0784  1
(JE) Junior employees    0.2045  0.2727  0.3750  0.1477  1
(SC) Secretaries         0.4000  0.2400  0.2800  0.0800  1
Average                  0.3161  0.2332  0.3212  0.1295

From this table it is easy to compare the distributions of the staff groups across the smoking categories; for example, senior employees have the largest proportion of nonsmokers, both within this group and over all five groups, and senior managers and junior managers have relatively higher proportions of heavy smokers over all groups. If we visualize this data
Figure 3.1: Two-dimensional displays obtained by correspondence analysis of the smoking habit data. A: the asymmetric map; B: the symmetric map.
graphically, the row profiles can be displayed in a four-pointed tetrahedron, which can be observed in the three-dimensional space of the points. Figure 3.1 shows the two-dimensional displays obtained by correspondence analysis,
where (A) is the asymmetric map showing the projections of the row profiles (in principal coordinates) and the four column vertex points (in standard coordinates), and (B) is the symmetric map showing the projections of both the row profiles and the column profiles. The triangles denote the smoking categories and the dots correspond to the staff groups. The inertias explained by the first two principal axes are 0.0748 (87.76%) and 0.0100 (11.76%), respectively. The cumulative inertia in the plane is 0.0848, which is 99.5% of the total inertia.
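The reported inertias can be reproduced from Table 3.2 by applying Equations (3.19)–(3.21) directly. The following is an illustrative NumPy sketch (not the software used in the thesis):

```python
import numpy as np

# Table 3.2: the artificial smoking data
N = np.array([[ 4.0,  2,  3,  2],
              [ 4.0,  3,  7,  4],
              [25.0, 10, 12,  4],
              [18.0, 24, 33, 13],
              [10.0,  6,  7,  2]])

P = N / N.sum()                           # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)       # row and column masses
Dr_isqrt = np.diag(1.0 / np.sqrt(r))
Dc_isqrt = np.diag(1.0 / np.sqrt(c))

# matrix of standardized residuals, Equation (3.19), and its SVD
Q = Dr_isqrt @ (P - np.outer(r, c)) @ Dc_isqrt
U, alpha, Vt = np.linalg.svd(Q, full_matrices=False)

F = (Dr_isqrt @ U) * alpha                # row principal coordinates (3.20)
G = (Dc_isqrt @ Vt.T) * alpha             # column principal coordinates (3.21)

inertias = alpha ** 2                     # principal inertias (eigenvalues)
# first two principal inertias: about 0.0748 (87.76%) and 0.0100 (11.76%)
```

The first two principal inertias and their percentages of the total inertia agree with the values quoted for Figure 3.1, and the third (and the numerically zero fourth) singular value accounts for the remaining 0.5%.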
Looking at the positions of the profiles (dots) relative to the vertices (triangles) in Figure 3.1 (A), we see that three smoking categories are on the right and the nonsmoking category is on the left. This means that the right-to-left distinction is tantamount to smokers versus nonsmokers. The groups JE and JM have relatively more heavy smokers than SE does. The actual data are that 14.77% of JEs and 22.22% of JMs are heavy smokers whereas 7.84% of SEs are heavy smokers, which agrees with the interpretation of Figure 3.1 (A).
In Figure 3.1 (B), we notice that the coordinates of the points of the column profiles are scaled differently from those in (A), i.e. g_k = \alpha_k \gamma_k, where g_k is the kth column of G and \gamma_k is the kth column of \Gamma. Figure 3.1 (B) is a joint display in which the separate configurations of the row profiles and the column profiles are overlaid. The distance between the row points (or the column points) approximates the inter-row (or inter-column) \chi^2 distance, which is a measure of similarity between the row (or column) profiles. The group JM is near the group JE since they are relatively similar to each other in terms of their smoking habits. Distances between the points representing smoking categories are interpreted in the same way.
The center of Figure 3.1 (A) or (B) is the average profile. The deviations of the staff groups can be viewed as departures outwards from the average profile in different directions.
3.2 Algorithm for Identifying DE Genes and Outliers
As mentioned in the previous section, the center of a biplot is always the average profile, and the distances between the row (column) profile points and the center of a biplot measure the deviations of the rows (columns) from their average profile. One of our goals is to estimate a region in the biplot surrounding the center such that one can consider, with some level of probability, that the points which lie outside the region are differentially expressed genes or outliers. Figure 3.2 indicates graphically what is desired; it is the biplot for simulated microarray data consisting of three arrays (three treatments or three replicates of a condition). The ellipse describes a confidence region with content 0.95. When comparing treatments, the genes simulated with significantly higher/lower intensities in one array than the others lie on or outside the ellipse; the higher/lower the intensity, the further from the center in the direction or