Academic year: 2021


7 EXPRESSION PROFILING AND FUNCTIONAL GENOMICS: DIFFERENTIAL EXPRESSION

7.1. Introduction
7.2. Identification of differentially expressed genes in a two sample design
7.2.1 Fold Change
7.2.2 T-test
7.2.3 SAM (Significance analysis of microarrays)

Revised 30/11/2006


7.1. Introduction

A microarray experiment measures the expression levels of thousands of genes in parallel. Genes that show little or no change in expression level are typically of no biological relevance. Selecting the genes whose expression varies across the conditions tested is therefore often a crucial step in the analysis of any microarray experiment. Over the years many methods have been proposed for the identification of significantly differentially expressed genes, some of which are discussed below.

When consistent sources of variation have been removed by normalization, the replicate log-ratio measurements for a particular gene can be combined to find out whether a gene is differentially expressed.

A plethora of methods is available to identify differentially expressed genes in a statistically better-founded way than the simple heuristic of a fold test. Distinct classes of models can be discerned, differing from each other in the test statistic used, in the way the null hypothesis is modeled, and in their underlying assumptions.

Usually in a microarray design the number of replicate experiments is low, because microarray analysis is costly and time-consuming. This low rate of replication complicates statistical inference: because of the low power of the tests, detecting genes that show small but significant changes in expression is nearly impossible. The problem is exacerbated by the fact that many genes are usually tested simultaneously, which creates a serious multiple testing problem. Compensating for this in the classical way (e.g. Bonferroni correction) reduces the power of most statistical tests to an unrealistically low level. In this chapter, we will first describe different methods to identify differentially expressed genes. In a second part we will tackle the issue of multiple testing.
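To make the loss of power under a classical correction concrete, the following is a minimal sketch of the Bonferroni correction, in which the significance level is divided by the number of tests. The function names and the p-values are hypothetical and serve only as illustration.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance cutoff under Bonferroni correction."""
    return alpha / n_tests


def bonferroni_significant(p_values, alpha=0.05):
    """Flag each p-value that stays significant after the correction."""
    cutoff = bonferroni_threshold(alpha, len(p_values))
    return [p <= cutoff for p in p_values]


# Hypothetical p-values for five genes (illustration only)
p_values = [0.00001, 0.004, 0.03, 0.2, 0.8]
flags = bonferroni_significant(p_values)

# With 10,000 genes on an array, a single gene needs p <= 0.000005
per_test_cutoff = bonferroni_threshold(0.05, 10000)
print(flags, per_test_cutoff)
```

With only a few replicates per gene, p-values this small are rarely reached, which is why the correction is considered too stringent for microarray data.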

7.2. Identification of differentially expressed genes in a two sample design

The issue in a two-sample design is to establish whether each gene behaves differently in a control versus a treatment situation.

Different strategies have been described:

• Non-statistical tests: fold change (the fold test), used mainly when no replicates are available.

• Statistical tests: provided replicates are available, a plethora of novel methods to calculate a test statistic and the corresponding significance level has recently been proposed. Each of these methods first calculates a test statistic and subsequently determines the significance of the observed test statistic. Distinct t-test-like methods are available that differ from each other in the formula that describes the test statistic and in the assumptions regarding the distribution under the null hypothesis. t-test methods are used for detecting significant changes between repeated measurements of a variable in two groups. In the standard t-test, it is assumed that the data are sampled from a normal distribution with equal variances. For microarray data the number of repeats is too low to assess the validity of this assumption of normality. To overcome this problem, methods have been developed that estimate the distribution under the null hypothesis from the data itself by permutation or bootstrap analysis (SAM).

7.2.1 Fold Change:

The fold test is a non-statistical selection procedure that makes use of an arbitrarily chosen threshold. For each gene i an average log-ratio $M_i$ (arithmetic mean of the replicate log-ratios) can be calculated as:

$$M_i = \bar{I}_{i,test} - \bar{I}_{i,ref},$$

with $\bar{I}_{i,test}$ being the average of the log-transformed intensities for gene i in the test condition, and $\bar{I}_{i,ref}$ being the average of the log-transformed intensities for gene i in the reference condition.

Average log-ratios that exceed a certain threshold (usually chosen to correspond to a twofold expression ratio) are retained. The fold test is based on the intuition that a larger observed fold change can be more confidently interpreted as a stronger response to the environmental signal than smaller observed changes.
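The fold test can be sketched in a few lines of Python. This is only an illustration: the base-2 logarithm, the function names, the gene names and the intensities are assumptions, not part of the original text. On a log2 scale a twofold ratio corresponds to a threshold of 1.

```python
import math
from statistics import mean


def average_log_ratio(test_intensities, ref_intensities):
    """M_i: mean log2 test intensity minus mean log2 reference intensity."""
    mean_log_test = mean(math.log2(x) for x in test_intensities)
    mean_log_ref = mean(math.log2(x) for x in ref_intensities)
    return mean_log_test - mean_log_ref


def fold_test(genes, threshold=1.0):
    """Names of genes whose |M_i| exceeds the threshold (1.0 = twofold on log2 scale)."""
    selected = []
    for name, (test, ref) in genes.items():
        if abs(average_log_ratio(test, ref)) > threshold:
            selected.append(name)
    return selected


# Hypothetical raw intensities: (test replicates, reference replicates)
genes = {
    "geneA": ([400.0, 420.0, 380.0], [100.0, 105.0, 95.0]),   # about fourfold up
    "geneB": ([210.0, 190.0, 200.0], [200.0, 205.0, 195.0]),  # nearly unchanged
    "geneC": ([50.0, 45.0, 55.0], [200.0, 210.0, 190.0]),     # about fourfold down
}
selected = fold_test(genes)
print(selected)
```

Note that the selection uses only the averages; the spread of the replicates plays no role, which is the weakness of the fold test.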


Ad hoc: The reasoning behind the fold test implicitly assumes that the variance among replicates within treatments is the same for every gene.

Drawback: the threshold is chosen arbitrarily, whereas the confidence in a ratio should depend on the absolute values of the expression measurements.

• E.g. on the log scale a twofold threshold will be too conservative for high expression levels and too liberal for genes expressed at a low level.

• E.g. genes for which expression is off in one channel have an elevated expression ratio.

When no replicates are available, a fold test is still sometimes used, although better solutions exist.

When replicates are available a fold test is less appropriate, because it discards all information obtained from the replicates. For instance, when either one of the measured channels has a value close to zero, the log-ratio estimate is usually high but inconsistent (large variance across the replicates). If replicates are available, use a statistical test (see below, t-test or SAM).

7.2.2 T-test

Possible when the samples are replicated.

A t-test is a hypothesis test that assumes that the observations are drawn at random from a normal population and that employs a Student t-distributed test statistic for confidence interval estimation. The t-distribution describes the distribution of a normal variable, standardized with the sample variance $s^2$ as opposed to the population variance $\sigma^2$. It is used for hypothesis testing of normally distributed variables when the population variance $\sigma^2$ is unknown, in which case the sample variance $s^2$ is used as an estimator of $\sigma^2$. The t-test is more appropriate for statistical inference about the differential expression of a gene than a simple fold test, since it takes into account not only how much a gene is differentially expressed, but also the consistency of the individual measurements used to assess the average differential expression level.

The non-paired t-test evaluates whether the average expression level of a gene in the test condition is significantly different from its average expression level in the reference condition. The $H_0$ hypothesis states that the expression levels in the test and reference conditions are equal. The formula to compute the test statistic is depicted below. To calculate the within-sample variance of a regular non-paired t-test, the four observations of the test are used to estimate the mean expression level of the gene in the test condition. In the same way the four measurements of the reference are considered as a single group. The standard deviations ($s_{i,test}$, $s_{i,ref}$) are computed from the deviations of the individual measurements of a group from their respective group means ($\bar{I}_{i,test}$, $\bar{I}_{i,ref}$). Of course, when the within-group variance is calculated in this way, it intrinsically contains the consistent variations due to array and spot effects (the absolute expression values instead of the ratios are used to estimate the average differential expression level). This problem can be overcome by using a paired t-test.

The paired t-test is a special case of the two-sample t-tests of hypotheses that occurs when the observations on the two populations of interests are collected in pairs (in a cDNA microarray experiment, measurements of the Cy5 and Cy3 channel for a particular gene, assessed on the same array and the same spot, are paired).

The difference with an unpaired two-sample t-test is that both variables are presumed to be dependent. This translates into the incorporation of the covariance between both variables in the test statistic. As a result, a positive correlation within the pairs can cause the unpaired two-sample t-test to considerably understate the significance of the data if it is incorrectly applied to paired samples. Below is outlined how a paired t-test is calculated for spotted microarray data. For the computation of the variance, a pair of observations can be considered as a new variable. The within-group variation, as calculated by a paired t-test, evaluates the deviation of this new variable from its mean, taking into account the covariance between log-intensities obtained from the same spot. As such a paired t-test, in contrast to a regular non-paired t-test, intrinsically compensates for the variation over spots and arrays. The lower within-group variation increases the power of a paired t-test as compared to a regular t-test.


In practice, the advantage of a (paired) t-test is that smaller fold changes are considered significant for genes whose expression levels are measured with great accuracy (high consistency), and large fold changes are considered non-significant if expression levels were not measured accurately (low consistency).

Summarizing:

Model assumptions

• Normal distribution of the variables (i.e. the replicates should be normally distributed; to obtain this, a log transformation of the raw data is usually required).

• Under the null hypothesis $\mu_{test}$ and $\mu_{ref}$ are similar (data are normalized).

• For a cDNA array usually a paired t-test is performed (the red and green values for the same spot are paired). The test is performed on the difference of the log-transformed raw data of the paired samples (log(test) - log(ref) = log(test/ref)). This is the counterpart of the fold test.

Under the null hypothesis, $m_{ref}$ and $m_{test}$ are assumed to be equal.

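Non-paired t-test, equal group variances:

$$t_{unpaired} = \frac{m_{test} - m_{ref}}{\sqrt{s_{test}^2/n_{test} + s_{ref}^2/n_{ref}}}$$

Paired t-test:

$$T_0 = \frac{\bar{D}}{S_D/\sqrt{n}}$$

Here $\bar{D}$ and $S_D$ are the mean and the standard deviation of the n paired differences (the per-spot log-ratios). Empirical means and variances are used as estimates of the mean and standard deviation. t follows a Student distribution.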
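The effect of pairing can be illustrated with a small sketch that computes both statistics, the unpaired one from the two group means and variances and the paired one from the per-spot differences. The function names and the log intensities are hypothetical; the data were chosen so that the arrays differ strongly from each other while the log-ratio within each array is consistent.

```python
import math
from statistics import mean, stdev, variance


def unpaired_t(test, ref):
    """(m_test - m_ref) / sqrt(s_test^2/n_test + s_ref^2/n_ref)."""
    se = math.sqrt(variance(test) / len(test) + variance(ref) / len(ref))
    return (mean(test) - mean(ref)) / se


def paired_t(test, ref):
    """mean(D) / (S_D / sqrt(n)), with D the per-spot log-ratios."""
    d = [t - r for t, r in zip(test, ref)]
    return mean(d) / (stdev(d) / math.sqrt(len(d)))


# Hypothetical log2 intensities of one gene on four arrays (same spot, two channels)
log_test = [8.1, 9.0, 10.2, 11.1]
log_ref = [7.6, 8.6, 9.7, 10.5]

t_unpaired = unpaired_t(log_test, log_ref)
t_paired = paired_t(log_test, log_ref)
print(round(t_unpaired, 2), round(t_paired, 2))
```

In this example the array-to-array variation inflates the within-group variance of the unpaired test, so its statistic stays small, whereas the paired statistic is large because the differences are nearly constant across arrays.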

When t exceeds a certain threshold, which depends on the calculated degrees of freedom and the confidence level selected, the gene is considered to be differentially expressed. The p-value derived from the Student distribution describes the probability that the t-test statistic takes on a value at least as extreme as the observed value of the statistic when $H_0$ is true. If this p-value is sufficiently low, the null hypothesis is rejected and both values of m are considered significantly different.

Advantage: smaller fold changes are significant for genes whose expression levels are measured with great accuracy, and large fold changes are non-significant for genes whose expression level cannot be measured very accurately.

Drawbacks:

• The number of replicates is often too small for a t-test to be powerful.

• A t-test on ratios loses information: the variability depends on the intensities of both channels. When the ratio is used as estimator this information is lost.

• Model assumptions might not be met (deviations from the normal distribution).

Availability:

Implemented in the Cyber-T software: http://genomics.biochem.uci.edu/genex/cybert/

o Adapted t-test for zero values in either the control or the treated sample. Zero values are replaced by the lowest reliable detected value (average of the two 0.25 % quantiles associated with the

o t-test implementation for paired data, i.e. red and green from the same array, because the variance is lower

7.2.3 SAM (Significance analysis of microarrays)

SAM (Significance Analysis of Microarrays) is another method for the analysis of paired or unpaired black and white experiments. SAM calculates for each gene a modified t(i) statistic, called the relative difference and referred to as d(i) in the original article. The difference between a t-test statistic t(i) and the d(i) values calculated by SAM is the constant term $s_0$, used to compensate for the dependency of the distribution of d(i) on the measured expression level.

d(i) is the relative difference in gene expression level:

$$d(i) = \frac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0}$$

where $\bar{x}_I(i)$ and $\bar{x}_U(i)$ are the average levels of gene i in states I and U. s(i) is the gene-specific scatter, i.e. the standard deviation of repeated expression measurements:

$$s(i) = \sqrt{a\left\{\sum_m [x_m(i) - \bar{x}_I(i)]^2 + \sum_n [x_n(i) - \bar{x}_U(i)]^2\right\}}, \qquad a = \frac{1/n_1 + 1/n_2}{n_1 + n_2 - 2}$$

where $\sum_m$ and $\sum_n$ are the summations over the expression measurements in states I and U respectively, and $n_1$ and $n_2$ are the numbers of measurements in states I and U. At low expression levels the variance of d(i) can be high because of small values of s(i). To make d(i) independent of the expression level, a small positive correction coefficient $s_0$ is added to the denominator (comparable to a pseudocount). As such the distribution of d(i) becomes independent of the gene expression level, which allows comparison of d(i) across all genes. To choose $s_0$, the coefficient of variation of d(i) was computed as a function of s(i) in moving windows across the data, and the value of $s_0$ was chosen to minimize this coefficient of variation.
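The role of $s_0$ can be illustrated with a short sketch of the relative difference for a single gene. It assumes the pooled form of the gene-specific scatter s(i) given by Tusher et al. (2001); the function name, the measurements and the fixed value of s0 are hypothetical (in SAM, s0 is derived from the data rather than set by hand).

```python
import math
from statistics import mean


def relative_difference(state_i, state_u, s0):
    """d(i) = (mean_I - mean_U) / (s(i) + s0) for one gene."""
    n1, n2 = len(state_i), len(state_u)
    mean_i, mean_u = mean(state_i), mean(state_u)
    # Sum of squared deviations from the group means, pooled over both states
    ss = sum((x - mean_i) ** 2 for x in state_i) + sum((x - mean_u) ** 2 for x in state_u)
    a = (1.0 / n1 + 1.0 / n2) / (n1 + n2 - 2)
    s = math.sqrt(a * ss)  # gene-specific scatter s(i)
    return (mean_i - mean_u) / (s + s0)


# Hypothetical weakly expressed gene: tiny difference, tiny scatter
state_i = [0.11, 0.10, 0.12]
state_u = [0.10, 0.09, 0.11]

d_without_s0 = relative_difference(state_i, state_u, s0=0.0)
d_with_s0 = relative_difference(state_i, state_u, s0=0.1)
print(round(d_without_s0, 3), round(d_with_s0, 3))
```

Without s0 the tiny scatter of this weakly expressed gene inflates the statistic; adding s0 to the denominator shrinks it towards zero.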

Genes are ranked according to their d(i) value: the higher the absolute d(i) value, the more likely it is that the gene is differentially expressed. Instead of calculating a p-value using a Student t-distribution, genes called differentially expressed are identified by performing a permutation analysis. New random datasets are generated by permuting the original data. In these permuted datasets, none of the genes is differentially expressed. The d(i) values in these randomized datasets are calculated, ranked, and subsequently used to infer the expected differences, i.e. the d(i) value that can be expected if a gene is not differentially expressed.

By using a scatter plot (see below), ranked d(i) values of the experimental dataset are compared to ranked expected d(i) values.

The delta value $\Delta$, a user-specified parameter, determines the number of significant, differentially expressed genes; it expresses how much the measured d(i) value should exceed the expected one in order to consider a gene differentially expressed ($\Delta$ is measured as a displacement of the d(i) value from the $d(i) = d_{expected}(i)$ line).
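The comparison of ranked observed and expected values can be sketched as follows. This is a simplified illustration, not the exact SAM procedure: every gene whose ranked d(i) lies more than delta away from the expected value of the same rank is called, and the expected value per rank is taken as the average of the ranked d(i) values over the permuted datasets. All names and numbers are hypothetical.

```python
from statistics import mean


def expected_differences(permuted_d):
    """Average the ranked d(i) values over the permuted datasets, rank by rank."""
    ranked = [sorted(d_values) for d_values in permuted_d]
    return [mean(same_rank) for same_rank in zip(*ranked)]


def call_significant(observed_d, expected_d, delta):
    """Observed d(i) values lying more than delta away from the d(i) = d_expected(i) line."""
    pairs = zip(sorted(observed_d), sorted(expected_d))
    return [obs for obs, exp in pairs if abs(obs - exp) > delta]


# Hypothetical d(i) values of four genes in two permuted datasets
permuted_d = [
    [0.5, -1.0, 0.1, 1.2],
    [-0.8, 0.2, 1.0, -0.1],
]
observed_d = [3.0, 0.1, -0.8, 0.4]

expected_d = expected_differences(permuted_d)
called = call_significant(observed_d, expected_d, delta=1.0)
print(expected_d, called)
```

Raising delta makes the selection more stringent, so fewer genes are called.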

The number of false positives can be estimated as the number of genes in the permuted dataset for which the d(i) value exceeds the lowest d(i) value that was considered significant for a given setting of the delta slider. Permutation analysis overcomes the need for a high number of replicates and is used as an alternative to correction for multiple testing. The setting of the delta slider allows choosing a trade-off between the number of false positives (type I error) and the number of false negatives (type II error). The lower the number of false positives, the more stringent the test and the fewer genes will be retained as significant.

Scatterplots are generated of the observed relative differences versus the expected relative differences.

Compare the observed relative difference d(i) with the expected relative difference $d_E(i)$. For most of the genes $d_E(i) = d(i)$, but some genes are removed from the $d_E(i) = d(i)$ line by more than a threshold distance (the tuning parameter $\Delta$). These genes are considered significant.


For each value of the tuning parameter $\Delta$ an estimate of the FDR is calculated. The false discovery rate (FDR) is defined as the percentage of genes with a d(i) score beyond the threshold that were identified by chance.

To estimate the FDR, nonsense genes are identified by analyzing permutations of the measurements.

Define the horizontal cut-offs: the smallest d(i) among the genes called significantly induced, and the least negative d(i) among the genes called significantly repressed. Count the number of genes in each permutation whose d(i) exceeds the cut-off for induction or repression. The estimated number of falsely significant genes is the mean of these counts over the different permutations; dividing it by the number of genes called significant gives the FDR.
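The counting step can be sketched as below, assuming the d(i) values of the real data and of each permuted dataset have already been computed. The function name, the cut-offs and all d(i) values are hypothetical.

```python
from statistics import mean


def estimate_fdr(observed_d, permuted_d, upper_cut, lower_cut):
    """Return (number of genes called, mean false count per permutation, FDR)."""

    def beyond_cutoffs(d):
        return d >= upper_cut or d <= lower_cut

    n_called = sum(1 for d in observed_d if beyond_cutoffs(d))
    # Genes passing a cut-off in a permuted dataset are false positives by construction
    false_counts = [sum(1 for d in perm if beyond_cutoffs(d)) for perm in permuted_d]
    mean_false = mean(false_counts)
    fdr = mean_false / n_called if n_called else 0.0
    return n_called, mean_false, fdr


# Hypothetical d(i) values for six genes and four permuted datasets
observed_d = [4.2, 3.1, 0.4, -0.2, -0.9, -3.5]
permuted_d = [
    [3.3, 1.0, 0.2, -0.1, -1.2, -2.0],
    [1.5, 0.8, 0.1, -0.3, -1.0, -3.6],
    [2.0, 0.9, 0.0, -0.4, -0.8, -1.9],
    [1.1, 0.7, 0.3, -0.2, -0.6, -1.4],
]

# Cut-offs: smallest induced d(i) and least negative repressed d(i) among the called genes
n_called, mean_false, fdr = estimate_fdr(observed_d, permuted_d, upper_cut=3.1, lower_cut=-3.5)
print(n_called, mean_false, round(fdr, 3))
```

Here three genes are called and on average half a gene per permutation passes a cut-off, giving an estimated FDR of about 17 %.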

Tusher VG, Tibshirani R, Chu G. Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A. 2001 Apr 24;98(9):5116-21.

7.2.4 Comparison t test and SAM

T-test:

$$z_i = \frac{\bar{y}_{i1} - \bar{y}_{i2}}{\sqrt{s_{i1}^2/n_1 + s_{i2}^2/n_2}}$$

Paired t-test:

$$z_i = \frac{\bar{y}_{i1} - \bar{y}_{i2}}{\sqrt{\left(s_{i1}^2 + s_{i2}^2 - 2\,\mathrm{cov}(y_{i1}, y_{i2})\right)/n}}$$

SAM:

$$z_i = \frac{\bar{y}_{i1} - \bar{y}_{i2}}{s_{ip}\sqrt{1/n_1 + 1/n_2} + s_0}$$

Test statistic | Assumptions | Distribution H0
T-test, paired t-test | Errors normally distributed with equal variance (iid); with a restricted number of repeat measurements it is impossible to evaluate this assumption | Parametrized: Student t-distribution
SAM | No explicit assumption (less stringent assumption) | Order statistics


In the picture above, genes are shown that were selected as differentially expressed by 1) all tests, 2) SAM, 3) the t-test, 4) ANOVA. Blue represents points selected on the first array, red on the second array. Each gene is represented by 4 points. Note that genes selected by the t-test are consistently measured but almost not differentially expressed.
