Yves Moreau and Janick Mathys Master of Bioinformatics Katholieke Universiteit Leuven 2001-2002

(1)


(2)

- Genome-wide monitoring of gene activities by measurement of the levels of RNA transcripts
- Massively parallel
- Fully automated
- Standardizable

(3)

- Identification of regulatory mechanisms
- Modeling of genetic networks for function identification and transgenesis
- Drug development: study gene expression in
  - Disease
  - Model systems
  - Pathogens
  - Response to drug treatment
- Diagnosis
- Survey of Single Nucleotide Polymorphisms

(4)

[Figure: from NOW to FUTURE. Labels: Expression Monitoring, Functional Genomics, SNPs, Diagnostics, Treatment, Follow-up, Prediction, Preventative Medicine]

(5)

(6)

- Nylon membrane
- Spotted cDNA

(7)

- 5000 cDNAs (genes, ESTs) spotted on array
  - Glass slide
  - Cy3, Cy5 labeling of samples

(8)

(9)

- Silicon substrate
- In situ synthesis of oligonucleotides (25 bases) by photolithography
- Multiple probes per gene (match + mismatch)

(10)

- Glass slide
- Inkjet spotting
  - cDNA clones (no contact = better liquid handling)
  - OR in situ synthesis of long oligonucleotides (60 bases)
- Long oligos more specific than short oligos

(11)

- Filters and oligonucleotide arrays
  - Single measurement per feature
  - Absolute expression level
- cDNA arrays
  - Two samples per experiment
    - Test
    - Control
  - Measurement is ratio of gene expression in test sample vs. control sample
  - Better reproducibility

(12)

- Wild-type vs. mutant
  - Knock-out, conditional knock-out
  - Overexpression construct, inducible overexpression
  - Detection of the targets of a transcription factor
- Groups of patient samples
  - Different types of tumors
- Multiple conditions
  - Expression patterns in the presence of drugs or toxins
- Time-course experiment
  - Response to a signal
  - Stress

(13)

(14)

- Raw images

(15)

- Superimposed two channels to form color image
  - Red spot: gene was only expressed in test sample
  - Green: gene was only expressed in control sample
  - Yellow: gene was expressed both in test and in control sample

(16)

- Spot detection
  - Spots are not perfectly circular
  - Thresholding
    - Fixed threshold method
      - Threshold T is derived from local background mean intensity m and standard deviation σ by the relationship T = m + 3σ
      - Problems due to variability of background and target signal, particularly with weak signals (frequent finding in cDNA array experiments)
    - Mann-Whitney method
      - Take sample pixels from background and perform rank-sum hypothesis test on target pixels
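The fixed threshold rule T = m + 3σ can be illustrated with a minimal sketch. The function names, the toy pixel values, and the use of the population standard deviation are assumptions made for illustration; they are not taken from the slides.

```python
from statistics import mean, pstdev

def fixed_threshold(background_pixels):
    """Return T = m + 3*sigma computed from local background pixels."""
    m = mean(background_pixels)
    sigma = pstdev(background_pixels)  # population SD (an assumption)
    return m + 3 * sigma

def segment_spot(target_pixels, background_pixels):
    """Keep only the target pixels whose intensity exceeds the threshold."""
    t = fixed_threshold(background_pixels)
    return [p for p in target_pixels if p > t]

# Toy gray levels (hypothetical values)
background = [10, 12, 11, 9, 13, 10, 11, 12]
target = [11, 14, 60, 75, 80, 12, 90]

threshold = fixed_threshold(background)
spot = segment_spot(target, background)
```

With a noisy background or a weak signal, σ grows and true spot pixels fall below T, which is the weakness the slide attributes to this method.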

(17)

- Background intensity extraction
  - Background is not uniform ⇒ extract local background intensities
  - Gray level histogram
    - Mean local background intensity
    - Standard deviation of the local background intensity
- ⇒ Treat local red background intensity and local green

(18)

- Spot intensity extraction
  - Unite target regions detected for red and green channel
  - Probe intensity = average gray level within target region for R and G separately
  - Subtract local background values from reported probe intensities for R and G separately
    - R intensity = raw R intensity - local R background
    - G intensity = raw G intensity - local G background
  - Fluorescent intensities = significant if average spot intensity is at least 2 standard deviations higher than corresponding average background intensity
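As a rough sketch of the background correction and the 2-standard-deviation significance rule for one channel: the function name `spot_intensity`, the parameter `n_sd`, and the pixel values are hypothetical, and the standard deviation is assumed to be that of the local background pixels.

```python
from statistics import mean, pstdev

def spot_intensity(spot_pixels, background_pixels, n_sd=2.0):
    """Return (background-corrected intensity, is_significant) for one channel."""
    raw = mean(spot_pixels)                 # average gray level in target region
    bg_mean = mean(background_pixels)       # mean local background
    bg_sd = pstdev(background_pixels)       # SD of local background (assumed)
    corrected = raw - bg_mean               # intensity = raw - local background
    significant = raw >= bg_mean + n_sd * bg_sd
    return corrected, significant

# Toy data: R and G channels are treated separately
red_bg = [10, 12, 11, 9, 13, 10, 11, 12]
green_bg = [10, 12, 11, 9, 13, 10, 11, 12]

r_value, r_ok = spot_intensity([60, 75, 80, 90], red_bg)
g_value, g_ok = spot_intensity([12, 13, 12, 13], green_bg)
```

In this toy case the red spot passes the test while the green spot, barely above background, does not.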

(19)

(20)

- Calculation of the relative (R/G) expression ratios
  - Mean or median R intensity / mean or median G intensity
  - Mean or median of ratios of R/G intensities from every pixel location
  - Linear regression slope of R-G gray-values from every pixel location
- !!! Ideally all genes should be expressed in the control sample (G)
  - Not true in practice
  - Intensity G = 0 ⇒ ratio = intensity R / 0 (! missing value)
  - Reverse labeling can help here (adjust for variations between R and G
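The three ways of computing an R/G ratio listed above can be compared on toy data. This is only a sketch: the function names are hypothetical, the regression is assumed to be ordinary least squares of R on G with an intercept, and a missing value is represented by `None`.

```python
from statistics import mean, median

def ratio_of_means(r, g):
    """Mean R intensity / mean G intensity; None marks a missing value."""
    g_mean = mean(g)
    return mean(r) / g_mean if g_mean > 0 else None

def median_of_ratios(r, g):
    """Median of the pixel-wise R/G ratios, skipping pixels with G = 0."""
    ratios = [ri / gi for ri, gi in zip(r, g) if gi > 0]
    return median(ratios) if ratios else None

def regression_slope(r, g):
    """Least-squares slope of R against G over all pixel locations."""
    g_bar, r_bar = mean(g), mean(r)
    sxx = sum((gi - g_bar) ** 2 for gi in g)
    sxy = sum((gi - g_bar) * (ri - r_bar) for gi, ri in zip(g, r))
    return sxy / sxx if sxx > 0 else None

# Background-corrected pixel intensities (hypothetical values)
r_pixels = [20, 40, 60, 80]
g_pixels = [10, 20, 30, 40]

estimates = (
    ratio_of_means(r_pixels, g_pixels),
    median_of_ratios(r_pixels, g_pixels),
    regression_slope(r_pixels, g_pixels),
)
missing = ratio_of_means([5, 5], [0, 0])  # gene not expressed in control
```

On clean data all three estimates agree; they diverge when a few pixels are saturated or contaminated, which is where the median-based variants are usually more stable.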

(21)

- Rescaling of different slides (experiments)
  - Different labeling quality between slides can result in systematic bias in the expression levels
  - Linear regression (or smoothing curve) of all genes present in both slides (or housekeeping genes)
  - Rescaling factor = 1 / slope of regression; new intensity = old intensity * rescaling factor
- Filtering of irrelevant genes
  - Retain genes with R/G ratios ≥ 2 or 3 in at least two experiments
  - Remove genes with missing values in x% of the experiments
  - Remove low variance / standard genes (housekeeping genes)
- Log₂ transformation of the R/G ratios: log scale is more intuitive

         2-fold induction    constant expression    2-fold repression
  R/G    2                   1                      1/2
  ...
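A small sketch of the rescaling step and of the log₂ transformation. The regression is assumed here to be a least-squares line through the origin fitted to the genes shared by both slides; the slides do not specify this, and the function name and intensities are hypothetical.

```python
from math import log2

def rescaling_factor(reference, other):
    """Fit other ~ slope * reference (line through the origin, an assumption)
    and return the rescaling factor 1 / slope."""
    slope = sum(x * y for x, y in zip(reference, other)) / sum(x * x for x in reference)
    return 1.0 / slope

# Intensities of the same genes on two slides (hypothetical values);
# slide B is systematically 1.5 times brighter than slide A.
slide_a = [100.0, 200.0, 400.0]
slide_b = [150.0, 300.0, 600.0]

factor = rescaling_factor(slide_a, slide_b)
slide_b_rescaled = [y * factor for y in slide_b]  # new = old * rescaling factor

# Log2 transformation makes induction and repression symmetric around 0
log_ratios = [log2(ratio) for ratio in (2, 1, 0.5)]
```

After the transformation a 2-fold induction and a 2-fold repression have the same magnitude and opposite sign, which is the sense in which the log scale is more intuitive.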

(22)

- Single-slide analysis
  - Identify genes that are expressed in control and not in test sample
  - Identify genes that are expressed in test sample only
  - ⇒ condition-specific genes
- Analysis of multiple slides
  - Cluster analysis of gene expression profiles
  - Partitions samples or genes into well-separated and

(23)

- Which genes are considered up- or downregulated?
- VIB-MAF: mouse arrays, duplicate spots on the same slide
  - Frequency distribution: ratio of left spot against ratio of right spot for each gene
  - Average m = 1 as expected
  - Variation around m less than threefold
  - ⇒ Less than threefold differential expression levels may not be statistically reliable

(24)

(25)

(26)

(27)

- PCA detects the directions that capture the most information about the data

[Figure: two panels, each labeled "original coordinate system"]

(28)

- PCA is performed by a linear transformation of the data set based on the Singular Value Decomposition (SVD)
- PCA can be applied to genes as well as samples
  - “Eigengenes”
  - “Eigensamples”
- Successive principal components capture less and less information about the data
- We can truncate the representation of the data to a limited number of principal components = dimensionality reduction
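A minimal sketch of PCA through the SVD, using NumPy. The data orientation (rows = genes, columns = experiments), the function name `pca_svd`, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def pca_svd(data, n_components):
    """PCA of a matrix with rows = genes, columns = experiments (assumed layout).

    Returns the truncated scores, the principal directions, and the
    fraction of variance captured by every component.
    """
    centered = data - data.mean(axis=0)            # center each column
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)            # decreasing by construction
    scores = u[:, :n_components] * s[:n_components]
    return scores, vt[:n_components], explained

# Synthetic data: 50 genes, 3 experiments that are almost collinear
rng = np.random.default_rng(0)
t = rng.normal(size=50)
data = np.column_stack([t, 2 * t, -t]) + 0.01 * rng.normal(size=(50, 3))

scores, components, explained = pca_svd(data, n_components=1)
```

Because the three columns are nearly multiples of one another, the first component should carry almost all of the variance, so keeping a single component loses very little information.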

(29)

- Preprocessing of the data
  - Mean (or median) centering: mean (or median) of ratios = 0
  - Normalization: standard deviation of ratios = 1
  - Reduces noise caused by differences in RNA yield, labeling efficiency and image analysis
- Cluster algorithms
  - Hierarchical clustering
  - K-means clustering
  - Self-Organizing Maps
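The preprocessing step (mean centering followed by normalization to unit standard deviation) could look like this for a single expression profile. The function name and the example values are hypothetical, and the population standard deviation is an assumed choice.

```python
from statistics import mean, pstdev

def standardize(profile):
    """Mean-center a profile of log2 ratios and scale it to unit SD."""
    m = mean(profile)
    sd = pstdev(profile)        # population SD (an assumption)
    if sd == 0:
        # A flat profile carries no shape information; return zeros
        return [0.0 for _ in profile]
    return [(x - m) / sd for x in profile]

profile = [1.0, 0.0, -1.0, 2.0]   # log2(R/G) over four experiments
z = standardize(profile)
```

After this step two genes with the same shape of profile but different amplitudes become identical, so the clustering compares patterns rather than absolute levels.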

(30)

- Evaluate pairwise similarities between genes
- For 2 genes X = (x₁, x₂, ..., x_m) and Y = (y₁, y₂, ..., y_m), the Pearson correlation coefficient (= similarity) is

  $$ r = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i - \bar{X}}{\sigma_X} \right) \left( \frac{Y_i - \bar{Y}}{\sigma_Y} \right) $$

  where N = m is the number of experiments, X_i and Y_i are the i-th measurements of X and Y, and σ_X and σ_Y are the standard deviations of X and Y
- Distance = 1 - similarity
- ⇒ Distance matrix:

          gene 1               gene 2               gene 3               ...   gene n
  gene 1  0                                                              ...
  gene 2  Dist(gene1,gene2)    0                                         ...
  gene 3  Dist(gene1,gene3)    Dist(gene2,gene3)    0                    ...
  ...     ...                  ...                  ...                  ...
  gene n  Dist(gene1,gene n)   Dist(gene2,gene n)   Dist(gene3,gene n)   ...   0
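To make the distance matrix concrete, here is a small sketch that computes the Pearson correlation and the distance 1 - r for every pair of genes. The function names and profiles are hypothetical, and the 1/N normalization of the standard deviation is an assumption.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two profiles (1/N normalization assumed).
    Raises ZeroDivisionError for a constant profile."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / n

def distance_matrix(profiles):
    """Symmetric matrix of distances 1 - r with zeros on the diagonal."""
    n = len(profiles)
    return [
        [0.0 if i == j else 1.0 - pearson(profiles[i], profiles[j]) for j in range(n)]
        for i in range(n)
    ]

genes = [
    [1.0, 2.0, 3.0, 4.0],   # gene 1
    [2.0, 4.0, 6.0, 8.0],   # gene 2: same shape as gene 1
    [4.0, 3.0, 2.0, 1.0],   # gene 3: opposite shape
]
d = distance_matrix(genes)
```

Perfectly correlated genes end up at distance 0 and perfectly anti-correlated genes at distance 2, so the distance ranges over [0, 2].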

(31)

- Find pair of genes with shortest distance (e.g., gene 2 and gene 3)
- Substitute ratios for gene 2 and gene 3 in data matrix by average ratios for gene 2 and gene 3 combined
- ⇒ Data matrix:

              exp1                             exp2                             ...   expm
  gene 1      log2(R1/G1)                      log2(R1/G1)                      ...   log2(R1/G1)
  gene 2 + 3  (log2(R2/G2) + log2(R3/G3)) / 2  (log2(R2/G2) + log2(R3/G3)) / 2  ...   (log2(R2/G2) + log2(R3/G3)) / 2
  ...         ...                              ...                              ...   ...

- Recalculate pairwise similarities

(32)

- ⇒ Distance matrix:

              gene 1                gene 2 + 3   ...   gene n
  gene 1      0                                  ...
  gene 2 + 3  Dist(gene1,gene2+3)   0            ...
  ...         ...                   ...          ...

- Repeat until distance matrix contains 1 element
- Visualization: dendrogram (= tree)
- Length of each branch is indicative of distance between clusters
  - Average distance between centers
  - Maximal distance between clusters
  - Minimal distance between clusters
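The merge-and-recompute loop of hierarchical clustering can be sketched as follows. This is an illustrative version only: the function names and profiles are hypothetical, distance is taken as 1 - Pearson correlation, and a merged cluster is represented by the plain average of the two merged profiles, which does not weight by cluster size.

```python
from math import sqrt

def pearson_distance(x, y):
    """Distance = 1 - Pearson correlation of two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 1.0 - sxy / sqrt(sxx * syy)

def agglomerate(profiles):
    """Repeatedly merge the closest pair until one cluster is left.
    Returns the merges as (name_a, name_b, distance) in order."""
    clusters = dict(profiles)
    merges = []
    while len(clusters) > 1:
        names = list(clusters)
        best = None
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                dist = pearson_distance(clusters[names[i]], clusters[names[j]])
                if best is None or dist < best[2]:
                    best = (names[i], names[j], dist)
        a, b, dist = best
        # Replace the pair by the average profile (unweighted, an assumption)
        merged = [(u + v) / 2 for u, v in zip(clusters.pop(a), clusters.pop(b))]
        clusters[f"({a}+{b})"] = merged
        merges.append((a, b, dist))
    return merges

profiles = {
    "gene1": [1.0, -1.0, 0.5, -0.5],
    "gene2": [0.2, 1.0, -1.0, 0.1],
    "gene3": [0.3, 0.9, -1.1, 0.2],
}
merges = agglomerate(profiles)
```

The recorded merge distances are what a dendrogram draws as branch lengths: gene 2 and gene 3 join first at a small distance, and gene 1 joins the combined cluster last.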

(33)

- Visual determination

(34)

- Distance measure influences outcome of clustering
- Lack of robustness
- Nonunique

(35)

1. Predefined number of clusters = 5; initialisation: randomly choose cluster centers (red points)
2. Attribute each point (gene) to cluster with closest center
3. Recalculate cluster centers = mean expression profile of genes in cluster
4. Repeat steps 2 and 3 until the centers remain stationary, i.e. no points receive a new assignment
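The four steps above can be sketched in a few lines. This is a minimal illustration, assuming squared Euclidean distance and initial centers drawn from the data points; the function name `kmeans`, the `seed` and `max_iter` parameters, and the toy points are hypothetical.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means: returns (centers, assignment)."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]   # step 1: random centers
    assignment = None
    for _ in range(max_iter):
        # Step 2: attribute each point to the cluster with the closest center
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centers[c])))
            for point in points
        ]
        if new_assignment == assignment:   # step 4: nothing changed, stop
            break
        assignment = new_assignment
        # Step 3: recalculate each center as the mean profile of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                    # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment

# Two well-separated groups of toy expression profiles
points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, assignment = kmeans(points, k=2)
```

On data that is less cleanly separated the result depends on the random initial centers, so it is common to rerun with several seeds and compare the partitions.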

(36)

- Visualization: gene expression profiles per cluster
- Drawbacks
  - Number of clusters is predefined
  - All genes are assigned to a cluster
  - Lack of robustness

(37)

- Choose a geometry of nodes (e.g., 3 x 2 rectangular grid)
- Nodes are mapped into m-dimensional space
- A data point p is selected randomly and the nodes are moved in the direction of p
  - Node closest to p (Np) is moved most
  - Other nodes are moved by smaller amounts, depending on their distance to Np in the initial grid
- 20,000 - 50,000 iterations
- Nodes = cluster centers and neighbouring
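A rough sketch of the self-organizing map update on a 3 x 2 grid. The Gaussian neighbourhood function, the linear decay of the learning rate and radius, and all names and constants below are assumptions chosen for illustration; the slides only state that nodes farther from Np in the grid move less.

```python
import math
import random

def train_som(data, rows=3, cols=2, iterations=20000, seed=0):
    """Train a small SOM; returns (grid coordinates, node positions)."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Map the nodes into data space by starting them on random data points
    nodes = [list(rng.choice(data)) for _ in grid]
    for t in range(iterations):
        frac = 1.0 - t / iterations
        rate = 0.5 * frac + 0.01      # learning rate (assumed schedule)
        radius = 1.5 * frac + 0.3     # neighbourhood width in grid units (assumed)
        p = rng.choice(data)          # randomly selected data point
        # Np: the node closest to p in data space
        winner = min(range(len(nodes)),
                     key=lambda k: sum((a - b) ** 2 for a, b in zip(nodes[k], p)))
        wr, wc = grid[winner]
        for k, (r, c) in enumerate(grid):
            grid_d2 = (r - wr) ** 2 + (c - wc) ** 2
            h = math.exp(-grid_d2 / (2 * radius ** 2))   # 1 for Np, smaller farther away
            nodes[k] = [a + rate * h * (b - a) for a, b in zip(nodes[k], p)]
    return grid, nodes

data = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
grid, nodes = train_som(data, iterations=2000)
```

Each update moves a node part of the way towards p, so the nodes never leave the range spanned by the data, and nodes that are neighbours in the grid end up near similar profiles.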

(38)