Yves Moreau and Janick Mathys
Master of Bioinformatics
Katholieke Universiteit Leuven
2001-2002
- Genome-wide monitoring of gene activities by measurement of the levels of RNA transcripts
- Massively parallel
- Fully automated
- Standardizable
- Identification of regulatory mechanisms
- Modeling of genetic networks for function identification and transgenesis
- Drug development: study gene expression in
  - Disease
  - Model systems
  - Pathogens
  - Response to drug treatment
- Diagnosis
- Survey of Single Nucleotide Polymorphisms
[Figure: NOW vs. FUTURE. Labels: Expression Monitoring, Functional Genomics, SNPs, Diagnostics, Treatment, Follow-up, Prediction, Preventative Medicine]
- Nylon membrane
- Spotted cDNA
- 5000 cDNAs (genes, ESTs) spotted on array
  - Glass slide
  - Cy3, Cy5 labeling of samples
  - Silicon substrate
  - In situ synthesis of oligonucleotides (25 bases) by photolithography
  - Multiple probes per gene (match + mismatch)
- Glass slide
- Inkjet spotting
  - cDNA clones (no contact = better liquid handling)
  - OR in situ synthesis of long oligonucleotides (60 bases)
- Long oligos are more specific than short oligos
- Filters and oligonucleotide arrays
  - Single measurement per feature
  - Absolute expression level
- cDNA arrays
  - Two samples per experiment
    - Test
    - Control
  - Measurement is the ratio of gene expression in test sample vs. control sample
  - Better reproducibility
- Wild-type vs. mutant
  - Knock-out, conditional knock-out
  - Overexpression construct, inducible overexpression
  - Detection of the targets of a transcription factor
- Groups of patient samples
  - Different types of tumors
- Multiple conditions
  - Expression patterns in the presence of drugs or toxins
- Time-course experiment
  - Response to a signal
  - Stress
- Raw images
- Two channels superimposed to form a color image
  - Red spot: gene was expressed only in the test sample
  - Green spot: gene was expressed only in the control sample
  - Yellow spot: gene was expressed in both the test and the control sample
- Spot detection
  - Spots are not perfectly circular
  - Thresholding
  - Fixed threshold method
    - Threshold T is derived from the local background mean intensity m and standard deviation σ by the relationship T = m + 3σ
    - Problems due to variability of background and target signal, particularly with weak signals (a frequent finding in cDNA array experiments)
  - Mann-Whitney method
    - Take sample pixels from the background and perform a rank-sum hypothesis test on the target pixels
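The two spot-detection rules above can be sketched as follows. This is a minimal illustration, not the image-analysis software used for the slides: the function names, the use of the population standard deviation, and the normal approximation of the rank-sum statistic (without tie correction) are all assumptions.

```python
from math import sqrt
from statistics import mean, pstdev

def fixed_threshold(background_pixels):
    """Threshold T = m + 3*sigma from the local background pixels."""
    m = mean(background_pixels)
    sigma = pstdev(background_pixels)  # assumption: population standard deviation
    return m + 3 * sigma

def threshold_mask(region_pixels, background_pixels):
    """Mark each pixel of the region as target (True) if it exceeds T."""
    t = fixed_threshold(background_pixels)
    return [p > t for p in region_pixels]

def rank_sum_z(target_pixels, background_pixels):
    """Normal approximation of the Mann-Whitney (rank-sum) statistic.

    A large positive z suggests that target pixels are brighter than
    the sampled background pixels. Ties count for one half.
    """
    n1, n2 = len(target_pixels), len(background_pixels)
    u = sum((t > b) + 0.5 * (t == b)
            for t in target_pixels for b in background_pixels)
    expected = n1 * n2 / 2
    spread = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - expected) / spread

# Hypothetical gray levels for illustration only
background = [10, 12, 11, 9, 10, 12, 11, 9]
region = [10, 11, 40, 55, 60, 12]
mask = threshold_mask(region, background)
z = rank_sum_z([40, 55, 60], background)
```

With weak signals the fixed threshold can miss a spot that the rank-based test would still flag, because the rank test compares distributions rather than single pixels against one cut-off.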
- Background intensity extraction
  - Background is not uniform ⇒ extract local background intensities
  - Gray level histogram
  - Mean local background intensity
  - Standard deviation of the local background intensity
- ⇒ Treat local red background intensity and local green background intensity separately
- Spot intensity extraction
  - Unite target regions detected for red and green channel
  - Probe intensity = average gray level within target region, for R and G separately
  - Subtract local background values from reported probe intensities, for R and G separately
    - R intensity = raw R intensity - local R background
    - G intensity = raw G intensity - local G background
  - Fluorescent intensities are significant if the average spot intensity is at least 2 standard deviations higher than the corresponding average background intensity
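The background subtraction and the significance rule above can be sketched per channel as follows. The function names and the example gray levels are hypothetical; the factor of 2 standard deviations is the one stated above.

```python
from statistics import mean

def spot_intensity(target_pixels, bg_mean):
    """Average gray level in the target region minus the local background."""
    return mean(target_pixels) - bg_mean

def is_significant(target_pixels, bg_mean, bg_sd, k=2.0):
    """True if the average spot intensity is at least k background
    standard deviations above the average background intensity."""
    return mean(target_pixels) >= bg_mean + k * bg_sd

# Hypothetical values for one spot, red (R) and green (G) channel
red_pixels, red_bg_mean, red_bg_sd = [120, 130, 125], 20, 5
green_pixels, green_bg_mean, green_bg_sd = [28, 30, 26], 22, 4

r = spot_intensity(red_pixels, red_bg_mean)
g = spot_intensity(green_pixels, green_bg_mean)
r_ok = is_significant(red_pixels, red_bg_mean, red_bg_sd)
g_ok = is_significant(green_pixels, green_bg_mean, green_bg_sd)
```

In this example the red channel passes the test and the green channel does not, so the resulting R/G ratio for this spot would be unreliable.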
- Calculation of the relative (R/G) expression ratios
  - Mean or median R intensity / mean or median G intensity
  - Mean or median of ratios of R/G intensities from every pixel location
  - Linear regression slope of R-G gray values from every pixel location
- !!! Ideally all genes should be expressed in the control sample (G)
  - Not true in practice
  - Intensity G = 0 ⇒ ratio = intensity R / 0 (! missing value)
  - Reverse labeling can help here (adjust for variations between R and G)
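The three ways of computing an R/G ratio listed above can be compared on the same pixels. This is a sketch under assumptions: the pixel values are background-corrected, the function names are hypothetical, and a zero G signal is reported as `None` to represent a missing value.

```python
from statistics import mean, median

def ratio_of_means(r_pixels, g_pixels):
    """Mean R intensity divided by mean G intensity (None if G is 0)."""
    g_mean = mean(g_pixels)
    return mean(r_pixels) / g_mean if g_mean > 0 else None

def median_of_ratios(r_pixels, g_pixels):
    """Median of the pixel-wise R/G ratios, skipping pixels with G = 0."""
    ratios = [r / g for r, g in zip(r_pixels, g_pixels) if g > 0]
    return median(ratios) if ratios else None

def regression_slope(r_pixels, g_pixels):
    """Least-squares slope of R against G over all pixel locations."""
    g_mean, r_mean = mean(g_pixels), mean(r_pixels)
    sxx = sum((g - g_mean) ** 2 for g in g_pixels)
    if sxx == 0:
        return None  # G does not vary, slope is undefined
    sxy = sum((g - g_mean) * (r - r_mean)
              for r, g in zip(r_pixels, g_pixels))
    return sxy / sxx

# Hypothetical pixels of one spot where R is twice G everywhere
g_pixels = [10, 20, 30, 40]
r_pixels = [20, 40, 60, 80]
estimates = (ratio_of_means(r_pixels, g_pixels),
             median_of_ratios(r_pixels, g_pixels),
             regression_slope(r_pixels, g_pixels))
```

On clean data the three estimates agree; they differ when a few saturated or dusty pixels dominate the spot, in which case the median-based estimate is the least sensitive to them.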
- Rescaling of different slides (experiments)
  - Different labeling quality between slides can result in systematic bias in the expression levels
  - Linear regression (or smoothing curve) of all genes present in both slides (or housekeeping genes)
  - Rescaling factor = 1 / slope of regression; new intensity = old intensity * rescaling factor
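The rescaling rule above can be sketched as follows. Two choices here are assumptions and not stated in the slides: the regression is fitted through the origin, so that a single multiplicative factor is meaningful, and the second slide is regressed on the first (reference) slide. The function names are hypothetical.

```python
def slope_through_origin(reference, other):
    """Least-squares slope of `other` against `reference`, no intercept."""
    sxx = sum(x * x for x in reference)
    sxy = sum(x * y for x, y in zip(reference, other))
    return sxy / sxx

def rescale(reference, other):
    """Rescale `other` so that its intensities match the reference slide.

    Rescaling factor = 1 / slope; new intensity = old intensity * factor.
    """
    factor = 1.0 / slope_through_origin(reference, other)
    return [value * factor for value in other], factor

# Hypothetical intensities of the same three genes on two slides;
# slide B was labeled 1.5 times more efficiently than slide A.
slide_a = [100.0, 200.0, 400.0]
slide_b = [150.0, 300.0, 600.0]
slide_b_rescaled, factor = rescale(slide_a, slide_b)
```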
- Filtering of irrelevant genes
  - Retain genes with R/G ratios ≥ 2 or 3 in at least two experiments
  - Remove genes with missing values in x% of the experiments
  - Remove low variance / standard genes (housekeeping genes)
- Log2 transformation of the R/G ratios: log scale is more intuitive

         2-fold induction   constant expression   2-fold repression
  R/G    2                  1                     1/2
  ...
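A minimal sketch of the filtering and log2 steps above. The function names and default cut-offs are assumptions: the slides leave the missing-value percentage x open, so the 50% used here is only a placeholder. Missing ratios are represented by `None`.

```python
from math import log2

def keep_gene(ratios, min_ratio=2.0, min_experiments=2,
              max_missing_fraction=0.5):
    """Retain a gene if it reaches min_ratio in enough experiments and
    does not have too many missing values (None)."""
    missing = sum(r is None for r in ratios)
    if missing / len(ratios) > max_missing_fraction:
        return False
    high = sum(r is not None and r >= min_ratio for r in ratios)
    return high >= min_experiments

def log2_ratios(ratios):
    """Log2-transform R/G ratios; missing or non-positive values stay None."""
    return [None if r is None or r <= 0 else log2(r) for r in ratios]

# On the log2 scale induction and repression become symmetric around 0
symmetric = log2_ratios([2, 1, 0.5])  # [1.0, 0.0, -1.0]

# Hypothetical genes measured in four experiments
induced = keep_gene([2.5, 3.1, 1.0, None])       # two ratios >= 2
flat = keep_gene([1.1, 0.9, 1.0, 1.2])           # never reaches 2
sparse = keep_gene([4.0, None, None, None])      # too many missing values
```

This is why the log scale is described as more intuitive: a 2-fold induction and a 2-fold repression sit at the same distance from zero.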
- Single-slide analysis
  - Identify genes that are expressed in control and not in test sample
  - Identify genes that are expressed in test sample only
  - ⇒ condition-specific genes
- Analysis of multiple slides
  - Cluster analysis of gene expression profiles
  - Partitions samples or genes into well-separated and
- Which genes are considered up- or downregulated?
- VIB-MAF: mouse arrays, duplicate spots on the same slide
  - Frequency distribution: ratio of left spot against ratio of right spot for each gene
  - Average m = 1 as expected
  - Variation around m less than threefold
  - ⇒ Differential expression of less than threefold may not be statistically reliable
- Principal component analysis (PCA) detects the directions that capture the most information about the data
  [Figure: original coordinate system]
- PCA is performed by a linear transformation of the data set based on the Singular Value Decomposition (SVD)
- PCA can be applied to genes as well as samples
  - "Eigengenes"
  - "Eigensamples"
- Successive principal components capture less and less information about the data
- We can truncate the representation of the data to a limited number of principal components = dimensionality reduction
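The PCA steps above can be sketched with NumPy's SVD. Assumptions in this sketch: rows are genes, columns are experiments, columns are mean-centered before the decomposition, and "information" is measured as the fraction of variance per component. The function name and the data are hypothetical.

```python
import numpy as np

def pca_svd(data, n_components):
    """PCA of a (genes x experiments) matrix via the SVD.

    Returns the truncated scores, the principal directions and the
    fraction of variance captured by every component.
    """
    x = np.asarray(data, dtype=float)
    centered = x - x.mean(axis=0)          # center every experiment
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    explained = s ** 2 / np.sum(s ** 2)    # decreasing by construction
    scores = u[:, :n_components] * s[:n_components]
    return scores, vt[:n_components], explained

# Hypothetical log ratios: 4 genes measured in 3 experiments
data = [[1.0, 2.0, 3.0],
        [2.0, 4.0, 6.1],
        [3.0, 6.0, 8.9],
        [4.0, 8.0, 12.2]]
scores, directions, explained = pca_svd(data, n_components=2)
```

Keeping only the first columns of `scores` is the dimensionality reduction described above; `explained` shows how little the later components contribute.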
- Preprocessing of the data
  - Mean (or median) centering: mean (or median) of ratios = 0
  - Normalization: standard deviation of ratios = 1
  - Reduces noise caused by differences in RNA yield, labeling efficiency and image analysis
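The two preprocessing steps above amount to standardizing a vector of (log) ratios. The sketch below assumes mean centering, the population standard deviation, and that the function is applied to one profile at a time; the slides do not say whether this is done per gene or per experiment.

```python
from statistics import mean, pstdev

def standardize(ratios):
    """Center a profile to mean 0 and scale it to standard deviation 1."""
    m = mean(ratios)
    centered = [value - m for value in ratios]
    sd = pstdev(centered)
    if sd == 0:
        return centered  # constant profile: nothing to scale
    return [value / sd for value in centered]

# Hypothetical log2 ratios of one gene across four experiments
profile = standardize([1.0, 0.0, -1.0, 2.0])
```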
- Cluster algorithms
  - Hierarchical clustering
  - K-means clustering
  - Self-Organizing Maps
- Evaluate pairwise similarities between genes
- For 2 genes X (x_1, x_2, ..., x_m) and Y (y_1, y_2, ..., y_m), the Pearson correlation coefficient (= similarity) is

$$ r = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \bar{x}}{\sigma_X} \right) \left( \frac{y_i - \bar{y}}{\sigma_Y} \right) $$

  where N = m is the number of measurements per gene, \bar{x} and \bar{y} are the means and \sigma_X and \sigma_Y the standard deviations of the two profiles
- Distance = 1 - similarity
- ⇒ Distance matrix:

           gene 1               gene 2               gene 3               ...   gene n
  gene 1   0                    ...
  gene 2   Dist(gene1,gene2)    0                    ...
  gene 3   Dist(gene1,gene3)    Dist(gene2,gene3)    0                    ...
  ...      ...                  ...                  ...                  ...
  gene n   Dist(gene1,gene n)   Dist(gene2,gene n)   Dist(gene3,gene n)   ...   0
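The similarity and the distance matrix can be computed as in the sketch below. It assumes the 1/N form of the Pearson coefficient with population standard deviations, and non-constant profiles (a constant profile would give a division by zero). The function names and the example profiles are hypothetical.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation of two expression profiles of equal length."""
    n = len(x)
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    return sum(((xi - mx) / sx) * ((yi - my) / sy)
               for xi, yi in zip(x, y)) / n

def distance_matrix(profiles):
    """Full symmetric matrix with distance = 1 - similarity."""
    return [[1.0 - pearson(a, b) for b in profiles] for a in profiles]

# Hypothetical log ratios of three genes in three experiments
gene1 = [1.0, 2.0, 3.0]
gene2 = [2.0, 4.0, 6.0]   # same shape as gene 1
gene3 = [3.0, 2.0, 1.0]   # opposite shape
dist = distance_matrix([gene1, gene2, gene3])
```

Because the correlation ignores scale, gene 1 and gene 2 end up at distance 0 even though their absolute ratios differ, while the anticorrelated gene 3 sits at the maximal distance of 2.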
- Find pair of genes with shortest distance (e.g., gene 2 and gene 3)
- Substitute ratios for gene 2 and gene 3 in data matrix by average ratios for gene 2 and gene 3 combined
- ⇒ Data matrix:

               exp1                                exp2                                ...   expm
  gene 1       log2(R1/G1)                         log2(R1/G1)                         ...   log2(R1/G1)
  gene 2 + 3   (log2(R2/G2) + log2(R3/G3)) / 2     (log2(R2/G2) + log2(R3/G3)) / 2     ...   (log2(R2/G2) + log2(R3/G3)) / 2
  ...          ...                                 ...                                 ...   ...

- Recalculate pairwise similarities
- ⇒ Distance matrix:

               gene 1                gene 2 + 3   ...   gene n
  gene 1       0                     ...
  gene 2 + 3   Dist(gene1,gene2+3)   0            ...
  ...          ...                   ...          ...

- Repeat until distance matrix contains 1 element
- Visualization: dendrogram (= tree)
- Length of each branch is indicative of distance between clusters
  - Average distance between centers
  - Maximal distance between clusters
  - Minimal distance between clusters
- Visual determination
  - Distance measure influences outcome of clustering
  - Lack of robustness
  - Non-unique
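The merge loop of hierarchical clustering described above can be sketched as follows. Assumptions in this sketch: distance = 1 - Pearson correlation, a merged cluster is represented by the average profile of all its member genes (for the first merge this is the plain average of the two genes, as in the data matrix above), and a constant profile is given a correlation of 0. All names are hypothetical.

```python
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) * sx * sy)

def hierarchical_clustering(names, profiles):
    """Repeatedly merge the closest pair until one cluster remains.

    Returns the merges as (name_a, name_b, distance), in merge order.
    """
    clusters = [(n, list(p), 1) for n, p in zip(names, profiles)]
    merges = []
    while len(clusters) > 1:
        # find the pair with the shortest distance
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = 1.0 - pearson(clusters[i][1], clusters[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        (ni, pi, si), (nj, pj, sj) = clusters[i], clusters[j]
        # substitute the two rows by their (size-weighted) average profile
        merged = [(a * si + b * sj) / (si + sj) for a, b in zip(pi, pj)]
        merges.append((ni, nj, d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append((f"({ni}+{nj})", merged, si + sj))
    return merges

# Hypothetical log ratios of three genes in four experiments
names = ["gene 1", "gene 2", "gene 3"]
profiles = [[0.0, 1.0, 2.0, 3.0],
            [3.0, 2.0, 1.0, 0.5],
            [2.9, 2.1, 1.0, 0.0]]
merges = hierarchical_clustering(names, profiles)
```

The recorded merge distances are what a dendrogram draws as branch lengths: gene 2 and gene 3 join first at a short distance, and gene 1 joins last.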
1. Predefined number of clusters = 5; initialisation: randomly choose cluster centers (red points)
2. Attribute each point (gene) to the cluster with the closest center
3. Recalculate cluster centers = mean expression profile of genes in cluster
4. Repeat steps 2 and 3 until the centers remain stationary under a new assignment
- Visualization: gene expression profiles per cluster
- Drawbacks
  - Number of clusters is predefined
  - All genes are assigned to a cluster
  - Lack of robustness
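The four K-means steps above can be sketched in plain Python. Assumptions: squared Euclidean distance, initial centers drawn from the data points themselves, and a fixed random seed for reproducibility. The function names and the toy data (two clusters instead of five) are hypothetical.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, seed=0, max_iter=100):
    """Return (labels, centers) after iterating assignment and update."""
    rng = random.Random(seed)
    # step 1: randomly choose k cluster centers
    centers = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # step 2: attribute each point to the cluster with the closest center
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        # step 3: recalculate centers as the mean profile of their members
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members)
                                    for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        # step 4: stop when the centers remain stationary
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Hypothetical expression profiles forming two well-separated groups
points = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
          [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
labels, centers = kmeans(points, k=2)
```

Running the function with different seeds on noisier data illustrates the lack of robustness noted above: the final clusters can depend on the initial centers.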
- Choose a geometry of nodes (e.g., 3 x 2 rectangular grid)
- Nodes are mapped into m-dimensional space
- A data point p is selected randomly and the nodes are moved in the direction of p
  - Node closest to p (Np) is moved most
  - Other nodes are moved by smaller amounts ~ distance to Np in initial grid
- 20,000 to 50,000 iterations
- Nodes = cluster centers and neighbouring
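The self-organizing map update described above can be sketched as follows. Several details are assumptions and not given in the slides: nodes are initialised at randomly chosen data points, the neighbourhood weight is a Gaussian of the grid distance to Np, and both the learning rate and the neighbourhood radius shrink linearly over the iterations. All names and default values are hypothetical.

```python
import math
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_som(data, rows=3, cols=2, iterations=20000, seed=0,
              learning_rate=0.1, radius=1.5):
    """Return (grid positions, node vectors) of a rows x cols map."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # nodes are mapped into the m-dimensional space of the data
    nodes = [list(rng.choice(data)) for _ in grid]
    for t in range(iterations):
        shrink = 1.0 - t / iterations
        rate = learning_rate * shrink
        width = max(radius * shrink, 0.3)
        p = rng.choice(data)  # randomly selected data point
        # Np: the node closest to p in data space
        winner = min(range(len(nodes)),
                     key=lambda k: squared_distance(nodes[k], p))
        wr, wc = grid[winner]
        for k, (r, c) in enumerate(grid):
            # nodes far from Np on the initial grid move less
            grid_d2 = (r - wr) ** 2 + (c - wc) ** 2
            weight = math.exp(-grid_d2 / (2 * width ** 2))
            nodes[k] = [w + rate * weight * (x - w)
                        for w, x in zip(nodes[k], p)]
    return grid, nodes

# Hypothetical two-dimensional profiles; fewer iterations to keep it fast
data = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
grid, nodes = train_som(data, iterations=2000)
```

After training, each gene can be assigned to its closest node, so the node vectors play the role of cluster centers.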