
Yves Moreau and Janick Mathys Master of Bioinformatics Katholieke Universiteit Leuven 2001-2002


Academic year: 2021


(1)

Yves Moreau and Janick Mathys

Master of Bioinformatics

Katholieke Universiteit Leuven

2001-2002

(2)

• Genome-wide monitoring of gene activities by measurement of the levels of RNA transcripts
• Massively parallel
• Fully automated
• Standardizable

(3)

• Identification of regulatory mechanisms
• Modeling of genetic networks for function identification and transgenesis
• Drug development: study gene expression in
  - Disease
  - Model systems
  - Pathogens
  - Response to drug treatment
• Diagnosis
• Survey of Single Nucleotide Polymorphisms

(4)

[Figure: applications NOW and in the FUTURE. Labels: Prediction, Preventative Medicine, Follow-up, Treatment, Diagnostics, SNPs, Functional Genomics, Expression Monitoring]

(5)
(6)

• Nylon membrane
• Spotted cDNA

(7)

• 5000 cDNAs (genes, ESTs) spotted on array
  - Glass slide
  - Cy3, Cy5 labeling of samples

(8)
(9)

• Silicon substrate
• In situ synthesis of oligonucleotides (25 bases) by photolithography
• Multiple probes per gene (match + mismatch)

(10)

• Glass slide
• Inkjet spotting
  - cDNA clones (no contact = better liquid handling)
  - OR in situ synthesis of long oligonucleotides (60 bases)
• Long oligos more specific than short oligos

(11)

• Filters and oligonucleotide arrays
  - Single measurement per feature
  - Absolute expression level
• cDNA arrays
  - Two samples per experiment
    - Test
    - Control
  - Measurement is ratio of gene expression in test sample vs. control sample
  - Better reproducibility

(12)

• Wild-type vs. mutant
  - Knock-out, conditional knock-out
  - Overexpression construct, inducible overexpression
  - Detection of the targets of a transcription factor
• Groups of patient samples
  - Different types of tumors
• Multiple conditions
  - Expression patterns in the presence of drugs or toxins
• Time-course experiment
  - Response to a signal
  - Stress

(13)
(14)

• Raw images

(15)

• Superimposed two channels to form color image
  - Red spot: gene was only expressed in test sample
  - Green: gene was only expressed in control sample
  - Yellow: gene was expressed both in test and in control sample

(16)

• Spot detection
  - Spots are not perfectly circular
  - Thresholding
    - Fixed threshold method
      - Threshold T is derived from local background mean intensity m and standard deviation σ by the relationship T = m + 3σ
      - Problems due to variability of background and target signal, particularly with weak signals (frequent finding in cDNA array experiments)
    - Mann-Whitney method
      - Take sample pixels from background and perform rank-sum hypothesis test on target pixels
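The two thresholding ideas above can be sketched in a few lines. This is only an illustration under stated assumptions: the pixel values are made up, the function names are hypothetical, and the rank-sum test uses a normal approximation without tie correction, which the slides do not specify.

```python
import math
import statistics


def fixed_threshold(background_pixels):
    """Return T = m + 3*sigma from local background pixel intensities."""
    m = statistics.mean(background_pixels)
    sigma = statistics.pstdev(background_pixels)
    return m + 3 * sigma


def mann_whitney_p(target_pixels, background_pixels):
    """One-sided p-value that target pixels are brighter than background.

    Computes the U statistic by pair counting and uses a normal
    approximation without tie correction (an assumption of this sketch).
    """
    n1, n2 = len(target_pixels), len(background_pixels)
    u = 0.0
    for t in target_pixels:
        for b in background_pixels:
            if t > b:
                u += 1.0
            elif t == b:
                u += 0.5
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return 0.5 * math.erfc(z / math.sqrt(2))


# Made-up gray levels for one spot and its local background.
background = [98, 102, 100, 97, 103, 101, 99, 100]
target = [140, 155, 150, 138, 160, 149, 152, 145]

T = fixed_threshold(background)
spot_mask = [p > T for p in target]          # pixels called "spot"
p_value = mann_whitney_p(target, background)  # small value = spot present
```

The fixed threshold depends entirely on the background estimate, which is why it struggles with weak signals; the rank-sum test compares the two pixel samples directly.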

(17)

• Background intensity extraction
  - Background is not uniform ⇒ extract local background intensities
  - Gray level histogram
  - Mean local background intensity
  - Standard deviation of the local background intensity
• Treat local red background intensity and local green

(18)

• Spot intensity extraction
  - Unite target regions detected for red and green channel
  - Probe intensity = average gray level within target region for R and G separately
  - Subtract local background values from reported probe intensities for R and G separately
    - R intensity = raw R intensity - local R background
    - G intensity = raw G intensity - local G background
  - Fluorescent intensities = significant if average spot intensity is at least 2 standard deviations higher than corresponding average background intensity
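A minimal sketch of the background subtraction and the significance rule above, for one channel at a time. The pixel values and function names are hypothetical, and the sketch assumes that the "2 standard deviations" refer to the standard deviation of the local background.

```python
import statistics


def corrected_intensity(spot_pixels, background_pixels):
    """Average spot gray level minus the mean local background."""
    return statistics.mean(spot_pixels) - statistics.mean(background_pixels)


def is_significant(spot_pixels, background_pixels, n_sd=2.0):
    """True if the mean spot intensity is at least n_sd background
    standard deviations above the mean background intensity."""
    bg_mean = statistics.mean(background_pixels)
    bg_sd = statistics.pstdev(background_pixels)
    return statistics.mean(spot_pixels) >= bg_mean + n_sd * bg_sd


# Made-up pixel values for the red (R) and green (G) channel of one spot.
red_spot, red_bg = [520, 540, 500, 560], [100, 110, 90, 100]
green_spot, green_bg = [105, 95, 110, 100], [100, 104, 96, 100]

r_intensity = corrected_intensity(red_spot, red_bg)
g_intensity = corrected_intensity(green_spot, green_bg)
r_ok = is_significant(red_spot, red_bg)
g_ok = is_significant(green_spot, green_bg)
```

In this example the red channel passes the test while the green channel is indistinguishable from background, so a ratio for this spot would not be reliable.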

(19)
(20)

• Calculation of the relative (R/G) expression ratios
  - Mean or median R intensity / mean or median G intensity
  - Mean or median of ratios of R/G intensities from every pixel location
  - Linear regression slope of R-G gray-values from every pixel location
• !!! Ideally all genes should be expressed in the control sample (G)
  - Not true in practice
  - Intensity G = 0 ⇒ ratio = intensity R / 0 (! missing value)
  - Reverse labeling can help here (adjust for variations between R and G
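The three ways of computing the R/G ratio listed above can be compared on the same pixels. This is a sketch with made-up, background-corrected pixel values; the function names are hypothetical, the regression is forced through the origin (an assumption, the slides only say "linear regression slope"), and a zero green signal is reported as a missing value (`None`).

```python
import statistics


def ratio_of_means(r_pixels, g_pixels):
    """Mean R intensity divided by mean G intensity; None if G is zero."""
    g_mean = statistics.mean(g_pixels)
    if g_mean <= 0:
        return None  # missing value: division by zero is undefined
    return statistics.mean(r_pixels) / g_mean


def median_of_ratios(r_pixels, g_pixels):
    """Median of the per-pixel R/G ratios, skipping pixels with G = 0."""
    ratios = [r / g for r, g in zip(r_pixels, g_pixels) if g > 0]
    return statistics.median(ratios) if ratios else None


def regression_slope(r_pixels, g_pixels):
    """Least-squares slope of R against G, line through the origin."""
    sgg = sum(g * g for g in g_pixels)
    if sgg == 0:
        return None
    return sum(r * g for r, g in zip(r_pixels, g_pixels)) / sgg


r = [200, 400, 600, 800]
g = [100, 200, 300, 400]

estimates = (ratio_of_means(r, g), median_of_ratios(r, g), regression_slope(r, g))
missing = ratio_of_means(r, [0, 0, 0, 0])
```

On clean data the three estimates agree; they differ mainly in how strongly single outlier pixels influence the result.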

(21)

• Rescaling of different slides (experiments)
  - Different labeling quality between slides can result in systematic bias in the expression levels
  - Linear regression (or smoothing curve) of all genes present in both slides (or housekeeping genes)
  - Rescaling factor = 1 / slope of regression; new intensity = old intensity * rescaling factor
• Filtering of irrelevant genes
  - Retain genes with R/G ratios ≥ 2 or 3 in at least two experiments
  - Remove genes with missing values in x% of the experiments
  - Remove low variance / standard genes (housekeeping genes)
• Log2 transformation of the R/G ratios: log scale is more intuitive

             2-fold induction   constant expression   2-fold repression
  R/G        2                  1                     1/2
  log2(R/G)  1                  0                     -1
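The rescaling and the log2 transformation above can be sketched as follows. The intensities are made up and the function names are hypothetical. The sketch assumes that the slide to be rescaled is regressed on a reference slide using ordinary least squares, and that missing ratios are carried along as `None`.

```python
import math


def regression_slope(x, y):
    """Ordinary least-squares slope of y on x (with intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def rescale_slide(reference, slide):
    """Multiply all intensities on `slide` by 1 / slope of the regression."""
    factor = 1 / regression_slope(reference, slide)
    return [v * factor for v in slide]


def log2_ratios(ratios):
    """Log2-transform R/G ratios; missing or non-positive ratios stay None."""
    return [math.log2(r) if r is not None and r > 0 else None for r in ratios]


# Intensities of the same genes on two slides; the second is 1.5x brighter.
reference = [100, 200, 300, 400]
slide = [150, 300, 450, 600]
rescaled = rescale_slide(reference, slide)

logged = log2_ratios([2, 1, 0.5, None])
```

After the transformation a 2-fold induction and a 2-fold repression lie at the same distance from zero (+1 and -1), which is the symmetry the slide refers to.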

(22)

• Single-slide analysis
  - Identify genes that are expressed in control and not in test sample
  - Identify genes that are expressed in test sample only
  - ⇒ condition-specific genes
• Analysis of multiple slides
  - Cluster analysis of gene expression profiles
  - Partitions samples or genes into well-separated and

(23)

• Which genes are considered up- or downregulated?
• VIB-MAF: mouse arrays, duplicate spots on the same slide
  - Frequency distribution: ratio of left spot against ratio of right spot for each gene
  - Average m = 1 as expected
  - Variation around m less than threefold
  - ⇒ Less than threefold differential expression levels may not be statistically reliable
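As an illustration of the threefold rule of thumb above, a gene can be called up- or downregulated only when its ratio leaves the band observed between duplicate spots. The function name and the labels it returns are hypothetical; the cut-off of 3 is the value quoted on the slide and would need to be re-estimated for other arrays.

```python
import math


def call_regulation(ratio, fold_cutoff=3.0):
    """Classify an R/G ratio using a symmetric fold-change cut-off."""
    if ratio is None or ratio <= 0:
        return "missing"
    log_ratio = math.log2(ratio)
    limit = math.log2(fold_cutoff)
    if log_ratio >= limit:
        return "up"
    if log_ratio <= -limit:
        return "down"
    return "unchanged"


calls = [call_regulation(r) for r in (4.0, 0.25, 2.0, None)]
```

Working on the log scale makes the cut-off symmetric: a ratio of 1/3 is treated the same way as a ratio of 3.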

(24)
(25)
(26)
(27)

• PCA detects the directions that capture the most information about the data

[Figure: data shown in the original coordinate system]

(28)

• PCA is performed by a linear transformation of the data set based on the Singular Value Decomposition (SVD)
• PCA can be applied to genes as well as samples
  - “Eigengenes”
  - “Eigensamples”
• Successive principal components capture less and less information about the data
• We can truncate the representation of the data to a limited number of principal components = dimensionality reduction
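The link between SVD and truncation can be written out as a short sketch. The symbols are assumptions introduced here and do not appear on the slides: X is the centered data matrix, r its rank, s_j its singular values, and u_j, v_j the corresponding left and right singular vectors.

```latex
X = U \Sigma V^{\mathsf{T}}, \qquad
\Sigma = \operatorname{diag}(s_1, \dots, s_r), \qquad
s_1 \ge s_2 \ge \dots \ge s_r > 0

X_k = \sum_{j=1}^{k} s_j \, u_j v_j^{\mathsf{T}}, \qquad
\text{fraction of variance retained} =
\frac{\sum_{j=1}^{k} s_j^{2}}{\sum_{j=1}^{r} s_j^{2}}
```

Because the singular values are sorted in decreasing order, each additional component adds less to the retained fraction, so keeping only the first k components gives the dimensionality reduction described above.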

(29)

• Preprocessing of the data
  - Mean (or median) centering: mean (or median) of ratios = 0
  - Normalization: standard deviation of ratios = 1
  - Reduces noise caused by differences in RNA yield, labeling efficiency and image analysis
• Cluster algorithms
  - Hierarchical clustering
  - K-means clustering
  - Self-Organizing Maps
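The preprocessing step listed above (centering followed by normalization) might look like this for a single expression profile. The function name is hypothetical, and the sketch assumes the profile holds log ratios without missing values and uses the population standard deviation.

```python
import statistics


def standardize(profile):
    """Mean-center a profile and scale it to standard deviation 1."""
    mean = statistics.mean(profile)
    sd = statistics.pstdev(profile)
    if sd == 0:
        # A flat profile carries no pattern; return zeros instead of dividing by 0.
        return [0.0 for _ in profile]
    return [(v - mean) / sd for v in profile]


z = standardize([1.0, 2.0, 3.0])
flat = standardize([5.0, 5.0, 5.0])
```

After this step two genes with the same shape of profile but different amplitudes become directly comparable, which is what the clustering algorithms rely on.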

(30)

• Evaluate pairwise similarities between genes
• For 2 genes X (x_1, x_2, ..., x_m) and Y (y_1, y_2, ..., y_m), the Pearson correlation coefficient (= similarity) is

  r = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i - \bar{X}}{\sigma_X} \right) \left( \frac{Y_i - \bar{Y}}{\sigma_Y} \right)

  with X_i = x_i, Y_i = y_i and N = m; \bar{X}, \bar{Y} are the means and \sigma_X, \sigma_Y the standard deviations of the two profiles
• Distance = 1 - similarity
• Distance matrix:

           gene 1              gene 2              gene 3              ...  gene n
  gene 1   0                   ...
  gene 2   Dist(gene1,gene2)   0                   ...
  gene 3   Dist(gene1,gene3)   Dist(gene2,gene3)   0                   ...
  ...      ...                 ...                 ...                 ...
  gene n   Dist(gene1,gene n)  Dist(gene2,gene n)  Dist(gene3,gene n)  ...  0
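A minimal sketch of the Pearson similarity and of the distance matrix built from 1 - similarity. The profiles are made up and the function names are hypothetical; the sketch assumes complete profiles of equal length and does not handle constant profiles, for which the standard deviation is zero.

```python
import math


def pearson(x, y):
    """Pearson correlation of two profiles (population standard deviations)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / n


def distance_matrix(profiles):
    """Symmetric matrix of 1 - Pearson correlation, zeros on the diagonal."""
    n = len(profiles)
    return [
        [0.0 if i == j else 1 - pearson(profiles[i], profiles[j]) for j in range(n)]
        for i in range(n)
    ]


profiles = [
    [1.0, 2.0, 3.0],  # gene 1
    [2.0, 4.0, 6.0],  # gene 2: same shape as gene 1
    [3.0, 2.0, 1.0],  # gene 3: opposite shape
]
dist = distance_matrix(profiles)
```

With this measure the distance ranges from 0 (perfectly correlated profiles) to 2 (perfectly anticorrelated profiles), regardless of the amplitude of the profiles.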

(31)

• Find pair of genes with shortest distance (e.g., gene 2 and gene 3)
• Substitute ratios for gene 2 and gene 3 in data matrix by average ratios for gene 2 and gene 3 combined
• Data matrix:

               exp1                             exp2                             ...  expm
  gene 1       log2(R1/G1)                      log2(R1/G1)                      ...  log2(R1/G1)
  gene 2 + 3   (log2(R2/G2) + log2(R3/G3)) / 2  (log2(R2/G2) + log2(R3/G3)) / 2  ...  (log2(R2/G2) + log2(R3/G3)) / 2
  ...          ...                              ...                              ...  ...

• Recalculate pairwise similarities

(32)

• Distance matrix:

               gene 1               gene 2 + 3  ...  gene n
  gene 1       0                    ...
  gene 2 + 3   Dist(gene1,gene2+3)  0           ...
  ...          ...                  ...         ...

• Repeat until distance matrix contains 1 element
• Visualization: dendrogram (= tree)
• Length of each branch is indicative of distance between clusters
  - Average distance between centers
  - Maximal distance between clusters
  - Minimal distance between clusters
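The merge loop described on the last two slides can be sketched as follows. The gene names, profiles and function names are made up for illustration. As on the slides, the two closest entries are replaced by the plain average of their profiles; the sketch records the merge order instead of drawing a dendrogram, and it recomputes all distances in every round, which is simple but slow for large data sets.

```python
import math


def pearson_distance(x, y):
    """1 - Pearson correlation of two profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    r = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n * sx * sy)
    return 1 - r


def hierarchical_clustering(profiles):
    """Return the merge history as a list of (name_a, name_b, distance)."""
    clusters = {name: list(values) for name, values in profiles.items()}
    merges = []
    while len(clusters) > 1:
        names = list(clusters)
        best = None
        # Find the pair with the shortest distance.
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                d = pearson_distance(clusters[names[i]], clusters[names[j]])
                if best is None or d < best[0]:
                    best = (d, names[i], names[j])
        d, a, b = best
        # Replace the pair by the average of the two profiles.
        merged = [(u + v) / 2 for u, v in zip(clusters.pop(a), clusters.pop(b))]
        clusters[f"({a}+{b})"] = merged
        merges.append((a, b, d))
    return merges


genes = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [1.0, -1.0, 1.0, -1.0],
    "gene3": [1.1, -0.9, 1.2, -1.0],
}
merges = hierarchical_clustering(genes)
```

The distances stored in the merge history correspond to the branch lengths of the dendrogram: the first merge joins the most similar pair, later merges join increasingly dissimilar groups.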

(33)

• Visual determination

(34)

• Distance measure influences outcome of clustering
• Lack of robustness
• Nonunique

(35)

1. Predefined number of clusters = 5; initialisation: randomly choose cluster centers (red points)
2. Attribute each point (gene) to cluster with closest center
3. Recalculate cluster centers = mean expression profile of genes in cluster
4. Repeat the whole process with a new assignment of points until centers remain stationary
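The four steps above can be sketched as a short function. The data points and names are hypothetical, squared Euclidean distance is assumed as the distance measure, and the initial centers are drawn from the data points themselves, which is one common choice that the slide does not prescribe.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def k_means(points, k, seed=0, max_iter=100):
    """Return (centers, assignment) after iterating until assignments are stable."""
    rng = random.Random(seed)
    # Step 1: choose k distinct data points as initial cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 2: attribute each point to the cluster with the closest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Step 4: stop once the assignment (and thus the centers) no longer changes.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recalculate each center as the mean profile of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, assignment = k_means(points, k=2)
```

Different seeds can lead to different solutions on less clearly separated data, which illustrates why the result depends on the random initialisation.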

(36)

• Visualization: gene expression profiles per cluster
• Drawbacks
  - Number of clusters is predefined
  - All genes are assigned to a cluster
  - Lack of robustness

(37)

• Choose a geometry of nodes (e.g., 3 x 2 rectangular grid)
• Nodes are mapped into m-dimensional space
• A data point p is selected randomly and the nodes are moved in the direction of p
  - Node closest to p (Np) is moved most
  - Other nodes are moved by smaller amounts, depending on their distance to Np in the initial grid
• 20,000 to 50,000 iterations
• Nodes = cluster centers and neighbouring
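A rough sketch of the training loop described above. Several details are assumptions of this sketch and are not given on the slide: nodes start at randomly chosen data points, the learning rate and neighbourhood radius shrink linearly over time, and the influence on neighbouring nodes falls off as a Gaussian of their grid distance to the winning node Np. All names and data are hypothetical.

```python
import math
import random


def train_som(data, rows=3, cols=2, iterations=20000, seed=0):
    """Return (grid, nodes): grid positions and trained node vectors."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Map each grid node into data space by starting at a random data point.
    nodes = [list(rng.choice(data)) for _ in grid]
    for t in range(iterations):
        frac = 1 - t / iterations                 # goes from 1 towards 0
        rate = 0.5 * frac                         # assumed learning-rate schedule
        radius = 0.1 + max(rows, cols) * frac     # assumed neighbourhood radius
        p = rng.choice(data)                      # randomly selected data point
        # Np: the node closest to p in data space.
        winner = min(
            range(len(nodes)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(nodes[i], p)),
        )
        for i, pos in enumerate(grid):
            grid_dist = math.hypot(pos[0] - grid[winner][0], pos[1] - grid[winner][1])
            influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            # Move the node towards p; Np itself has influence 1 and moves most.
            nodes[i] = [w + rate * influence * (x - w) for w, x in zip(nodes[i], p)]
    return grid, nodes


data = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]]
grid, nodes = train_som(data, iterations=2000)
```

After training, each data point can be assigned to its closest node, so the nodes play the role of cluster centers arranged on the chosen grid.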

(38)
