Microarray data analysis

(1)

Microarray data analysis

Jonathan Pevsner, Ph.D.

Introduction to Bioinformatics pevsner@jhmi.edu

Johns Hopkins School of Public Health (260.602.01)

September 22, 2004

(2)

Copyright notice

Many of the images in this powerpoint presentation are from Bioinformatics and Functional Genomics by Jonathan Pevsner (ISBN 0-471-21004-8).

Copyright © 2003 by John Wiley & Sons, Inc.

These images and materials may not be used without permission from the publisher. We welcome instructors to use these powerpoints for educational purposes, but please acknowledge the source.

The book has a homepage at http://www.bioinfbook.org including hyperlinks to the book chapters.

(3)

Schedule

Today: microarray data analysis (Chapter 7)

Friday: computer lab (microarray data analysis)

Monday: Protein analysis (Chapter 8)

Wednesday: Protein structure (Chapter 9)

(4)

Microarray data analysis

• begin with a data matrix (gene expression values versus samples)

Fig. 7.1 Page 190

(5)

Microarray data analysis

• begin with a data matrix (gene expression values versus samples)

Typically, there are many genes (>> 10,000) and few samples (~ 10)

Fig. 7.1 Page 190

(6)

Microarray data analysis

• begin with a data matrix (gene expression values versus samples)

Preprocessing

Inferential statistics

Descriptive statistics

Fig. 7.1 Page 190

(7)

Microarray data analysis: preprocessing

Observed differences in gene expression could be due to transcriptional changes, or they could be caused by artifacts such as:

• different labeling efficiencies of Cy3, Cy5

• uneven spotting of DNA onto an array surface

• variations in RNA purity or quantity

• variations in washing efficiency

• variations in scanning efficiency

Page 191

(8)

Microarray data analysis: preprocessing

The main goal of data preprocessing is to remove the systematic bias in the data as completely as possible, while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription.

A basic assumption of most normalization procedures is that the average gene expression level does not change in an experiment.

Page 191

(9)

Data analysis: global normalization

Global normalization is used to correct two or more data sets. In one common scenario, samples are labeled with Cy3 (green dye) or Cy5 (red dye) and hybridized to DNA elements on a microarray. After washing, probes are excited with a laser and detected with a scanning confocal microscope.

Page 192

(10)

Data analysis: global normalization

Global normalization is used to correct two or more data sets

Example: total fluorescence in the Cy3 channel = 4 million units; in the Cy5 channel = 2 million units

Then the uncorrected ratio for a gene could show 2,000 units versus 1,000 units. This would artifactually appear to show 2-fold regulation.

Page 192

(11)

Data analysis: global normalization

Global normalization procedure

Step 1: subtract background intensity values (use a blank region of the array)

Step 2: globally normalize so that the average ratio = 1 (apply this to 1-channel or 2-channel data sets)

Page 192
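
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name global_normalize, the background arguments, and the intensity values are hypothetical, and "average ratio = 1" is taken here to mean dividing each ratio by the arithmetic mean of all ratios.

```python
def global_normalize(cy5, cy3, background5=0.0, background3=0.0):
    """Background-subtract two channels, then scale so the mean ratio is 1."""
    # Step 1: subtract a background estimate from every spot.
    red = [v - background5 for v in cy5]
    green = [v - background3 for v in cy3]
    # Step 2: divide each raw ratio by the mean raw ratio.
    raw = [r / g for r, g in zip(red, green)]
    mean_ratio = sum(raw) / len(raw)
    return [x / mean_ratio for x in raw]


# Hypothetical intensities: the Cy3 channel is twice as bright overall,
# as in the 4 million versus 2 million example.
cy3 = [2000.0, 4000.0, 1000.0, 8000.0]
cy5 = [1000.0, 2000.0, 500.0, 4000.0]
ratios = global_normalize(cy5, cy3)
print(ratios)
```

After normalization, the gene measured at 2,000 versus 1,000 units no longer appears to be regulated, because the whole Cy3 channel was brighter by the same factor.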

(12)

Microarray data preprocessing

Some researchers use housekeeping genes for global normalization

Visit the Human Gene Expression (HuGE) Index:

www.HugeIndex.org

Page 192

(13)

Scatter plots

Useful to represent gene expression values from two microarray experiments (e.g. control, experimental)

Each dot corresponds to a gene expression value

Most dots fall along a line

Outliers represent up-regulated or down-regulated genes

Page 193

(14)

Scatter plot analysis of microarray data

Fig. 7.2 Page 193

(15)

Differential Gene Expression in Different Tissue and Cell Types

[Figure panels: Brain, Astrocyte, Astrocyte, Fibroblast]

(16)

[Figure: scatter plot of expression level (sample 1) versus expression level (sample 2), each axis running from low to high; regions of the plot are labeled "up" and "down"]

Fig. 7.2 Page 193

(17)

Log-log transformation

Fig. 7.3 Page 195

(18)

Scatter plots

Typically, data are plotted on log-log coordinates

Visually, this spreads out the data and offers symmetry

time    behavior       raw ratio value    log2 ratio value
t=0     basal          1.0                 0.0
t=1h    no change      1.0                 0.0
t=2h    2-fold up      2.0                 1.0
t=3h    2-fold down    0.5                -1.0

Page 194, 197
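
The symmetry in the table can be checked directly. The short Python sketch below takes the log2 of the four raw ratio values; the variable names are illustrative only.

```python
import math

# Raw ratios from the table above, keyed by time point.
raw_ratios = {"t=0": 1.0, "t=1h": 1.0, "t=2h": 2.0, "t=3h": 0.5}

# On a log2 scale, 2-fold up and 2-fold down are equally far from zero.
log2_ratios = {time: math.log2(ratio) for time, ratio in raw_ratios.items()}

for time, value in log2_ratios.items():
    print(time, raw_ratios[time], value)
```

On the raw scale, 2-fold up (2.0) lies twice as far from 1.0 as 2-fold down (0.5); on the log2 scale both lie one unit from 0.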

(19)

[Figure: log ratio plotted against mean log intensity (expression level from low to high); regions of the plot are labeled "up" and "down"]

Fig. 7.4 Page 196

(20)

SNOMAD converts array data to scatter plots http://snomad.org

[Figure: a linear-linear plot of EXP versus CON beside a log-log plot of Log10(Ratio) versus Mean(Log10(Intensity)); 2-fold lines mark EXP > CON and EXP < CON]

Page 196-197

(21)

SNOMAD corrects local variance artifacts

[Figure: Log10(Ratio) versus Mean(Log10(Intensity)) with a robust local regression fit and 2-fold lines (EXP > CON, EXP < CON); the residuals from the fit are replotted as Corrected Log10(Ratio) versus Mean(Log10(Intensity))]

Page 196-197

(22)

SNOMAD describes regulated genes in Z-scores

[Figure: three panels plotted against Mean(Log10(Intensity)). Two show Corrected Log10(Ratio) with 2-fold lines, the locally estimated standard deviation of positive ratios and of negative ratios, and contours at Z = 1, -1, 2, -2, 5 and -5. One shows the Local Log10(Ratio) Z-Score with lines at Z = 5 and Z = -5.]
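
A rough Python sketch of the idea of a local Z-score follows. It is not SNOMAD's own code: it swaps the robust local regression for a per-bin median, and it uses one standard deviation per intensity bin rather than separate estimates for positive and negative ratios. The function name local_z_scores, the n_bins parameter, and the data are hypothetical.

```python
import statistics


def local_z_scores(mean_log_intensity, log_ratio, n_bins=2):
    """Return one Z-score per gene, computed among genes of similar intensity."""
    # Order genes by intensity so that each bin holds genes of similar brightness.
    order = sorted(range(len(log_ratio)), key=lambda i: mean_log_intensity[i])
    size = -(-len(order) // n_bins)  # ceiling division
    z = [0.0] * len(log_ratio)
    for start in range(0, len(order), size):
        idx = order[start:start + size]
        values = [log_ratio[i] for i in idx]
        center = statistics.median(values)  # stand-in for the local regression fit
        spread = statistics.pstdev(values)  # local standard deviation
        for i in idx:
            z[i] = (log_ratio[i] - center) / spread if spread > 0 else 0.0
    return z


# Hypothetical data: eight genes, the fourth is an outlier among dim spots.
intensity = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ratio = [0.0, 0.1, -0.1, 0.9, 0.5, 0.6, 0.4, 0.5]
z = local_z_scores(intensity, ratio)
most_extreme = max(range(len(z)), key=lambda i: abs(z[i]))
print(most_extreme, z)
```

Because each gene is compared only with genes of similar intensity, a ratio that looks unremarkable overall can still stand out locally, and vice versa.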

(23)

Inferential statistics

Inferential statistics are used to make inferences about a population from a sample.

Hypothesis testing is a common form of inferential statistics. A null hypothesis is stated, such as:

“There is no difference in signal intensity for the gene expression measurements in normal and diseased samples.” The alternative hypothesis is that there is a difference.

We use a test statistic to decide whether to accept or reject the null hypothesis. For many applications, we set the significance level α to p < 0.05.

Page 199

(24)

Inferential statistics

A t-test is a commonly used test statistic to assess the difference in mean values between two groups.

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\text{variability (noise)}}$$

where the numerator is the difference between mean values.

Questions

Is the sample size (n) adequate?

Are the data normally distributed?

Is the variance of the data known?

Is the variance the same in the two groups?

Is it appropriate to set the significance level to p < 0.05?

Page 199
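
As a concrete instance of "difference between mean values over variability", the sketch below computes one common form of the statistic, Welch's unpaired t, which does not assume equal variance in the two groups. The slide does not say which form is intended, so this choice, the function name welch_t, and the expression values are assumptions.

```python
import math
import statistics


def welch_t(group1, group2):
    """t = (mean1 - mean2) / sqrt(var1/n1 + var2/n2), using sample variances."""
    mean1, mean2 = statistics.mean(group1), statistics.mean(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    noise = math.sqrt(var1 / len(group1) + var2 / len(group2))
    return (mean1 - mean2) / noise


# Hypothetical expression values for one gene in two groups of three samples.
normal = [1.0, 2.0, 3.0]
diseased = [4.0, 5.0, 6.0]
t = welch_t(diseased, normal)
print(t)
```

Turning t into a p value also requires the degrees of freedom and a t distribution, which this sketch leaves out.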

(25)

Inferential statistics

Paradigm                       Parametric test    Nonparametric
Compare two unpaired groups    Unpaired t-test    Mann-Whitney test
Compare two paired groups      Paired t-test      Wilcoxon test
Compare 3 or more groups       ANOVA

Table 7-2

Page 198-200

(26)

Inferential statistics

Is it appropriate to set the significance level to p < 0.05?

If you hypothesize that a specific gene is up-regulated, you can set the probability value to 0.05.

You might measure the expression of 10,000 genes and hope that any of them are up- or down-regulated. But you can expect to see 5% (500 genes) regulated at the p < 0.05 level by chance alone. To account for the thousands of repeated measurements you are making, some researchers apply a Bonferroni correction.

The level for statistical significance is divided by the number of measurements, e.g. the criterion becomes:

p < (0.05)/10,000 or p < 5 x 10^-6

Page 199
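
The arithmetic above can be written out as a short Python sketch. The function name bonferroni_threshold and the list of p values are hypothetical.

```python
def bonferroni_threshold(alpha, n_tests):
    """Divide the significance level by the number of measurements."""
    return alpha / n_tests


n_genes = 10_000
# Number of genes expected to pass p < 0.05 by chance alone.
expected_by_chance = 0.05 * n_genes
# Corrected criterion for each individual gene.
threshold = bonferroni_threshold(0.05, n_genes)

# Hypothetical p values for four genes.
p_values = [0.04, 1e-7, 3e-6, 0.5]
significant = [p for p in p_values if p < threshold]
print(expected_by_chance, threshold, significant)
```

A gene with p = 0.04 would pass the uncorrected criterion but not the corrected one.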

(27)

Significance analysis of microarrays (SAM)

SAM

-- an Excel plug-in (URL: page 202)

-- modified t-test

-- adjustable false discovery rate

Page 200

(28)

Fig. 7.7 Page 202

(29)

[Figure: SAM plot with axes labeled "expected" and "observed"; up-regulated and down-regulated genes are marked]

Fig. 7.7 Page 202

(30)

Descriptive statistics

Microarray data are highly dimensional: there are many thousands of measurements made from a small number of samples.

Descriptive (exploratory) statistics help you to find meaningful patterns in the data.

A first step is to arrange the data in a matrix.

Next, use a distance metric to define the relatedness of the different data points. Two commonly used distance metrics are:

-- Euclidean distance

-- Pearson coefficient of correlation

Page 203
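
Both metrics are short enough to write out. The Python sketch below uses two hypothetical gene profiles with the same shape but different magnitude, to show how the two metrics can disagree.

```python
import math


def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Correlation of two profiles: +1 means the same shape, -1 the opposite."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical profiles over three samples: gene_b is twice gene_a.
gene_a = [1.0, 2.0, 3.0]
gene_b = [2.0, 4.0, 6.0]
distance = euclidean_distance(gene_a, gene_b)
correlation = pearson_correlation(gene_a, gene_b)
print(distance, correlation)
```

The Euclidean distance treats the two genes as different because their absolute levels differ, while the Pearson coefficient treats them as closely related because their profiles rise together.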

(31)

Data matrix

(20 genes and 3 time points from Chu et al.)

Fig. 7.8 Page 205

(32)

3D plot (using S-PLUS software); axes: t=0, t=0.5, t=2.0

Fig. 7.8 Page 205

(33)

Descriptive statistics: clustering

Clustering algorithms offer useful visual descriptions of microarray data.

Genes may be clustered, or samples, or both.

We will next describe hierarchical clustering.

This may be agglomerative (building up the branches of a tree, beginning with the two most closely related objects) or divisive (building the tree by finding the most dissimilar objects first).

In each case, we end up with a tree having branches and nodes.

Page 204
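
A minimal Python sketch of the agglomerative case follows. It assumes single linkage (the distance between two clusters is the distance between their closest members) and places five hypothetical objects a to e on a line; neither choice comes from the text.

```python
def agglomerative(points):
    """Return the list of merges, joining the two closest clusters at each step."""
    # Each cluster is a tuple of labels; start with one cluster per object.
    clusters = [(label,) for label in points]
    merges = []

    def distance(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(abs(points[a] - points[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        pairs = [(distance(c1, c2), i, j)
                 for i, c1 in enumerate(clusters)
                 for j, c2 in enumerate(clusters) if i < j]
        _, i, j = min(pairs)
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        # Replace the two joined clusters with the merged one.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical one-dimensional positions for five objects.
points = {"a": 0.0, "b": 1.0, "c": 5.0, "d": 8.0, "e": 9.5}
merges = agglomerative(points)
for step, cluster in enumerate(merges, start=1):
    print(step, ",".join(cluster))
```

Each merge corresponds to one node of the tree; a divisive method would produce its splits in the opposite direction, starting from the single cluster that holds every object.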

(34)

Agglomerative clustering

[Figure: objects a, b, c, d, e along an axis numbered 0 to 4; a and b are joined into cluster a,b]

Fig. 7.9 Page 206 Adapted from Kaufman and Rousseeuw (1990)

(35)

Agglomerative clustering

[Figure: objects a, b, c, d, e along an axis numbered 0 to 4; clusters so far: a,b and d,e]

Fig. 7.9 Page 206

(36)

Agglomerative clustering

[Figure: objects a, b, c, d, e along an axis numbered 0 to 4; clusters so far: a,b; d,e; c,d,e]

Fig. 7.9 Page 206

(37)

Agglomerative clustering

[Figure: objects a, b, c, d, e along an axis numbered 0 to 4; clusters: a,b; d,e; c,d,e; a,b,c,d,e]

…tree is constructed

Fig. 7.9 Page 206

(38)

Divisive clustering

[Figure: a single cluster a,b,c,d,e on an axis numbered 4 to 0]

Fig. 7.9 Page 206

(39)

Divisive clustering

[Figure: axis numbered 4 to 0; clusters so far: a,b,c,d,e; c,d,e]

Fig. 7.9 Page 206

(40)

Divisive clustering

[Figure: axis numbered 4 to 0; clusters so far: a,b,c,d,e; c,d,e; d,e]

Fig. 7.9 Page 206

(41)

Divisive clustering

[Figure: axis numbered 4 to 0; clusters so far: a,b,c,d,e; c,d,e; d,e; a,b]

Fig. 7.9 Page 206

(42)

Divisive clustering

[Figure: axis numbered 4 to 0; clusters: a,b,c,d,e; c,d,e; d,e; a,b; and the single objects a, b, c, d, e]

…tree is constructed

Fig. 7.9 Page 206

(43)

[Figure: the complete tree for objects a, b, c, d, e with clusters a,b; d,e; c,d,e; a,b,c,d,e. Two axes numbered 0 to 4 are labeled "agglomerative" and "divisive".]

Fig. 7.9 Page 206 Adapted from Kaufman and Rousseeuw (1990)

(44)

Fig. 7.8 Page 205

(45)

Fig. 7.10 Page 207

(46)

Agglomerative and divisive clustering sometimes give conflicting results, as shown here

[Figure: two trees, with labels 1 and 12 marked on each]

Fig. 7.10 Page 207

(47)

Cluster and TreeView

Fig. 7.11 Page 208

(48)

Cluster and TreeView

[Figure labels: clustering, K means, SOM, PCA]

Fig. 7.11 Page 208

(49)

Cluster and TreeView

Fig. 7.11 Page 208

(50)

Cluster and TreeView

Fig. 7.12 Page 208

(51)

Page 208

(52)

Fig. 7.12 Page 208

(53)

Fig. 7.12 Page 208

(54)

Two-way clustering of genes (y-axis) and cell lines (x-axis) (Alizadeh et al., 2000)

Fig. 7.13 Page 209

(55)

Self-organizing maps (SOM)

To download GeneCluster:

http://www.genome.wi.mit.edu/MPR/software.html

Page 210

(56)

Self-organizing maps (SOM)

One chooses a geometry of 'nodes', for example a 3x2 grid

Formerly http://www.genome.wi.mit.edu/MPR/SOM.html

Fig. 7.15 Page 211

(57)

Self-organizing maps (SOM)

The nodes are mapped into k-dimensional space, initially at random and then successively adjusted.

Fig. 7.15 Page 211

(58)

Self-organizing maps (SOM)

Fig. 7.15 Page 211

(59)

Unlike k-means clustering, which is unstructured, SOMs allow one to impose partial structure on the clusters. The principle of SOMs is as follows.

One chooses an initial geometry of “nodes” such as a 3 x 2 rectangular grid (indicated by solid lines in the figure connecting the nodes). Hypothetical trajectories of nodes as they migrate to fit data during successive iterations of the SOM algorithm are shown. Data points are represented by black dots, the six nodes of the SOM by large circles, and trajectories by arrows.

Fig. 7.15 Page 211
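
The principle can be sketched in Python as below. This is a simplified illustration, not the GeneCluster implementation: the learning-rate schedule, the Gaussian neighbourhood, the function name train_som, and the data points are all assumptions.

```python
import math
import random


def train_som(data, rows=3, cols=2, iterations=200, seed=0):
    """Fit a rows x cols grid of nodes to data points in k-dimensional space."""
    rng = random.Random(seed)
    k = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Nodes start at random positions in k-dimensional space.
    nodes = [[rng.random() for _ in range(k)] for _ in grid]
    for step in range(iterations):
        frac = 1.0 - step / iterations
        rate = 0.5 * frac                 # learning rate shrinks over time
        radius = max(rows, cols) * frac   # neighbourhood shrinks over time
        point = rng.choice(data)
        # Find the node currently closest to the chosen data point.
        winner = min(range(len(nodes)), key=lambda n: math.dist(nodes[n], point))
        for n, pos in enumerate(grid):
            grid_dist = math.dist(pos, grid[winner])
            if grid_dist <= radius:
                # Pull the node toward the point; nearer grid neighbours move more.
                pull = rate * math.exp(-grid_dist ** 2 / (2 * radius ** 2))
                nodes[n] = [w + pull * (x - w) for w, x in zip(nodes[n], point)]
    return grid, nodes


# Hypothetical two-dimensional data with values between 0 and 1.
data = [[0.1, 0.1], [0.15, 0.2], [0.8, 0.9], [0.9, 0.85], [0.5, 0.5], [0.45, 0.55]]
grid, nodes = train_som(data)
for pos, node in zip(grid, nodes):
    print(pos, [round(w, 2) for w in node])
```

Because a winning node drags its grid neighbours along with it, neighbouring nodes tend to settle near related groups of data points, which is the partial structure that k-means lacks.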

(60)

Self-organizing maps (SOM)

Neighboring nodes tend to define 'related' clusters.

An SOM based on a rectangular grid thus is analogous to an entomologist's specimen drawer in which adjacent compartments hold similar insects.

(61)

Two pre-processing steps essential to apply SOMs

1. Variation Filtering:

Data were passed through a variation filter to eliminate those genes showing no significant change in expression across the k samples. This step is needed to prevent nodes from being attracted to large sets of invariant genes.

2. Normalization:

The expression level of each gene was normalized across experiments. This focuses attention on the 'shape' of expression patterns rather than absolute levels of expression.

Page 210
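
The two steps might look like this in Python. The filter criterion (maximum minus minimum of at least min_range) and the normalization to mean 0 and standard deviation 1 are assumptions chosen for illustration; the text does not specify either, and the gene names and values are hypothetical.

```python
import statistics


def variation_filter(genes, min_range=1.0):
    """Keep genes whose maximum minus minimum across samples is at least min_range."""
    return {name: v for name, v in genes.items() if max(v) - min(v) >= min_range}


def normalize(values):
    """Rescale one gene to mean 0 and standard deviation 1 across samples."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


# Hypothetical expression values for three genes across k = 3 samples.
genes = {
    "flat": [5.0, 5.1, 5.0],
    "rising_low": [1.0, 2.0, 3.0],
    "rising_high": [100.0, 200.0, 300.0],
}
kept = variation_filter(genes)
shapes = {name: normalize(values) for name, values in kept.items()}
print(shapes)
```

The nearly invariant gene is dropped, and the two rising genes end up with the same normalized profile even though their absolute levels differ 100-fold.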

(62)

Principal components analysis (PCA), an exploratory technique that reduces data dimensionality, distinguishes lead-exposed from control cell lines

[Figure: samples plotted on principal component axis #1 (87%) and principal component axis #2 (10%); PC#3: 1%. Legend: Lead (P), Sodium (N), Control (C). Samples shown: C1, C2, C3, C4, N2, N3, N4, P1, P2, P3, P4.]

Page 211

(63)

Principal components analysis (PCA)

An exploratory technique used to reduce the dimensionality of the data set to 2D or 3D

For a matrix of m genes x n samples, create a new covariance matrix of size n x n

Thus transform some large number of variables into a smaller number of uncorrelated variables called principal components (PCs).

Page 211
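
A small Python sketch of these steps follows: build the n x n covariance matrix from an m x n data matrix, then find the first principal component. It uses power iteration to find the leading eigenvector, which is one simple choice rather than what any particular package does; the function names and the data matrix are hypothetical.

```python
import math


def covariance_matrix(matrix):
    """n x n covariance between the columns (samples) of an m x n matrix."""
    m, n = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / m for j in range(n)]
    centered = [[row[j] - means[j] for j in range(n)] for row in matrix]
    return [[sum(r[i] * r[j] for r in centered) / (m - 1) for j in range(n)]
            for i in range(n)]


def first_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of cov, found by power iteration."""
    n = len(cov)
    vec = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        new = [sum(cov[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in new))
        vec = [x / norm for x in new]
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(n))
                     for i in range(n))
    return vec, eigenvalue


# Hypothetical matrix: m = 4 genes (rows) x n = 2 samples (columns).
matrix = [[1.0, 1.2], [2.0, 1.9], [3.0, 3.1], [4.0, 3.8]]
cov = covariance_matrix(matrix)
pc1, variance1 = first_component(cov)
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained = variance1 / total_variance
print(pc1, explained)
```

The fraction in explained is the kind of percentage shown on PCA axis labels: the share of the total variance captured by that component.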

(64)

Principal components analysis (PCA): objectives

• to reduce dimensionality

• to determine the linear combination of variables

• to choose the most useful variables (features)

• to visualize multidimensional data

• to identify groups of objects (e.g. genes/samples)

• to identify outliers

Page 211

(65)

Page 212

http://www.okstate.edu/artsci/botany/ordinate/PCA.htm

(66)

Page 212

http://www.okstate.edu/artsci/botany/ordinate/PCA.htm

(67)

Page 212

http://www.okstate.edu/artsci/botany/ordinate/PCA.htm

(68)

Page 212

http://www.okstate.edu/artsci/botany/ordinate/PCA.htm

(69)

Fig. 7.16 Page 212

(70)

Fig. 7.16 Page 212

(71)

Chr 21

Use of PCA to demonstrate increased levels of gene expression from Down syndrome (trisomy 21) brain

(72)

Friday’s computer lab

Practice downloading a dataset (e.g. Chu et al. 1998) from www.dnachip.org

Try making scatter plots in Excel

Try loading the data into Avadis for advanced analyses
